anijain2305 commented 1 year ago

Dashboard to track the performance of different backends.

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

anijain2305 commented 1 year ago

Compilation Profile

The tables below show the 50 worst models for each metric.

Compilation Latency

dtype=float32, unit=seconds
~~~
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
| suite       | name                                    | batch_size | pytorch | eager  | aot_eager | aot_nvfuser | inductor_cudagraphs |
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
| huggingface | MobileBertForMaskedLM                   | 16         | 0.0     | 67.728 | 78.263    | 139.766     | 426.422             |
| huggingface | MobileBertForQuestionAnswering          | 32         | 0.0     | 66.85  | 78.347    | 138.941     | 521.547             |
| torchbench  | densenet121                             | 4          | 0.0     | 3.646  | 7.496     | 77.281      | 599.583             |
| torchbench  | mobilenet_v2_quantized_qat              | 96         | 0.0     | 3.431  | 7.875     | 70.466      | -2.83               |
| torchbench  | timm_efficientdet                       | 1          | 0.0     | 65.366 | 65.409    | 65.612      | -4.579              |
| timm_models | res2net50_14w_8s                        | 128        | 0.0     | 3.708  | 8.399     | 57.811      | 428.275             |
| timm_models | res2net101_26w_4s                       | 64         | 0.0     | 5.39   | 10.482    | 56.042      | 459.323             |
| timm_models | res2next50                              | 128        | 0.0     | 1.946  | 4.426     | 52.489      | 292.475             |
| timm_models | legacy_senet154                         | 32         | 0.0     | 5.437  | 11.439    | 51.941      | 335.474             |
| torchbench  | mobilenet_v3_large                      | 32         | 0.0     | 0.797  | 1.958     | 48.049      | 325.854             |
| timm_models | gluon_inception_v3                      | 128        | 0.0     | 1.879  | 4.232     | 46.426      | 472.191             |
| timm_models | inception_v3                            | 128        | 0.0     | 1.872  | 4.206     | 46.343      | 476.239             |
| torchbench  | resnet50_quantized_qat                  | 32         | 0.0     | 2.366  | 6.501     | 45.732      | -2.707              |
| timm_models | adv_inception_v3                        | 128        | 0.0     | 0.523  | 2.818     | 45.161      | 462.487             |
| huggingface | XLNetLMHeadModel                        | 4          | 0.0     | 13.315 | 21.675    | 38.803      | 598.55              |
| huggingface | MT5ForConditionalGeneration             | 2          | 0.0     | 13.459 | 18.646    | 37.11       | 380.514             |
| timm_models | gluon_xception65                        | 32         | 0.0     | 1.946  | 5.08      | 34.752      | 226.31              |
| torchbench  | mobilenet_v2                            | 96         | 0.0     | 0.542  | 1.57      | 33.984      | 281.45              |
| huggingface | MegatronBertForCausalLM                 | 2          | 0.0     | 17.26  | 21.586    | 33.516      | 459.281             |
| huggingface | MegatronBertForQuestionAnswering        | 8          | 0.0     | 16.706 | 21.48     | 33.387      | 598.832             |
| timm_models | selecsls42b                             | 128        | 0.0     | 0.586  | 1.621     | 31.551      | 239.668             |
| timm_models | nasnetalarge                            | 16         | 0.0     | 30.009 | 31.092    | 31.029      | -3.241              |
| torchbench  | resnet50                                | 32         | 0.0     | 0.765  | 1.95      | 29.284      | 159.181             |
| torchbench  | mnasnet1_0                              | 32         | 0.0     | 0.611  | 1.694     | 27.733      | 273.082             |
| huggingface | T5ForConditionalGeneration              | 4          | 0.0     | 7.773  | 11.156    | 27.079      | 266.568             |
| huggingface | DebertaV2ForMaskedLM                    | 1          | 0.0     | 7.919  | 12.989    | 26.708      | -1.222              |
| torchbench  | hf_T5                                   | 8          | 0.0     | 7.205  | 10.728    | 26.68       | 234.116             |
| huggingface | XGLMForCausalLM                         | 2          | 0.0     | 6.336  | 10.992    | 26.274      | 598.983             |
| huggingface | T5Small                                 | 1          | 0.0     | 7.806  | 11.216    | 26.261      | 280.438             |
| huggingface | DebertaV2ForQuestionAnswering           | 1          | 0.0     | 7.909  | 12.99     | 25.893      | -1.24               |
| huggingface | M2M100ForConditionalGeneration          | 2          | 0.0     | 6.279  | 12.378    | 25.882      | 598.719             |
| torchbench  | resnext50_32x4d                         | 8          | 0.0     | 0.773  | 1.949     | 25.454      | 141.483             |
| huggingface | PegasusForConditionalGeneration         | 4          | 0.0     | 6.235  | 11.986    | 24.865      | 582.318             |
| timm_models | pnasnet5large                           | 16         | 0.0     | 22.702 | 24.632    | 23.986      | -3.202              |
| huggingface | YituTechConvBert                        | 1          | 0.0     | 7.199  | 10.91     | 22.998      | 338.889             |
| torchbench  | LearningToPaint                         | 96         | 0.0     | 0.419  | 0.85      | 22.746      | 107.36              |
| huggingface | GPTNeoForSequenceClassification         | 1          | 0.0     | 7.333  | 12.292    | 21.577      | -1.179              |
| torchbench  | shufflenet_v2_x1_0                      | 128        | 0.0     | 0.885  | 2.245     | 21.15       | 190.619             |
| huggingface | GPTNeoForCausalLM                       | 1          | 0.0     | 7.335  | 12.174    | 21.138      | -1.16               |
| torchbench  | hf_BigBird                              | 2          | 0.0     | 8.239  | 12.636    | 20.964      | -1.487              |
| huggingface | BigBird                                 | 1          | 0.0     | 8.339  | 12.634    | 20.934      | -1.432              |
| huggingface | BlenderbotSmallForConditionalGeneration | 64         | 0.0     | 5.518  | 9.435     | 20.71       | 274.667             |
| timm_models | hrnet_w18                               | 128        | 0.0     | 18.748 | 20.474    | 20.252      | -3.612              |
| huggingface | DebertaForMaskedLM                      | 4          | 0.0     | 4.385  | 7.539     | 19.249      | 172.45              |
| torchbench  | Background_Matting                      | 4          | 0.0     | 0.025  | 0.986     | 18.871      | 135.819             |
| huggingface | DebertaForQuestionAnswering             | 4          | 0.0     | 4.343  | 7.558     | 18.614      | -1.077              |
| timm_models | dpn107                                  | 32         | 0.0     | 17.261 | 17.818    | 17.805      | -2.842              |
| huggingface | ElectraForCausalLM                      | 1          | 0.0     | 5.238  | 7.486     | 17.657      | 276.946             |
| huggingface | CamemBert                               | 1          | 0.0     | 5.146  | 7.451     | 17.356      | -0.978              |
| huggingface | LayoutLMForMaskedLM                     | 16         | 0.0     | 5.342  | 7.731     | 17.14       | 214.194             |
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
~~~

Peak Memory

dtype=float32, unit=GB
~~~
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
| suite       | name                                    | batch_size | pytorch | eager  | aot_eager | aot_nvfuser | inductor_cudagraphs |
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
| torchbench  | vgg16                                   | 64         | 0.0     | 0.0    | 3.148     | 3.147       | 1.005               |
| torchbench  | hf_T5                                   | 8          | 0.0     | 0.0    | 1.749     | 2.566       | 3.397               |
| timm_models | res2next50                              | 128        | 0.0     | 0.0    | 1.415     | 2.101       | 5.326               |
| timm_models | res2net50_14w_8s                        | 128        | 0.0     | 0.0    | 1.572     | 2.036       | 4.705               |
| huggingface | BlenderbotSmallForCausalLM              | 64         | 0.0     | 0.0    | 1.916     | 1.92        | 4.27                |
| huggingface | AlbertForMaskedLM                       | 2          | 0.0     | 0.0    | 0.954     | 1.844       | 1.231               |
| timm_models | gluon_inception_v3                      | 128        | 0.0     | 0.0    | 2.006     | 1.816       | 2.53                |
| timm_models | adv_inception_v3                        | 128        | 0.0     | 0.0    | 2.006     | 1.816       | 2.528               |
| timm_models | inception_v3                            | 128        | 0.0     | 0.0    | 2.006     | 1.816       | 2.529               |
| huggingface | BlenderbotSmallForConditionalGeneration | 64         | 0.0     | 0.0    | 1.664     | 1.668       | 4.141               |
| huggingface | AlbertForQuestionAnswering              | 2          | 0.0     | 0.0    | 0.705     | 1.595       | 0.697               |
| timm_models | gluon_xception65                        | 32         | 0.0     | 0.0    | 0.908     | 1.546       | 0.327               |
| huggingface | XLNetLMHeadModel                        | 4          | 0.0     | 0.0    | 1.514     | 1.531       | -10.373             |
| torchbench  | hf_Albert                               | 8          | 0.0     | 0.0    | 0.356     | 1.459       | -0.749              |
| huggingface | BartForCausalLM                         | 4          | 0.0     | 0.0    | 1.227     | 1.244       | 4.418               |
| timm_models | res2net101_26w_4s                       | 64         | 0.0     | 0.0    | 0.848     | 1.111       | 2.468               |
| timm_models | legacy_senet154                         | 32         | 0.0     | 0.0    | 0.989     | 1.106       | 0.095               |
| torchbench  | hf_Bart                                 | 4          | 0.0     | -0.0   | 1.026     | 1.035       | 1.541               |
| huggingface | LayoutLMForMaskedLM                     | 16         | 0.0     | 0.0    | 1.0       | 1.0         | 2.144               |
| huggingface | BertForMaskedLM                         | 64         | 0.0     | 0.0    | 1.0       | 0.975       | 2.107               |
| huggingface | T5ForConditionalGeneration              | 4          | 0.0     | 0.0    | 0.736     | 0.944       | 2.519               |
| torchbench  | timm_nfnet                              | 128        | 0.0     | 0.891  | 0.89      | 0.89        | -13.257             |
| timm_models | dm_nfnet_f0                             | 128        | 0.0     | 0.891  | 0.89      | 0.89        | -13.257             |
| torchbench  | Background_Matting                      | 4          | 0.0     | -0.03  | 0.586     | 0.865       | 0.999               |
| huggingface | MBartForCausalLM                        | 16         | 0.0     | 0.0    | 0.819     | 0.82        | 3.195               |
| huggingface | MegatronBertForQuestionAnswering        | 8          | 0.0     | 0.0    | 0.797     | 0.797       | -3.993              |
| huggingface | TrOCRForCausalLM                        | 8          | 0.0     | 0.0    | 0.75      | 0.75        | 2.531               |
| torchbench  | pytorch_struct                          | 200        | 0.0     | 0.0    | 0.682     | 0.682       | 0.05                |
| torchbench  | resnet50                                | 32         | 0.0     | 0.0    | 0.438     | 0.673       | 1.107               |
| huggingface | LayoutLMForSequenceClassification       | 16         | 0.0     | 0.025  | 0.885     | 0.658       | 0.847               |
| timm_models | selecsls42b                             | 128        | 0.0     | 0.076  | 0.695     | 0.649       | 1.965               |
| torchbench  | pytorch_unet                            | 1          | 0.0     | -0.0   | 0.623     | 0.567       | 0.667               |
| huggingface | MT5ForConditionalGeneration             | 2          | 0.0     | 0.0    | 0.622     | 0.536       | 3.445               |
| huggingface | T5Small                                 | 1          | 0.0     | 0.0    | 0.372     | 0.532       | 1.144               |
| huggingface | MobileBertForQuestionAnswering          | 32         | 0.0     | 0.0    | 0.084     | 0.502       | 0.78                |
| huggingface | PLBartForCausalLM                       | 16         | 0.0     | 0.0    | 0.485     | 0.486       | 1.604               |
| huggingface | ElectraForQuestionAnswering             | 64         | 0.0     | 0.0    | 0.716     | 0.448       | -0.436              |
| torchbench  | hf_Bert                                 | 4          | 0.0     | 0.0    | 0.496     | 0.447       | 1.195               |
| huggingface | CamemBert                               | 1          | 0.0     | -0.003 | 0.445     | 0.447       | -1.415              |
| huggingface | RobertaForQuestionAnswering             | 64         | 0.0     | 0.0    | 0.444     | 0.443       | 0.78                |
| huggingface | BertForQuestionAnswering                | 64         | 0.0     | 0.0    | 0.444     | 0.443       | 0.779               |
| huggingface | Speech2Text2ForCausalLM                 | 64         | 0.0     | 0.101  | 0.428     | 0.433       | 1.004               |
| torchbench  | LearningToPaint                         | 96         | 0.0     | 0.021  | 0.358     | 0.401       | 0.54                |
| huggingface | YituTechConvBert                        | 1          | 0.0     | 0.0    | 0.382     | 0.39        | 1.458               |
| torchbench  | hf_DistilBert                           | 8          | 0.0     | 0.0    | 0.484     | 0.373       | 0.943               |
| torchbench  | shufflenet_v2_x1_0                      | 128        | 0.0     | 0.0    | 0.266     | 0.37        | 0.378               |
| huggingface | MobileBertForMaskedLM                   | 16         | 0.0     | 0.0    | 0.25      | 0.352       | 0.97                |
| torchbench  | mnasnet1_0                              | 32         | 0.0     | 0.0    | 0.149     | 0.3         | 0.358               |
| huggingface | DistillGPT2                             | 1          | 0.0     | 0.003  | 0.408     | 0.29        | 1.164               |
| timm_models | convmixer_768_32                        | 32         | 0.0     | 0.0    | 0.179     | 0.265       | 0.154               |
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
~~~

Number of graphs

dtype=float32, unit=graphs
~~~
+-------------+-----------------------------------+------------+--------+
| suite       | name                              | batch_size | graphs |
+-------------+-----------------------------------+------------+--------+
| huggingface | DebertaV2ForMaskedLM              | 1          | 304.0  |
| huggingface | DebertaV2ForQuestionAnswering     | 1          | 304.0  |
| huggingface | DebertaForMaskedLM                | 4          | 204.0  |
| huggingface | DebertaForQuestionAnswering       | 4          | 204.0  |
| huggingface | BigBird                           | 1          | 64.0   |
| torchbench  | hf_BigBird                        | 2          | 64.0   |
| timm_models | convit_base                       | 32         | 27.0   |
| huggingface | GoogleFnet                        | 1          | 27.0   |
| torchbench  | hf_Reformer                       | 4          | 22.0   |
| timm_models | densenet121                       | 64         | 14.0   |
| torchbench  | moco                              | 32         | 11.0   |
| huggingface | PegasusForConditionalGeneration   | 4          | 7.0    |
| torchbench  | fastNLP_Bert                      | 6          | 10.0   |
| huggingface | M2M100ForConditionalGeneration    | 2          | 7.0    |
| torchbench  | tts_angular                       | 64         | 4.0    |
| torchbench  | speech_transformer                | 32         | 4.0    |
| huggingface | Speech2Text2ForCausalLM           | 64         | 4.0    |
| huggingface | XGLMForCausalLM                   | 2          | 4.0    |
| huggingface | PegasusForCausalLM                | 8          | 4.0    |
| timm_models | crossvit_9_240                    | 64         | 2.0    |
| timm_models | eca_botnext26ts_256               | 128        | 2.0    |
| timm_models | gluon_xception65                  | 32         | 2.0    |
| timm_models | gluon_senet154                    | 32         | 2.0    |
| timm_models | gluon_inception_v3                | 128        | 2.0    |
| timm_models | ghostnet_100                      | 128        | 2.0    |
| timm_models | gernet_l                          | 128        | 2.0    |
| timm_models | fbnetv3_b                         | 128        | 2.0    |
| timm_models | fbnetc_100                        | 128        | 2.0    |
| timm_models | ese_vovnet19b_dw                  | 128        | 2.0    |
| timm_models | adv_inception_v3                  | 128        | 2.0    |
| timm_models | ecaresnet101d                     | 64         | 2.0    |
| timm_models | eca_halonext26ts                  | 128        | 2.0    |
| timm_models | beit_base_patch16_224             | 64         | 2.0    |
| huggingface | LayoutLMForSequenceClassification | 16         | 2.0    |
| timm_models | botnet26t_256                     | 128        | 2.0    |
| timm_models | cait_m36_384                      | 2          | 2.0    |
| timm_models | coat_lite_mini                    | 128        | 2.0    |
| timm_models | dpn107                            | 32         | 2.0    |
| timm_models | dm_nfnet_f0                       | 128        | 2.0    |
| timm_models | convmixer_768_32                  | 32         | 2.0    |
| timm_models | dla102                            | 64         | 2.0    |
| timm_models | gmlp_s16_224                      | 64         | 2.0    |
| timm_models | deit_base_distilled_patch16_224   | 64         | 2.0    |
| huggingface | GPT2ForSequenceClassification     | 4          | 2.0    |
| timm_models | cspdarknet53                      | 64         | 2.0    |
| huggingface | GPTNeoForSequenceClassification   | 1          | 2.0    |
| timm_models | convnext_base                     | 32         | 2.0    |
| timm_models | gmixer_24_224                     | 64         | 2.0    |
| timm_models | xcit_large_24_p8_224              | 5          | 2.0    |
| timm_models | res2next50                        | 128        | 2.0    |
+-------------+-----------------------------------+------------+--------+
~~~

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) The experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

Before measuring performance, compilation latency, and memory footprint reduction, we remove the models that fail the accuracy checks.
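
The filtering step described above could be sketched roughly as follows. This is only an illustration, not the dashboard's actual code: the dictionary layout, the `drop_failed_models` helper, and the choice to treat `pass_due_to_skip` as passing are all assumptions. The status strings and the sample inductor numbers are taken from the torchbench tables in this report.

```python
# Hypothetical sketch: drop models that fail the accuracy check before
# aggregating any metric. Names and data layout are assumptions.
PASSING = {"pass", "pass_due_to_skip"}  # assumption: skips count as passing


def drop_failed_models(metrics, accuracy):
    """Keep only the models whose accuracy status is a passing one."""
    return {
        name: value
        for name, value in metrics.items()
        if accuracy.get(name) in PASSING
    }


# Sample inductor entries from the torchbench tables in this report.
speedups = {"resnet18": 1.7819, "densenet121": 4.5603, "tacotron2": 0.0}
accuracy = {"resnet18": "pass", "densenet121": "pass", "tacotron2": "fail_to_run"}

print(drop_failed_models(speedups, accuracy))
```

Models missing from the accuracy mapping are dropped as well, which is the conservative choice for a sketch like this.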

Passrate

+----------------+-------------+-------------+-------------+
|    Compiler    | torchbench  | huggingface | timm_models |
+----------------+-------------+-------------+-------------+
|     eager      | 100%, 55/55 | 93%, 41/44  | 100%, 61/61 |
|   aot_eager    | 98%, 54/55  | 93%, 41/44  | 90%, 55/61  |
| aot_cudagraphs | 29%, 16/55  |  0%, 0/44   |  0%, 0/61   |
|  aot_nvfuser   | 62%, 34/55  |  2%, 1/44   | 82%, 50/61  |
|    inductor    | 87%, 48/55  | 77%, 34/44  | 74%, 45/61  |
+----------------+-------------+-------------+-------------+
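
Each cell in the pass-rate table pairs a rounded percentage with the raw counts. A minimal sketch of that formatting, assuming a hypothetical `format_passrate` helper (the dashboard's real formatting code is not shown in this thread):

```python
def format_passrate(passed, total):
    """Format a pass rate as '<percent>%, <passed>/<total>'."""
    if total == 0:
        # Assumption: an empty suite is reported as 0% rather than raising.
        return "0%, 0/0"
    return f"{passed / total:.0%}, {passed}/{total}"


# Counts for inductor taken from the table above.
print(format_passrate(48, 55))  # torchbench
print(format_passrate(34, 44))  # huggingface
print(format_passrate(45, 61))  # timm_models
```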

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.02x    |    0.0x     |    0.0x     |
|  aot_nvfuser   |   1.12x    |    1.12x    |    1.12x    |
|    inductor    |   1.38x    |    1.60x    |    1.23x    |
+----------------+------------+-------------+-------------+
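
For reference, a geometric mean over per-model speedups can be computed as in the sketch below. The `geometric_mean_speedup` helper is hypothetical. It assumes that a speedup of 0.0 marks a failed run that is excluded from the mean, and that a backend with no passing model is reported as 0.0, which would match the `0.0x` cells above; the thread does not confirm either detail.

```python
import math


def geometric_mean_speedup(speedups):
    """Geometric mean of the positive speedups; 0.0 if there are none."""
    # Assumption: non-positive entries mark failed runs and are skipped.
    passing = [s for s in speedups if s > 0]
    if not passing:
        return 0.0
    # Average in log space, then map back with exp.
    return math.exp(sum(math.log(s) for s in passing) / len(passing))


print(geometric_mean_speedup([2.0, 8.0, 0.0]))
```

The geometric mean is the usual choice for ratios such as speedups because a 2x gain and a 0.5x loss cancel to 1x, which an arithmetic mean would not give.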

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    5.68    |    13.69    |    11.39    |
|   aot_eager    |   10.31    |    20.58    |    17.02    |
| aot_cudagraphs |    4.47    |     0.0     |     0.0     |
|  aot_nvfuser   |   21.51    |    10.59    |    57.77    |
|    inductor    |   278.25   |   120.52    |   427.42    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.87x    |    0.88x    |    0.88x    |
| aot_cudagraphs |   0.48x    |    0.0x     |    0.0x     |
|  aot_nvfuser   |   0.84x    |    1.08x    |    0.85x    |
|    inductor    |   0.79x    |    0.74x    |    0.90x    |
+----------------+------------+-------------+-------------+
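
The report does not spell out how the compression ratio is defined. One reading consistent with "higher is better" is the native PyTorch peak memory divided by the backend's peak memory; the sketch below uses that assumed definition, and the `compression_ratio` helper is hypothetical.

```python
def compression_ratio(baseline_peak_gb, backend_peak_gb):
    """Baseline peak memory divided by backend peak memory (assumed definition)."""
    if backend_peak_gb <= 0:
        raise ValueError("backend peak memory must be positive")
    return baseline_peak_gb / backend_peak_gb


# Illustrative numbers only, not measurements from this report.
print(compression_ratio(10.0, 8.0))  # backend uses less memory, ratio above 1
print(compression_ratio(8.0, 10.0))  # backend uses more memory, ratio below 1
```

Under this reading, a ratio of 0.79x would mean the backend's peak memory is roughly 1.27 times the baseline's.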

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | densenet121 | 4 | 0.9976 | 1.0092 | 0.0 | 1.4538 | 4.5603 | | timm_efficientdet | 1 | 0.9817 | 0.8908 | 0.0 | 0.0 | 3.8319 | | functorch_dp_cifar10 | 64 | 1.0004 | 0.9835 | 0.0 | 1.2001 | 3.7742 | | timm_vision_transformer | 8 | 0.9983 | 0.9452 | 0.0 | 1.3452 | 2.5363 | | drq | 1 | 1.0117 | 0.826 | 0.0 | 1.0725 | 2.4186 | | BERT_pytorch | 16 | 1.0094 | 0.8856 | 0.0 | 0.0 | 2.03 | | resnet18 | 16 | 1.0049 | 1.1155 | 0.0 | 1.3986 | 1.7819 | | pytorch_struct | 200 | 0.9963 | 0.7395 | 0.8854 | 0.8963 | 1.7657 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9955 | 0.9348 | 1.1291 | 1.1909 | 1.7242 | | lennard_jones | 1000 | 0.974 | 0.8405 | 1.0627 | 1.0207 | 1.7135 | | hf_Albert | 8 | 1.0013 | 0.9978 | 0.0 | 0.0 | 1.6628 | | squeezenet1_1 | 32 | 1.0006 | 1.0042 | 1.0435 | 1.1661 | 1.6351 | | dcgan | 32 | 0.9954 | 1.02 | 1.088 | 1.1569 | 1.6229 | | resnext50_32x4d | 8 | 1.0027 | 1.0793 | 0.0 | 1.3534 | 1.5568 | | speech_transformer | 32 | 1.003 | 0.8984 | 0.0 | 0.0 | 1.4906 | | timm_nfnet | 128 | 0.9995 | 0.9997 | 0.0 | 1.2116 | 1.4697 | | mobilenet_v3_large | 32 | 1.0053 | 1.121 | 0.0 | 1.3848 | 1.4662 | | hf_GPT2 | 4 | 1.0053 | 0.9748 | 0.0 | 0.0 | 1.4228 | | hf_T5_large | 2 | 1.0242 | 0.8958 | 0.0 | 0.0 | 1.4145 | | soft_actor_critic | 256 | 0.9952 | 0.7978 | 1.0393 | 1.0108 | 1.3816 | | fastNLP_Bert | 6 | 0.999 | 0.9749 | 0.0 | 0.0 | 1.3503 | | pytorch_unet | 1 | 0.9996 | 0.9969 | 0.0 | 1.0758 | 1.2042 | | LearningToPaint | 96 | 1.0045 | 1.0546 | 0.0 | 1.2423 | 1.2032 | | hf_Bart | 4 | 1.0118 | 0.974 | 0.0 | 0.0 | 1.1751 | | Super_SloMo | 6 | 0.9999 | 0.9977 | 0.0 | 0.0 | 1.1742 | | vgg16 | 64 | 1.0 | 0.9986 | 0.7923 | 0.9962 | 1.1703 | | hf_Bert | 4 | 1.0269 | 
0.9881 | 0.0 | 0.0 | 1.1642 | | alexnet | 128 | 0.9984 | 0.9988 | 0.777 | 1.0007 | 1.162 | | mnasnet1_0 | 32 | 1.001 | 1.1017 | 0.7035 | 1.3033 | 1.1612 | | hf_DistilBert | 8 | 0.9997 | 0.9542 | 0.0 | 0.0 | 1.1537 | | Background_Matting | 4 | 0.9996 | 1.0229 | 0.0 | 1.08 | 1.1159 | | pytorch_stargan | 16 | 0.9994 | 0.9836 | 0.7288 | 0.9873 | 1.1151 | | hf_Reformer | 4 | 0.9963 | 0.0 | 0.8939 | 0.0 | 1.1098 | | hf_BigBird | 2 | 0.985 | 0.9444 | 0.0 | 0.0 | 1.0887 | | shufflenet_v2_x1_0 | 128 | 1.0011 | 1.0504 | 0.0 | 1.1836 | 1.0756 | | timm_efficientnet | 32 | 0.9543 | 0.816 | 0.0 | 1.0788 | 1.0728 | | timm_vision_transformer_large | 8 | 0.9992 | 0.9936 | 0.0 | 0.9822 | 1.0534 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.9708 | 0.0 | 0.0 | 1.0469 | | timm_resnest | 32 | 0.9996 | 1.0033 | 0.0 | 1.1829 | 1.0289 | | tts_angular | 64 | 0.9959 | 0.9672 | 0.9836 | 0.9982 | 1.0112 | | demucs | 4 | 1.0003 | 1.0002 | 0.9997 | 1.0006 | 1.0 | | mobilenet_v2_quantized_qat | 96 | 0.9992 | 0.9996 | 0.999 | 0.9988 | 0.999 | | resnet50_quantized_qat | 32 | 0.9975 | 0.9984 | 0.9983 | 0.9987 | 0.9984 | | dlrm | 2048 | 0.9692 | 0.9785 | 0.0 | 0.0 | 0.9604 | | mobilenet_v2 | 96 | 0.9993 | 0.9979 | 0.0 | 1.0437 | 0.9574 | | timm_vovnet | 32 | 0.9073 | 0.9025 | 0.0 | 1.0018 | 0.91 | | nvidia_deeprecommender | 256 | 0.9993 | 0.9629 | 0.5845 | 0.9425 | 0.9044 | | moco | 32 | 0.9947 | 1.0484 | 0.0 | 0.0 | 0.7591 | | timm_regnet | 32 | 0.9652 | 0.9636 | 0.0 | 1.0932 | 0.7378 | | resnet50 | 32 | 0.9987 | 0.9933 | 0.0 | 1.161 | 0.7127 | | yolov3 | 16 | 0.9996 | 0.9945 | 0.0 | 1.1838 | 0.0 | | hf_Longformer | 2 | 0.969 | 0.899 | 0.8164 | 0.0 | 0.0 | | hf_T5 | 8 | 0.9985 | 0.9942 | 0.0 | 0.0 | 0.0 | | hf_GPT2_large | 4 | 0.9996 | 0.9801 | 0.0 | 0.0 | 0.0 | | tacotron2 | 64 | 0.9791 | 0.8546 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ 
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | alexnet | 2 | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | | mobilenet_v2_quantized_qat | 2 | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | | resnet50_quantized_qat | 2 | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | fail_to_run | pass | pass | | densenet121 | 2 | pass | pass | fail_to_run | pass | pass | | drq | 1 | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenet_v2 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenet_v3_large | 2 | pass | pass | fail_to_run | pass | pass | | pytorch_unet | 
2 | pass | pass | fail_to_run | pass | pass | | resnet18 | 2 | pass | pass | fail_to_run | pass | pass | | resnet50 | 2 | pass | pass | fail_to_run | pass | pass | | resnext50_32x4d | 2 | pass | pass | fail_to_run | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | fail_to_run | pass | pass | | timm_vovnet | 2 | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | Super_SloMo | 2 | pass | pass | fail_to_run | fail_to_run | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | | fastNLP_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Albert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_BigBird | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_DistilBert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_GPT2 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | | Background_Matting | 4 | pass | pass | fail_to_run | pass | fail_to_run | | hf_Longformer | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | pass | pass | fail_to_run | 
fail_to_run | fail_to_run | | tacotron2 | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | yolov3 | 2 | pass | pass | fail_to_run | fail_to_run | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+-----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+---------+-----------+----------------+-------------+-----------+ | timm_efficientdet | 1 | 50.9164 | 70.3788 | nan | nan | 1855.9202 | | densenet121 | 4 | 13.1067 | 25.4059 | nan | 101.5015 | 1599.4226 | | hf_T5_large | 2 | 35.7166 | 66.5562 | nan | nan | 1154.4563 | | mnasnet1_0 | 32 | 3.1383 | 7.0386 | 23.5784 | 33.4187 | 924.0881 | | mobilenet_v3_large | 32 | 3.6197 | 7.569 | nan | 55.8228 | 815.6827 | | moco | 32 | 11.4915 | 16.8868 | nan | nan | 792.5782 | | mobilenet_v2 | 96 | 3.069 | 6.6873 | nan | 39.0419 | 673.3655 | | resnext50_32x4d | 8 | 3.3393 | 7.3876 | nan | 31.0213 | 626.7237 | | timm_efficientnet | 32 | 5.8246 | 10.4379 | nan | 56.7643 | 573.6101 | | shufflenet_v2_x1_0 | 128 | 3.6097 | 8.0917 | nan | 29.4511 | 415.0044 | | squeezenet1_1 | 32 | 0.6275 | 1.3124 | 3.1679 | 4.8972 | 379.8186 | | timm_resnest | 32 | 1.351 | 3.4723 | nan | 36.2388 | 362.8361 | | timm_regnet | 32 | 8.274 | 14.2127 | nan | 53.5289 | 335.4974 | | attention_is_all_you_need_pytorch | 256 | 4.2332 | 10.1412 | nan | nan | 269.6108 | | speech_transformer | 32 | 7.1452 | 13.5568 | nan | nan | 259.7565 | | timm_vovnet | 32 | 2.8909 | 6.1661 | nan | 25.6462 | 255.4935 | | functorch_dp_cifar10 | 64 | 0.7904 | 2.0933 | nan | 5.6355 | 208.4064 | | timm_vision_transformer | 8 | 2.9873 | 6.3471 | nan | 11.3264 | 200.3176 | | resnet18 | 16 | 0.9362 
| 2.4353 | nan | 18.0277 | 195.5902 | | timm_vision_transformer_large | 8 | 22.2765 | 34.0841 | nan | 44.7332 | 189.7259 | | Background_Matting | 4 | 3.6941 | 7.5331 | nan | 32.8015 | 183.8065 | | BERT_pytorch | 16 | 4.8027 | 10.7586 | nan | nan | 183.4356 | | LearningToPaint | 96 | 0.9741 | 2.5194 | nan | 24.5849 | 178.7819 | | resnet50 | 32 | 3.2773 | 7.4179 | nan | 35.0054 | 175.1635 | | hf_Bart | 4 | 7.0179 | 13.1991 | nan | nan | 163.5884 | | fastNLP_Bert | 6 | 5.0044 | 9.9284 | nan | nan | 153.2715 | | hf_GPT2 | 4 | 3.4005 | 7.8867 | nan | nan | 149.3391 | | timm_nfnet | 128 | 6.6204 | 11.8892 | nan | 34.5324 | 136.2766 | | pytorch_stargan | 16 | 0.8038 | 2.764 | 9.5008 | 4.2834 | 128.577 | | pytorch_struct | 200 | 0.3903 | 0.9288 | 1.4439 | 4.2379 | 106.0572 | | Super_SloMo | 6 | 2.1703 | 5.8559 | nan | nan | 93.2015 | | hf_Bert | 4 | 4.9761 | 9.5568 | nan | nan | 82.4714 | | hf_Albert | 8 | 1.1045 | 5.7238 | nan | nan | 80.7431 | | hf_Reformer | 4 | 2.9996 | nan | 13.0539 | nan | 77.5866 | | pytorch_unet | 1 | 1.0533 | 2.812 | nan | 20.2914 | 64.8277 | | hf_BigBird | 2 | 10.9112 | 16.7734 | nan | nan | 61.4215 | | hf_DistilBert | 8 | 1.57 | 3.9675 | nan | nan | 54.3823 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.7416 | 2.5789 | 7.9558 | 4.1967 | 35.9234 | | vgg16 | 64 | 0.3047 | 0.7761 | 2.3197 | 2.7169 | 22.3816 | | dlrm | 2048 | 0.5927 | 0.9744 | nan | nan | 18.9655 | | drq | 1 | 0.2629 | 0.5431 | nan | 3.5281 | 18.6718 | | alexnet | 128 | 0.2274 | 0.5093 | 1.2017 | 2.4621 | 17.8106 | | dcgan | 32 | 0.2487 | 0.5048 | 1.23 | 3.8262 | 16.923 | | nvidia_deeprecommender | 256 | 0.2588 | 0.4766 | 0.7503 | 2.4694 | 12.556 | | soft_actor_critic | 256 | 0.2557 | 0.3887 | 0.5963 | 1.5565 | 12.2199 | | lennard_jones | 1000 | 0.2225 | 0.3672 | 0.5077 | 1.1334 | 5.8665 | | tts_angular | 64 | 0.3106 | 0.3618 | 0.4935 | 1.0926 | 4.687 | | resnet50_quantized_qat | 32 | 2.5256 | 2.4992 | 2.5283 | 2.4885 | 2.434 | | mobilenet_v2_quantized_qat | 96 | 2.4664 | 2.4212 | 2.3653 
| 2.3407 | 2.3672 | | demucs | 4 | 0.8026 | 0.8012 | 0.8095 | 0.8135 | 0.7167 | | yolov3 | 16 | 7.1951 | 13.031 | nan | 47.4947 | nan | | hf_Longformer | 2 | 11.5662 | 18.8685 | 84.9383 | nan | nan | | hf_GPT2_large | 4 | 21.0635 | 34.9334 | nan | nan | nan | | tacotron2 | 64 | 13.5662 | 26.3055 | nan | nan | nan | | hf_T5 | 8 | 3.7864 | 10.4607 | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-------------+-----------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | Super_SloMo | 6 | 1.0024 | 0.956 | nan | nan | 1.1855 | | timm_efficientnet | 32 | 0.9998 | 0.7704 | nan | 0.7845 | 1.0652 | | timm_nfnet | 128 | 0.9393 | 0.897 | nan | 0.9515 | 1.022 | | timm_efficientdet | 1 | 1.0142 | 0.8251 | nan | nan | 1.0218 | | resnet50_quantized_qat | 32 | 0.9967 | 0.9967 | 0.9967 | 0.9967 | 1.0001 | | mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.9957 | 0.9957 | 0.9957 | 0.9992 | | mobilenet_v2 | 96 | 0.9993 | 0.7661 | nan | 0.7676 | 0.9975 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.984 | 0.9884 | 0.9842 | | hf_GPT2 | 4 | 0.9548 | 0.887 | nan | nan | 0.9505 | | Background_Matting | 4 | 1.0026 | 0.952 | nan | 0.9773 | 0.9139 | | pytorch_stargan | 16 | 0.9975 | 1.019 | 0.2027 | 1.0085 | 0.9023 | | speech_transformer | 32 | 0.9988 | 0.9152 | nan | nan | 0.8959 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9173 | 0.2326 | 0.9114 | 0.8941 | | hf_Albert | 8 | 0.9333 | 0.9333 | nan | nan | 0.8804 | | pytorch_unet | 1 | 0.9985 | 0.8536 | nan | 0.851 | 0.859 | | hf_Bart | 4 | 0.9617 | 0.878 | nan | nan | 0.853 | | hf_Bert | 4 | 0.9683 | 0.8952 | nan | nan | 0.8517 | | timm_regnet | 32 | 
1.0013 | 0.8634 | nan | 0.8806 | 0.8481 | | shufflenet_v2_x1_0 | 128 | 1.0 | 0.9163 | nan | 0.8868 | 0.8447 | | fastNLP_Bert | 6 | 1.0012 | 0.9152 | nan | nan | 0.8343 | | attention_is_all_you_need_pytorch | 256 | 0.9481 | 0.9241 | nan | nan | 0.8261 | | timm_vovnet | 32 | 0.9933 | 0.7644 | nan | 0.7778 | 0.8252 | | BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 | | hf_T5_large | 2 | 0.922 | 0.8722 | nan | nan | 0.8237 | | hf_BigBird | 2 | 0.9609 | 0.9609 | nan | nan | 0.8205 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.2781 | 0.9742 | 0.8159 | | hf_DistilBert | 8 | 0.9212 | 0.9053 | nan | nan | 0.7841 | | dcgan | 32 | 1.0 | 0.7784 | 0.3321 | 0.7784 | 0.767 | | moco | 32 | 1.0067 | 0.9701 | nan | nan | 0.7668 | | alexnet | 128 | 0.9998 | 0.7731 | 0.3805 | 0.7736 | 0.743 | | mnasnet1_0 | 32 | 0.9988 | 0.9087 | 0.1627 | 0.8348 | 0.7268 | | resnet50 | 32 | 1.0002 | 0.8763 | nan | 0.8011 | 0.7254 | | timm_vision_transformer_large | 8 | 1.0022 | 0.8433 | nan | 0.8015 | 0.7222 | | timm_vision_transformer | 8 | 1.0 | 0.8883 | nan | 0.8108 | 0.712 | | mobilenet_v3_large | 32 | 0.9958 | 0.8655 | nan | 0.8773 | 0.7041 | | dlrm | 2048 | 0.7282 | 0.7283 | nan | nan | 0.6973 | | timm_resnest | 32 | 0.9935 | 0.8869 | nan | 0.8075 | 0.6862 | | densenet121 | 4 | 1.0 | 0.8812 | nan | 0.8571 | 0.6618 | | resnext50_32x4d | 8 | 0.9994 | 0.8687 | nan | 0.8223 | 0.6615 | | vgg16 | 64 | 1.0 | 0.6663 | 0.2532 | 0.6664 | 0.6471 | | LearningToPaint | 96 | 0.9442 | 0.6918 | nan | 0.6272 | 0.6444 | | soft_actor_critic | 256 | 0.964 | 0.964 | 0.4356 | 0.9555 | 0.6428 | | drq | 1 | 0.8541 | 0.8541 | nan | 0.8541 | 0.6427 | | resnet18 | 16 | 0.9846 | 0.7907 | nan | 0.7038 | 0.6163 | | lennard_jones | 1000 | 1.0 | 1.0 | 0.3712 | 1.0947 | 0.5646 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4734 | 0.5598 | 0.5598 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | | functorch_dp_cifar10 | 64 | 0.9626 | 0.8251 | nan | 0.8254 | 0.4037 | | hf_Reformer | 4 | 0.3011 | 
nan | 0.1803 | nan | 0.299 | | yolov3 | 16 | 1.0072 | 0.8533 | nan | 0.8915 | nan | | hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2879 | nan | nan | | tacotron2 | 64 | 0.9922 | 1.1046 | nan | nan | nan | | hf_T5 | 8 | 0.9527 | 0.9446 | nan | nan | nan | | hf_GPT2_large | 4 | 0.936 | 0.8771 | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ ~~~

huggingface suite with float32 precision

Performance speedup

~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| MT5ForConditionalGeneration | 2 | 1.0268 | 0.919 | 0.0 | 0.0 | 4.2611 |
| ElectraForCausalLM | 1 | 1.0463 | 0.9209 | 0.0 | 0.0 | 4.1573 |
| YituTechConvBert | 1 | 1.0326 | 0.9386 | 0.0 | 0.0 | 3.1049 |
| MegatronBertForCausalLM | 2 | 1.043 | 0.943 | 0.0 | 0.0 | 2.8277 |
| RobertaForCausalLM | 4 | 1.0398 | 0.9419 | 0.0 | 0.0 | 2.7707 |
| MobileBertForMaskedLM | 16 | 1.0228 | 0.919 | 0.0 | 0.0 | 2.6211 |
| M2M100ForConditionalGeneration | 2 | 1.1203 | 1.0476 | 0.0 | 0.0 | 2.584 |
| OPTForCausalLM | 4 | 1.0181 | 0.9027 | 0.0 | 0.0 | 2.5794 |
| XGLMForCausalLM | 1 | 1.0148 | 0.8733 | 0.0 | 0.0 | 2.4465 |
| PegasusForConditionalGeneration | 4 | 1.0132 | 0.883 | 0.0 | 0.0 | 2.4111 |
| MobileBertForQuestionAnswering | 32 | 1.0191 | 0.9141 | 0.0 | 0.0 | 2.3065 |
| CamemBert | 1 | 1.046 | 0.945 | 0.0 | 0.0 | 2.2963 |
| DistillGPT2 | 1 | 1.0351 | 0.9295 | 0.0 | 0.0 | 2.0155 |
| PLBartForConditionalGeneration | 8 | 1.0177 | 0.8977 | 0.0 | 0.0 | 1.8483 |
| GoogleFnet | 1 | 1.0022 | 0.8086 | 0.0 | 1.1178 | 1.7839 |
| GPT2ForSequenceClassification | 4 | 0.9991 | 0.977 | 0.0 | 0.0 | 1.6644 |
| MegatronBertForQuestionAnswering | 8 | 1.0461 | 0.9419 | 0.0 | 0.0 | 1.6081 |
| MBartForConditionalGeneration | 8 | 1.0126 | 0.916 | 0.0 | 0.0 | 1.4634 |
| XLNetLMHeadModel | 4 | 0.9991 | 0.9656 | 0.0 | 0.0 | 1.4289 |
| PegasusForCausalLM | 8 | 1.0088 | 0.9262 | 0.0 | 0.0 | 1.3581 |
| T5ForConditionalGeneration | 4 | 1.002 | 0.9661 | 0.0 | 0.0 | 1.349 |
| TrOCRForCausalLM | 8 | 1.0149 | 0.9561 | 0.0 | 0.0 | 1.337 |
| AlbertForQuestionAnswering | 2 | 1.0 | 0.9999 | 0.0 | 0.0 | 1.3032 |
| AlbertForMaskedLM | 2 | 1.0008 | 0.9987 | 0.0 | 0.0 | 1.2986 |
| Speech2Text2ForCausalLM | 64 | 1.0087 | 0.9398 | 0.0 | 0.0 | 1.2936 |
| LayoutLMForSequenceClassification | 16 | 0.9991 | 0.9865 | 0.0 | 0.0 | 1.2471 |
| T5Small | 1 | 1.0201 | 0.9507 | 0.0 | 0.0 | 1.2467 |
| BartForConditionalGeneration | 1 | 1.0117 | 0.8919 | 0.0 | 0.0 | 1.2102 |
| DistilBertForQuestionAnswering | 32 | 1.0287 | 0.9788 | 0.0 | 0.0 | 1.186 |
| DebertaForQuestionAnswering | 4 | 0.9307 | 0.7473 | 0.7971 | 0.0 | 1.1787 |
| DistilBertForMaskedLM | 16 | 1.0282 | 0.9804 | 0.0 | 0.0 | 1.1665 |
| PLBartForCausalLM | 16 | 1.0148 | 0.9447 | 0.0 | 0.0 | 1.1599 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0103 | 0.9364 | 0.0 | 0.0 | 1.1574 |
| BartForCausalLM | 2 | 0.9992 | 0.9654 | 0.0 | 0.0 | 1.1029 |
| RobertaForQuestionAnswering | 64 | 0.9986 | 0.9812 | 0.0 | 0.0 | 1.1015 |
| BertForQuestionAnswering | 64 | 0.9987 | 0.9812 | 0.0 | 0.0 | 1.0921 |
| BigBird | 1 | 0.996 | 0.9401 | 0.0 | 0.0 | 1.0903 |
| MBartForCausalLM | 16 | 1.0061 | 0.9666 | 0.0 | 0.0 | 1.0422 |
| BertForMaskedLM | 64 | 0.9993 | 0.9612 | 0.0 | 0.0 | 1.0404 |
| DebertaForMaskedLM | 4 | 0.9338 | 0.8099 | 0.7224 | 0.0 | 1.0183 |
| BlenderbotSmallForCausalLM | 64 | 1.001 | 0.9056 | 0.0 | 0.0 | 1.0071 |
| AllenaiLongformerBase | 1 | 0.9525 | 0.8694 | 0.7836 | 0.0 | 0.0 |
| ElectraForQuestionAnswering | 64 | 0.9988 | 0.9837 | 0.0 | 0.0 | 0.0 |
| LayoutLMForMaskedLM | 16 | 0.9989 | 0.9699 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy

~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
| GoogleFnet | 1 | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BigBird | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| CamemBert | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistilBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistillGPT2 | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| ElectraForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MegatronBertForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PLBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PegasusForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| RobertaForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| BartForConditionalGeneration | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| M2M100ForConditionalGeneration | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| XGLMForCausalLM | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
| XLNetLMHeadModel | 4 | 17.6114 | 35.8594 | nan | nan | 311.0562 |
| MobileBertForMaskedLM | 16 | 134.9363 | 159.1989 | nan | nan | 285.2019 |
| MobileBertForQuestionAnswering | 32 | 131.2332 | 154.9954 | nan | nan | 264.1833 |
| M2M100ForConditionalGeneration | 2 | 25.6915 | 36.5483 | nan | nan | 249.2139 |
| MT5ForConditionalGeneration | 2 | 6.4578 | 16.8657 | nan | nan | 202.2192 |
| T5ForConditionalGeneration | 4 | 3.7498 | 10.6849 | nan | nan | 200.4614 |
| MBartForConditionalGeneration | 8 | 26.4326 | 39.5871 | nan | nan | 185.0177 |
| PegasusForConditionalGeneration | 4 | 25.99 | 38.2544 | nan | nan | 176.0357 |
| BartForConditionalGeneration | 1 | 26.0076 | 38.7586 | nan | nan | 171.48 |
| YituTechConvBert | 1 | 8.9211 | 16.6277 | nan | nan | 171.2474 |
| XGLMForCausalLM | 1 | 14.9478 | 24.9402 | nan | nan | 165.6688 |
| DebertaForMaskedLM | 4 | 6.9897 | 13.3816 | 50.5707 | nan | 159.4509 |
| MegatronBertForCausalLM | 2 | 16.6098 | 26.1689 | nan | nan | 156.6464 |
| T5Small | 1 | 3.849 | 10.5757 | nan | nan | 156.5866 |
| MegatronBertForQuestionAnswering | 8 | 16.2237 | 26.0801 | nan | nan | 152.5495 |
| PLBartForConditionalGeneration | 8 | 7.1207 | 13.3694 | nan | nan | 148.7384 |
| BlenderbotSmallForConditionalGeneration | 32 | 11.9825 | 20.3854 | nan | nan | 134.417 |
| DebertaForQuestionAnswering | 4 | 6.9424 | 13.2729 | 50.3621 | nan | 120.3692 |
| RobertaForCausalLM | 4 | 5.0343 | 9.8896 | nan | nan | 108.708 |
| LayoutLMForSequenceClassification | 16 | 5.2725 | 10.0526 | nan | nan | 102.2804 |
| PegasusForCausalLM | 8 | 9.8065 | 14.5866 | nan | nan | 98.5429 |
| MBartForCausalLM | 16 | 9.8899 | 14.5906 | nan | nan | 91.5006 |
| OPTForCausalLM | 4 | 4.6313 | 9.4156 | nan | nan | 88.7484 |
| BertForMaskedLM | 64 | 5.0508 | 9.6996 | nan | nan | 87.3848 |
| BartForCausalLM | 2 | 9.8513 | 14.44 | nan | nan | 87.1208 |
| GPT2ForSequenceClassification | 4 | 3.4283 | 7.9924 | nan | nan | 86.486 |
| TrOCRForCausalLM | 8 | 9.7915 | 14.4556 | nan | nan | 78.9711 |
| DistillGPT2 | 1 | 1.4429 | 3.7509 | nan | nan | 75.3892 |
| ElectraForCausalLM | 1 | 5.088 | 9.8527 | nan | nan | 72.4505 |
| PLBartForCausalLM | 16 | 3.2335 | 5.4969 | nan | nan | 70.0976 |
| CamemBert | 1 | 4.996 | 9.9911 | nan | nan | 68.592 |
| DistilBertForQuestionAnswering | 32 | 1.7309 | 4.1188 | nan | nan | 68.2232 |
| Speech2Text2ForCausalLM | 64 | 3.1456 | 5.399 | nan | nan | 67.8609 |
| BlenderbotSmallForCausalLM | 64 | 4.796 | 7.892 | nan | nan | 67.5137 |
| RobertaForQuestionAnswering | 64 | 4.8999 | 9.8389 | nan | nan | 66.2968 |
| BertForQuestionAnswering | 64 | 4.8621 | 9.7433 | nan | nan | 65.7811 |
| AlbertForMaskedLM | 2 | 1.2227 | 6.2484 | nan | nan | 65.4464 |
| BigBird | 1 | 11.1119 | 16.9625 | nan | nan | 58.5731 |
| DistilBertForMaskedLM | 16 | 1.7176 | 4.1522 | nan | nan | 51.726 |
| AlbertForQuestionAnswering | 2 | 1.2187 | 6.0422 | nan | nan | 45.4511 |
| GoogleFnet | 1 | 1.9996 | 4.2959 | nan | 10.5864 | 44.8824 |
| AllenaiLongformerBase | 1 | 11.654 | 19.7453 | 85.668 | nan | nan |
| LayoutLMForMaskedLM | 16 | 5.3735 | 10.207 | nan | nan | nan |
| ElectraForQuestionAnswering | 64 | 4.9237 | 9.746 | nan | nan | nan |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| GPT2ForSequenceClassification | 4 | 0.9342 | 0.9091 | nan | nan | 1.0318 |
| XLNetLMHeadModel | 4 | 1.0001 | 0.8976 | nan | nan | 0.9717 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | nan | nan | 0.9339 |
| BertForQuestionAnswering | 64 | 1.0 | 0.9467 | nan | nan | 0.9145 |
| RobertaForQuestionAnswering | 64 | 1.0 | 0.9467 | nan | nan | 0.9145 |
| T5Small | 1 | 1.0 | 0.9325 | nan | nan | 0.8445 |
| DistilBertForQuestionAnswering | 32 | 1.0 | 0.9046 | nan | nan | 0.8394 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | nan | nan | 0.8321 |
| BartForCausalLM | 2 | 1.0 | 0.8847 | nan | nan | 0.8303 |
| BigBird | 1 | 1.0001 | 0.9549 | nan | nan | 0.8224 |
| DistilBertForMaskedLM | 16 | 0.9998 | 0.9138 | nan | nan | 0.8055 |
| PLBartForCausalLM | 16 | 0.9997 | 0.8802 | nan | nan | 0.8028 |
| MBartForCausalLM | 16 | 1.0 | 0.8629 | nan | nan | 0.8005 |
| DistillGPT2 | 1 | 1.0003 | 0.7721 | nan | nan | 0.7997 |
| Speech2Text2ForCausalLM | 64 | 1.0 | 0.88 | nan | nan | 0.7768 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.7754 |
| XGLMForCausalLM | 1 | 0.9999 | 0.9999 | nan | nan | 0.7728 |
| BartForConditionalGeneration | 1 | 1.0 | 0.8465 | nan | nan | 0.7708 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0 | 0.9036 | nan | nan | 0.7612 |
| PLBartForConditionalGeneration | 8 | 0.9997 | 0.8222 | nan | nan | 0.7547 |
| CamemBert | 1 | 0.998 | 0.7977 | nan | nan | 0.7369 |
| YituTechConvBert | 1 | 0.9858 | 0.7923 | nan | nan | 0.7299 |
| TrOCRForCausalLM | 8 | 1.0 | 0.8048 | nan | nan | 0.7284 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | nan | nan | 0.7277 |
| MBartForConditionalGeneration | 8 | 1.0 | 0.8137 | nan | nan | 0.727 |
| OPTForCausalLM | 4 | 0.9979 | 0.75 | nan | nan | 0.714 |
| RobertaForCausalLM | 4 | 0.9058 | 0.7778 | nan | nan | 0.7099 |
| PegasusForCausalLM | 8 | 1.0 | 0.9323 | nan | nan | 0.7012 |
| MegatronBertForQuestionAnswering | 8 | 0.923 | 0.8265 | nan | nan | 0.6997 |
| GoogleFnet | 1 | 1.0003 | 0.9447 | nan | 1.0813 | 0.6953 |
| M2M100ForConditionalGeneration | 2 | 0.9783 | 0.9777 | nan | nan | 0.6688 |
| MegatronBertForCausalLM | 2 | 0.7066 | 0.7066 | nan | nan | 0.6453 |
| PegasusForConditionalGeneration | 4 | 0.9721 | 0.9004 | nan | nan | 0.642 |
| MT5ForConditionalGeneration | 2 | 0.6173 | 0.6173 | nan | nan | 0.6173 |
| AlbertForQuestionAnswering | 2 | 1.0 | 0.9369 | nan | nan | 0.6126 |
| ElectraForCausalLM | 1 | 1.0 | 0.9107 | nan | nan | 0.6123 |
| AlbertForMaskedLM | 2 | 0.9999 | 0.9172 | nan | nan | 0.6027 |
| MobileBertForMaskedLM | 16 | 0.9997 | 0.9179 | nan | nan | 0.5861 |
| MobileBertForQuestionAnswering | 32 | 1.0 | 0.9716 | nan | nan | 0.4668 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.352 | nan | 0.4265 |
| DebertaForQuestionAnswering | 4 | 0.9845 | 1.0525 | 0.3276 | nan | 0.3569 |
| AllenaiLongformerBase | 1 | 0.9988 | 0.9515 | 0.3143 | nan | nan |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | nan |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | nan |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~

timm_models suite with float32 precision

Performance speedup

~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| hrnet_w18 | 2 | 1.0063 | 1.0839 | 0.0 | 1.4454 | 4.4633 |
| res2net50_14w_8s | 2 | 1.0006 | 1.0247 | 0.0 | 1.4422 | 4.1571 |
| res2next50 | 2 | 1.0002 | 1.0372 | 0.0 | 1.3702 | 4.1399 |
| coat_lite_mini | 128 | 0.9999 | 0.9989 | 0.0 | 1.0734 | 1.7041 |
| ghostnet_100 | 128 | 0.9986 | 0.9941 | 0.0 | 1.243 | 1.6133 |
| tnt_s_patch16_224 | 64 | 0.9995 | 0.9981 | 0.0 | 1.5571 | 1.5001 |
| twins_pcpvt_base | 32 | 1.0054 | 0.9735 | 0.0 | 1.2889 | 1.4364 |
| xcit_large_24_p8_224 | 5 | 1.0011 | 0.9919 | 0.0 | 0.0 | 1.4302 |
| crossvit_9_240 | 64 | 1.0068 | 0.9963 | 0.0 | 1.062 | 1.4092 |
| volo_d1_224 | 64 | 0.9996 | 0.9949 | 0.0 | 1.1385 | 1.4042 |
| nfnet_l0 | 64 | 1.0001 | 0.797 | 0.0 | 1.0495 | 1.381 |
| gmixer_24_224 | 64 | 0.9991 | 0.8429 | 0.0 | 0.9957 | 1.3504 |
| jx_nest_base | 32 | 0.9992 | 0.9943 | 0.0 | 1.2244 | 1.2882 |
| convit_base | 32 | 0.9991 | 0.995 | 0.0 | 1.1931 | 1.2615 |
| lcnet_050 | 128 | 0.9547 | 0.9495 | 0.0 | 1.5025 | 1.2406 |
| cait_m36_384 | 2 | 0.9979 | 0.9981 | 0.0 | 0.9962 | 1.2022 |
| convnext_base | 32 | 0.9992 | 0.9967 | 0.0 | 1.0434 | 1.172 |
| gmlp_s16_224 | 64 | 0.9991 | 0.996 | 0.0 | 0.9991 | 1.142 |
| beit_base_patch16_224 | 64 | 0.9998 | 0.9813 | 0.0 | 0.9537 | 1.1226 |
| regnety_002 | 128 | 0.9787 | 0.9996 | 0.0 | 1.3613 | 1.1081 |
| deit_base_distilled_patch16_224 | 64 | 0.9997 | 0.998 | 0.0 | 1.019 | 1.1058 |
| vit_base_patch16_224 | 64 | 0.9997 | 0.9982 | 0.0 | 0.9781 | 1.0981 |
| mixer_b16_224 | 64 | 0.9996 | 0.9968 | 0.0 | 0.9838 | 1.0523 |
| mixnet_l | 64 | 0.971 | 0.8727 | 0.0 | 1.0065 | 1.0458 |
| tf_mixnet_l | 64 | 0.9718 | 0.8763 | 0.0 | 1.0061 | 1.0239 |
| dpn107 | 32 | 0.9587 | 0.9505 | 0.0 | 1.0289 | 1.0034 |
| dla102 | 64 | 0.9995 | 0.9965 | 0.0 | 1.2853 | 0.9963 |
| resmlp_12_224 | 128 | 0.9997 | 0.9986 | 0.0 | 0.0 | 0.9746 |
| resnest101e | 32 | 1.0033 | 1.0192 | 0.0 | 1.1978 | 0.9554 |
| tf_efficientnet_b0 | 128 | 0.977 | 0.7833 | 0.0 | 0.9847 | 0.8973 |
| repvgg_a2 | 128 | 0.9645 | 0.9628 | 0.0 | 1.1198 | 0.891 |
| selecsls42b | 128 | 0.9994 | 0.9981 | 0.0 | 1.2083 | 0.8872 |
| spnasnet_100 | 128 | 0.9614 | 0.9577 | 0.0 | 1.1368 | 0.886 |
| visformer_small | 128 | 1.0 | 1.0012 | 0.0 | 1.0216 | 0.8772 |
| fbnetv3_b | 128 | 0.965 | 0.9616 | 0.0 | 1.1289 | 0.8724 |
| gernet_l | 128 | 0.9735 | 0.9722 | 0.0 | 1.0981 | 0.8702 |
| mnasnet_100 | 128 | 0.9667 | 0.9638 | 0.0 | 1.1557 | 0.8485 |
| mobilenetv3_large_100 | 128 | 0.965 | 0.9626 | 0.0 | 1.1636 | 0.8457 |
| cspdarknet53 | 64 | 0.9583 | 0.9521 | 0.0 | 1.1839 | 0.8444 |
| tinynet_a | 128 | 0.9667 | 0.776 | 0.0 | 0.9711 | 0.8364 |
| mobilevit_s | 32 | 0.9725 | 0.7645 | 0.0 | 0.9873 | 0.8216 |
| eca_botnext26ts_256 | 64 | 0.973 | 0.7708 | 0.0 | 1.0167 | 0.7978 |
| sebotnet33ts_256 | 64 | 0.9759 | 0.8072 | 0.0 | 1.0536 | 0.7733 |
| eca_halonext26ts | 64 | 0.9743 | 0.7761 | 0.0 | 1.0143 | 0.7709 |
| fbnetc_100 | 128 | 0.9668 | 0.9628 | 0.0 | 1.1885 | 0.7567 |
| res2net101_26w_4s | 64 | 0.9988 | 0.9967 | 0.0 | 1.1758 | 0.7474 |
| rexnet_100 | 128 | 0.9724 | 0.8167 | 0.0 | 0.9834 | 0.676 |
| mobilenetv2_100 | 128 | 0.9668 | 0.9633 | 0.0 | 1.0116 | 0.669 |
| ese_vovnet19b_dw | 128 | 0.9789 | 0.9774 | 0.0 | 1.1447 | 0.6203 |
| botnet26t_256 | 128 | 0.9859 | 0.9852 | 0.0 | 1.2245 | 0.0 |
| dm_nfnet_f0 | 128 | 0.9993 | 0.9997 | 0.0 | 1.2107 | 0.0 |
| adv_inception_v3 | 128 | 0.9998 | 0.9971 | 0.0 | 1.1256 | 0.0 |
| gluon_inception_v3 | 128 | 1.0 | 0.9985 | 0.0 | 1.1248 | 0.0 |
| inception_v3 | 128 | 0.9998 | 0.9968 | 0.0 | 1.1246 | 0.0 |
| swsl_resnext101_32x16d | 32 | 0.9995 | 0.9987 | 0.0 | 1.108 | 0.0 |
| pnasnet5large | 16 | 0.9988 | 0.9979 | 0.0 | 1.083 | 0.0 |
| convmixer_768_32 | 32 | 1.0003 | 0.9997 | 0.0 | 1.061 | 0.0 |
| pit_b_224 | 64 | 0.9998 | 0.9973 | 0.0 | 1.0594 | 0.0 |
| gluon_xception65 | 32 | 0.9995 | 0.9967 | 0.0 | 1.0398 | 0.0 |
| poolformer_m36 | 64 | 0.9994 | 0.9967 | 0.0 | 1.0063 | 0.0 |
| swin_base_patch4_window7_224 | 64 | 0.9996 | 0.9715 | 0.0 | 1.003 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy

~~~
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| convnext_base | 2 | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass |
| adv_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| botnet26t_256 | 2 | pass | pass | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | fail_to_run | pass | pass |
| crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass |
| cspdarknet53 | 2 | pass | pass | fail_to_run | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| dla102 | 2 | pass | pass | fail_to_run | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | fail_to_run | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | fail_to_run | pass | pass |
| eca_halonext26ts | 2 | pass | pass | fail_to_run | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | fail_to_run | pass | pass |
| gernet_l | 2 | pass | pass | fail_to_run | pass | pass |
| ghostnet_100 | 2 | pass | pass | fail_to_run | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass |
| hrnet_w18 | 2 | pass | pass | fail_to_run | pass | pass |
| inception_v3 | 2 | pass | pass | fail_to_run | pass | pass |
| lcnet_050 | 2 | pass | pass | fail_to_run | pass | pass |
| mixnet_l | 2 | pass | pass | fail_to_run | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | fail_to_run | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | fail_to_run | pass | pass |
| nfnet_l0 | 2 | pass | pass | fail_to_run | pass | pass |
| pnasnet5large | 2 | pass | pass | fail_to_run | pass | pass |
| regnety_002 | 2 | pass | pass | fail_to_run | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | fail_to_run | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | fail_to_run | pass | pass |
| res2next50 | 2 | pass | pass | fail_to_run | pass | pass |
| rexnet_100 | 2 | pass | pass | fail_to_run | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | fail_to_run | pass | pass |
| selecsls42b | 2 | pass | pass | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | fail_to_run | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | fail_to_run | pass | pass |
| tf_mixnet_l | 2 | pass | pass | fail_to_run | pass | pass |
| tinynet_a | 2 | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| visformer_small | 2 | pass | pass | fail_to_run | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass |
| gluon_xception65 | 2 | pass | pass | fail_to_run | fail_accuracy | pass |
| poolformer_m36 | 2 | pass | pass | fail_to_run | fail_accuracy | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| jx_nest_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| pit_b_224 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| twins_pcpvt_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | fail_to_run | pass | fail_accuracy |
| resnest101e | 2 | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
~~~

Compilation latency (sec)

~~~
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
| hrnet_w18 | 2 | 99.0925 | 129.1429 | nan | 302.8547 | 1305.5451 |
| dpn107 | 32 | 13.3369 | 24.7075 | nan | 86.8976 | 1280.7288 |
| rexnet_100 | 128 | 6.4069 | 11.8586 | nan | 107.4496 | 988.7111 |
| res2net50_14w_8s | 2 | 19.6264 | 33.3971 | nan | 88.2353 | 936.7276 |
| mobilevit_s | 32 | 5.7236 | 11.1465 | nan | 45.981 | 799.1906 |
| mixnet_l | 64 | 13.2916 | 20.3439 | nan | 69.4935 | 778.2989 |
| eca_botnext26ts_256 | 64 | 2.5909 | 6.1519 | nan | 49.7554 | 723.0911 |
| ghostnet_100 | 128 | 9.0362 | 15.9147 | nan | 66.3032 | 685.6784 |
| tinynet_a | 128 | 7.425 | 13.1769 | nan | 67.229 | 660.1354 |
| fbnetv3_b | 128 | 12.7368 | 20.2478 | nan | 86.4999 | 650.6102 |
| fbnetc_100 | 128 | 5.4761 | 10.6475 | nan | 49.3807 | 636.7924 |
| twins_pcpvt_base | 32 | 25.3344 | 36.9167 | nan | 68.5287 | 623.4564 |
| resnest101e | 32 | 26.2018 | 40.9862 | nan | 100.5019 | 610.1099 |
| coat_lite_mini | 128 | 3.0107 | 7.0432 | nan | 16.4276 | 607.5914 |
| res2net101_26w_4s | 64 | 25.6881 | 41.7492 | nan | 105.5198 | 529.815 |
| res2next50 | 2 | 7.2984 | 14.4734 | nan | 48.4801 | 510.8523 |
| dla102 | 64 | 10.5521 | 19.1407 | nan | 71.9313 | 507.7483 |
| sebotnet33ts_256 | 64 | 3.8312 | 8.4416 | nan | 53.5113 | 491.4471 |
| tf_mixnet_l | 64 | 13.42 | 20.5125 | nan | 70.1229 | 489.7827 |
| cspdarknet53 | 64 | 6.0697 | 11.541 | nan | 51.9091 | 486.8412 |
| mnasnet_100 | 128 | 4.1071 | 7.8077 | nan | 40.3807 | 437.7587 |
| tf_efficientnet_b0 | 128 | 5.6858 | 10.6249 | nan | 65.6932 | 426.6445 |
| eca_halonext26ts | 64 | 2.5793 | 6.4106 | nan | 51.7275 | 422.1906 |
| regnety_002 | 128 | 4.7761 | 9.4998 | nan | 50.0005 | 380.0622 |
| ese_vovnet19b_dw | 128 | 1.9265 | 4.0512 | nan | 31.7936 | 376.7527 |
| convnext_base | 32 | 11.4469 | 15.8597 | nan | 30.6712 | 366.3008 |
| mobilenetv2_100 | 128 | 3.9971 | 7.732 | nan | 40.4497 | 363.7838 |
| spnasnet_100 | 128 | 5.3407 | 10.1369 | nan | 47.4009 | 351.9616 |
| xcit_large_24_p8_224 | 5 | 37.1866 | 52.5417 | nan | nan | 332.3336 |
| jx_nest_base | 32 | 9.9406 | 17.229 | nan | 66.5254 | 311.8953 |
| mobilenetv3_large_100 | 128 | 4.3523 | 8.1262 | nan | 67.3167 | 311.1143 |
| visformer_small | 128 | 2.3158 | 5.403 | nan | 25.7883 | 310.6265 |
| cait_m36_384 | 2 | 47.2186 | 64.0945 | nan | 90.7984 | 298.0052 |
| crossvit_9_240 | 64 | 7.4081 | 13.6019 | nan | 32.2106 | 266.0203 |
| selecsls42b | 128 | 2.3164 | 5.4867 | nan | 42.0583 | 257.8308 |
| gernet_l | 128 | 4.8222 | 9.2556 | nan | 39.347 | 251.3237 |
| lcnet_050 | 128 | 1.9314 | 4.1819 | nan | 32.1152 | 232.143 |
| volo_d1_224 | 64 | 6.5276 | 12.7236 | nan | 32.8592 | 182.781 |
| convit_base | 32 | 3.8981 | 8.8328 | nan | 21.0897 | 177.8059 |
| gmlp_s16_224 | 64 | 9.0879 | 14.1942 | nan | 21.4561 | 145.7858 |
| tnt_s_patch16_224 | 64 | 12.1016 | 20.3575 | nan | 34.7907 | 143.4652 |
| gmixer_24_224 | 64 | 8.4244 | 14.0469 | nan | 23.4351 | 135.0839 |
| repvgg_a2 | 128 | 4.7534 | 9.049 | nan | 47.3708 | 128.289 |
| nfnet_l0 | 64 | 5.8266 | 11.3992 | nan | 31.553 | 103.5121 |
| resmlp_12_224 | 128 | 2.6977 | 5.0748 | nan | nan | 101.4346 |
| mixer_b16_224 | 64 | 2.8858 | 5.2583 | nan | 13.4757 | 97.4849 |
| deit_base_distilled_patch16_224 | 64 | 3.0426 | 6.3566 | nan | 13.0728 | 78.6364 |
| beit_base_patch16_224 | 64 | 4.4735 | 8.5964 | nan | 18.2085 | 75.1216 |
| vit_base_patch16_224 | 64 | 2.8552 | 6.5407 | nan | 11.5077 | 59.9998 |
| pnasnet5large | 16 | 60.8211 | 79.9493 | nan | 183.1858 | nan |
| inception_v3 | 128 | 8.3458 | 15.9807 | nan | 75.3239 | nan |
| adv_inception_v3 | 128 | 8.5007 | 15.7832 | nan | 75.0367 | nan |
| gluon_inception_v3 | 128 | 8.1377 | 16.0286 | nan | 74.6521 | nan |
| swin_base_patch4_window7_224 | 64 | 11.8907 | 22.2574 | nan | 68.2608 | nan |
| gluon_xception65 | 32 | 14.9179 | 24.5631 | nan | 55.7975 | nan |
| swsl_resnext101_32x16d | 32 | 10.1223 | 18.5382 | nan | 49.2201 | nan |
| botnet26t_256 | 128 | 2.287 | 5.453 | nan | 42.0242 | nan |
| dm_nfnet_f0 | 128 | 6.4591 | 11.8769 | nan | 34.8338 | nan |
| poolformer_m36 | 64 | 13.0015 | 19.6132 | nan | 34.8218 | nan |
| convmixer_768_32 | 32 | 6.7607 | 11.8459 | nan | 19.5188 | nan |
| pit_b_224 | 64 | 3.6984 | 7.7193 | nan | 15.3124 | nan |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
~~~

Peak Memory Compression Ratio

~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| gmixer_24_224 | 64 | 0.9992 | 0.9684 | nan | 0.9825 | 1.3808 |
| nfnet_l0 | 64 | 1.0008 | 0.8298 | nan | 0.813 | 1.2558 |
| tinynet_a | 128 | 1.0 | 0.7831 | nan | 0.7845 | 1.1735 |
| eca_halonext26ts | 64 | 1.0 | 0.7717 | nan | 0.7731 | 1.1316 |
| rexnet_100 | 128 | 0.9992 | 0.7879 | nan | 0.871 | 1.1072 |
| convit_base | 32 | 1.0001 | 0.8879 | nan | 0.9506 | 1.068 |
| mobilenetv2_100 | 128 | 0.9998 | 0.7664 | nan | 0.7679 | 1.0051 |
| mobilevit_s | 32 | 0.9999 | 0.7692 | nan | 0.7431 | 1.0012 |
| dla102 | 64 | 0.9881 | 0.9181 | nan | 0.9541 | 1.001 |
| eca_botnext26ts_256 | 64 | 1.0 | 0.7705 | nan | 0.7679 | 0.9703 |
| tf_mixnet_l | 64 | 1.0001 | 0.861 | nan | 0.8605 | 0.9698 |
| cait_m36_384 | 2 | 1.0001 | 0.9024 | nan | 0.9202 | 0.9451 |
| tf_efficientnet_b0 | 128 | 0.9998 | 0.7727 | nan | 0.8426 | 0.9413 |
| mixer_b16_224 | 64 | 0.9956 | 0.9615 | nan | 0.8644 | 0.9357 |
| beit_base_patch16_224 | 64 | 1.0 | 0.9575 | nan | 0.8606 | 0.9272 |
| gmlp_s16_224 | 64 | 1.0 | 0.9766 | nan | 0.966 | 0.9267 |
| vit_base_patch16_224 | 64 | 0.9963 | 0.9469 | nan | 0.8229 | 0.915 |
| tnt_s_patch16_224 | 64 | 1.0001 | 0.9752 | nan | 0.8518 | 0.9131 |
| volo_d1_224 | 64 | 0.9999 | 0.9247 | nan | 0.7472 | 0.9124 |
| deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9476 | nan | 0.8242 | 0.9095 |
| spnasnet_100 | 128 | 1.0005 | 0.9207 | nan | 0.8496 | 0.9024 |
| selecsls42b | 128 | 0.9883 | 0.8982 | nan | 0.9039 | 0.9 |
| mixnet_l | 64 | 0.9995 | 0.8486 | nan | 0.7938 | 0.8993 |
| mobilenetv3_large_100 | 128 | 1.0002 | 0.8686 | nan | 0.8819 | 0.8982 |
| xcit_large_24_p8_224 | 5 | 0.9999 | 0.9206 | nan | nan | 0.8952 |
| resnest101e | 32 | 1.0 | 0.9458 | nan | 0.9449 | 0.8922 |
| ghostnet_100 | 128 | 0.9998 | 0.8872 | nan | 0.947 | 0.8888 |
| visformer_small | 128 | 0.9943 | 0.9442 | nan | 0.9475 | 0.8883 |
| fbnetv3_b | 128 | 0.9995 | 0.7866 | nan | 0.7861 | 0.8837 |
| dpn107 | 32 | 0.9997 | 0.9285 | nan | 0.8949 | 0.8762 |
| convnext_base | 32 | 1.0001 | 0.9077 | nan | 0.7678 | 0.8761 |
| twins_pcpvt_base | 32 | 1.0002 | 0.9127 | nan | 0.8351 | 0.8723 |
| cspdarknet53 | 64 | 1.0 | 0.8562 | nan | 0.8797 | 0.8624 |
| jx_nest_base | 32 | 1.0017 | 0.898 | nan | 0.7112 | 0.8574 |
| ese_vovnet19b_dw | 128 | 0.9999 | 0.8938 | nan | 0.9369 | 0.8467 |
| sebotnet33ts_256 | 64 | 1.0 | 0.7109 | nan | 0.6852 | 0.841 |
| resmlp_12_224 | 128 | 0.9893 | 0.9525 | nan | nan | 0.8169 |
| res2net101_26w_4s | 64 | 1.0001 | 0.9307 | nan | 0.8959 | 0.8168 |
| crossvit_9_240 | 64 | 1.0001 | 0.8721 | nan | 0.729 | 0.8108 |
| mnasnet_100 | 128 | 1.0003 | 0.9126 | nan | 0.8368 | 0.7984 |
| coat_lite_mini | 128 | 1.0049 | 0.8826 | nan | 0.7873 | 0.79 |
| lcnet_050 | 128 | 1.0005 | 0.7721 | nan | 0.7722 | 0.7579 |
| regnety_002 | 128 | 0.9981 | 0.829 | nan | 0.7759 | 0.7465 |
| gernet_l | 128 | 1.0 | 0.7965 | nan | 0.8012 | 0.727 |
| fbnetc_100 | 128 | 0.9998 | 0.8597 | nan | 0.7507 | 0.7246 |
| hrnet_w18 | 2 | 0.9986 | 0.8792 | nan | 0.8869 | 0.6089 |
| res2next50 | 2 | 1.0 | 0.8353 | nan | 0.8404 | 0.5946 |
| res2net50_14w_8s | 2 | 1.0 | 0.8387 | nan | 0.8474 | 0.5879 |
| repvgg_a2 | 128 | 1.0003 | 0.8145 | nan | 0.6633 | 0.536 |
| pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.2062 | nan |
| convmixer_768_32 | 32 | 1.0 | 0.9868 | nan | 0.9807 | nan |
| dm_nfnet_f0 | 128 | 0.9393 | 0.897 | nan | 0.9515 | nan |
| poolformer_m36 | 64 | 1.0003 | 0.9533 | nan | 0.9368 | nan |
| gluon_xception65 | 32 | 0.9999 | 0.9384 | nan | 0.9001 | nan |
| adv_inception_v3 | 128 | 1.0002 | 0.8694 | nan | 0.88 | nan |
| gluon_inception_v3 | 128 | 1.0002 | 0.8694 | nan | 0.88 | nan |
| inception_v3 | 128 | 1.0002 | 0.8694 | nan | 0.88 | nan |
| swsl_resnext101_32x16d | 32 | 1.0003 | 0.8983 | nan | 0.8684 | nan |
| swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9309 | nan | 0.83 | nan |
| botnet26t_256 | 128 | 1.0 | 0.8494 | nan | 0.7497 | nan |
| pit_b_224 | 64 | 0.9992 | 0.7962 | nan | 0.6417 | nan |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/dfDfsDp.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/phbpNnQ.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/LBy7uox.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs, and each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and the gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.
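The accuracy check described above can be sketched in plain Python. This is a simplified illustration, not the benchmark harness itself: `outputs_match`, `accuracy_status`, the `run_backend` callable and the tolerance values are all hypothetical, and real tensors are replaced with flat lists of floats. Only the three status labels (`pass`, `fail_accuracy`, `fail_to_run`) come from the accuracy tables in this thread.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-3):
    """Return True if two flat float sequences agree elementwise within tolerance.

    The tolerance values are assumptions chosen for illustration only.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def accuracy_status(ref_outputs, ref_grads, run_backend):
    """Classify one backend run against the native (eager) reference.

    run_backend is a hypothetical callable returning (outputs, grads).
    """
    try:
        outputs, grads = run_backend()
    except Exception:
        # The backend could not compile or execute the model at all.
        return "fail_to_run"
    if outputs_match(ref_outputs, outputs) and outputs_match(ref_grads, grads):
        return "pass"
    return "fail_accuracy"
```

For example, `accuracy_status([1.0, 2.0], [0.5], lambda: ([1.0, 2.0001], [0.5]))` classifies the run as `pass`, while a callable that raises is classified as `fail_to_run`.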

When computing speedup, compilation latency and memory footprint reduction, we exclude the models that fail accuracy checks.
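A minimal sketch of that filtering step, combined with the geometric mean used for the speedup summary. The function name `geomean_speedup` and the input values are made up for illustration. Returning 0.0 when no model passes is an assumption, although it is consistent with the `0.0x` entries shown for a backend with a 0% passrate.

```python
import math


def geomean_speedup(speedups, statuses):
    """Geometric mean of speedups over models whose accuracy status is 'pass'."""
    kept = [s for s, status in zip(speedups, statuses) if status == "pass" and s > 0]
    if not kept:
        # Assumption: report 0.0 when no model passes the accuracy check.
        return 0.0
    # exp(mean(log(x))) is the geometric mean and avoids overflow on long lists.
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


# Toy values, not taken from the dashboard: the third model is dropped
# because it fails accuracy, leaving the geometric mean of 2.0 and 8.0.
print(round(geomean_speedup([2.0, 8.0, 3.0], ["pass", "pass", "fail_accuracy"]), 6))  # 4.0
```

The geometric mean is the usual choice for averaging ratios such as speedups, because a 2x gain and a 0.5x loss cancel out to 1x instead of averaging to 1.25x.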

Passrate

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      | 98%, 52/53 | 98%, 42/43  | 100%, 61/61 |
|   aot_eager    | 98%, 52/53 | 98%, 42/43  | 90%, 55/61  |
| aot_cudagraphs | 28%, 15/53 |  2%, 1/43   |  8%, 5/61   |
|  aot_nvfuser   | 60%, 32/53 |  0%, 0/43   | 75%, 46/61  |
|    inductor    | 83%, 44/53 | 86%, 37/43  | 90%, 55/61  |
+----------------+------------+-------------+-------------+
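
The cells above pair a rounded percentage with a raw count, for example `98%, 52/53`. A minimal sketch of how such a cell could be produced from a list of per-model accuracy statuses follows. The helper name and the assumption that only the status `pass` counts toward the numerator are illustrative, not taken from the dashboard code.

```python
from typing import Iterable, Sequence


def pass_rate(statuses: Sequence[str], passing: Iterable[str] = ("pass",)) -> str:
    """Format a pass rate as '<percent>%, <passed>/<total>'."""
    passing = set(passing)
    total = len(statuses)
    passed = sum(1 for status in statuses if status in passing)
    # Guard against an empty suite to avoid dividing by zero.
    percent = round(100 * passed / total) if total else 0
    return f"{percent}%, {passed}/{total}"


# Hypothetical status list: 52 passing models and one failure.
print(pass_rate(["pass"] * 52 + ["fail_to_run"]))
```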

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.00x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.09x    |    1.00x    |    1.00x    |
|  aot_nvfuser   |   1.16x    |    0.0x     |    1.20x    |
|    inductor    |   1.70x    |    2.17x    |    1.30x    |
+----------------+------------+-------------+-------------+
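
For reference, a geometric mean of per-model speedups can be computed as the exponential of the mean log speedup. The sketch below assumes that a speedup of `0.0` marks a model that failed and should be skipped, and that an empty set of results is reported as `0.0`. Both conventions are assumptions inferred from how the tables look, not confirmed behavior of the dashboard script.

```python
import math
from typing import Iterable


def geomean_speedup(speedups: Iterable[float]) -> float:
    """Geometric mean of the positive speedups; 0.0 if none are usable."""
    # Assumption: 0.0 (and NaN, which fails the comparison) marks a failed run.
    valid = [s for s in speedups if s > 0.0]
    if not valid:
        return 0.0
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Hypothetical per-model speedups; the 0.0 entry is ignored.
print(f"{geomean_speedup([2.0, 8.0, 0.0]):.2f}x")
```

The geometric mean is the usual choice for averaging ratios, because a 2x speedup and a 0.5x slowdown cancel out to 1.0x.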

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    6.19    |    14.88    |    11.64    |
|   aot_eager    |   12.45    |    25.75    |    19.94    |
| aot_cudagraphs |   13.09    |    92.75    |    51.56    |
|  aot_nvfuser   |   29.54    |     0.0     |    80.08    |
|    inductor    |   271.08   |   116.86    |   450.74    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.85x    |    0.86x    |    0.88x    |
| aot_cudagraphs |   0.43x    |    0.38x    |    0.20x    |
|  aot_nvfuser   |   0.83x    |    0.0x     |    0.85x    |
|    inductor    |   0.78x    |    0.82x    |    0.89x    |
+----------------+------------+-------------+-------------+
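
The report does not spell out how the compression ratio is defined. One reading consistent with "higher is better" is the baseline peak memory divided by the backend peak memory, so a value below 1.0 means the backend used more memory than native pytorch. The sketch below encodes that reading; the definition and the function name are assumptions.

```python
def compression_ratio(baseline_peak_gb: float, backend_peak_gb: float) -> float:
    """Assumed definition: baseline peak memory / backend peak memory."""
    if backend_peak_gb <= 0.0:
        raise ValueError("backend peak memory must be positive")
    return baseline_peak_gb / backend_peak_gb


# Hypothetical peaks: the backend needs 10 GB where the baseline needs 8 GB.
ratio = compression_ratio(8.0, 10.0)
print(f"{ratio:.2f}x")  # below 1.0x, so the backend used more memory
```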

torchbench suite with amp precision

Performance speedup

~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| functorch_dp_cifar10 | 64 | 1.0027 | 0.9251 | 0.0 | 1.1901 | 4.8999 |
| densenet121 | 4 | 1.0013 | 0.9144 | 0.0 | 1.3911 | 4.7967 |
| timm_efficientdet | 1 | 0.9864 | 0.789 | 0.0 | 0.0 | 4.1288 |
| BERT_pytorch | 16 | 1.0115 | 0.8389 | 0.0 | 0.0 | 3.1411 |
| timm_vision_transformer | 8 | 1.0012 | 0.8564 | 0.0 | 1.3359 | 3.0906 |
| drq | 1 | 1.0045 | 0.8048 | 0.0 | 1.0807 | 2.8848 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9951 | 0.9111 | 1.3023 | 1.2173 | 2.8184 |
| resnet18 | 16 | 1.0009 | 0.9878 | 0.0 | 1.3374 | 2.7412 |
| dcgan | 32 | 0.9775 | 0.9129 | 1.101 | 0.7342 | 2.5263 |
| squeezenet1_1 | 32 | 0.9953 | 0.9584 | 1.4044 | 1.1906 | 2.5217 |
| hf_Albert | 8 | 1.0009 | 0.9534 | 0.0 | 0.0 | 2.397 |
| hf_Bert | 4 | 1.0348 | 0.8618 | 0.0 | 0.0 | 2.2545 |
| hf_T5 | 8 | 0.9992 | 0.9393 | 0.0 | 0.0 | 2.1467 |
| resnext50_32x4d | 8 | 1.0002 | 0.951 | 0.0 | 1.3302 | 2.1414 |
| hf_T5_large | 2 | 1.0183 | 0.8592 | 0.0 | 0.0 | 2.0785 |
| lennard_jones | 1000 | 0.9794 | 0.7615 | 1.2817 | 1.0468 | 2.0147 |
| mobilenet_v3_large | 32 | 1.0023 | 1.0098 | 0.0 | 1.4107 | 2.0136 |
| pytorch_struct | 200 | 0.9866 | 0.746 | 1.1521 | 1.011 | 2.0082 |
| hf_GPT2 | 4 | 1.014 | 0.9867 | 0.0 | 0.0 | 1.8579 |
| LearningToPaint | 96 | 1.0025 | 1.0068 | 0.0 | 1.355 | 1.8566 |
| mnasnet1_0 | 32 | 0.9949 | 1.0116 | 0.8977 | 1.4086 | 1.8302 |
| hf_Bart | 4 | 1.0161 | 0.8395 | 0.0 | 0.0 | 1.7504 |
| fastNLP_Bert | 6 | 0.9978 | 0.8872 | 0.0 | 0.0 | 1.6528 |
| speech_transformer | 32 | 1.0054 | 0.8358 | 0.0 | 0.0 | 1.6385 |
| attention_is_all_you_need_pytorch | 256 | 1.0061 | 0.8945 | 0.0 | 0.0 | 1.5148 |
| timm_efficientnet | 32 | 0.9619 | 0.8176 | 0.0 | 1.1837 | 1.4918 |
| hf_DistilBert | 8 | 1.0156 | 0.969 | 0.0 | 0.0 | 1.478 |
| soft_actor_critic | 256 | 1.0223 | 0.7463 | 1.261 | 1.0634 | 1.4398 |
| pytorch_unet | 1 | 0.9996 | 0.993 | 0.0 | 1.1553 | 1.3534 |
| pytorch_stargan | 16 | 0.9983 | 1.0034 | 0.8258 | 1.0964 | 1.343 |
| timm_nfnet | 128 | 0.9994 | 0.9988 | 0.0 | 1.1733 | 1.3237 |
| shufflenet_v2_x1_0 | 128 | 0.9995 | 1.0166 | 0.0 | 1.3486 | 1.3069 |
| Super_SloMo | 6 | 0.9998 | 0.9956 | 0.0 | 0.0 | 1.2884 |
| vgg16 | 64 | 0.9999 | 0.9974 | 0.7982 | 0.9961 | 1.2713 |
| Background_Matting | 4 | 0.9996 | 1.0182 | 0.0 | 1.1153 | 1.2157 |
| alexnet | 128 | 0.9993 | 0.9964 | 0.788 | 1.0031 | 1.2097 |
| timm_vision_transformer_large | 8 | 0.9991 | 0.9893 | 0.0 | 0.9929 | 1.1578 |
| hf_Reformer | 4 | 0.9958 | 0.9992 | 0.9196 | 0.0 | 1.1578 |
| timm_resnest | 32 | 1.0025 | 1.0206 | 0.0 | 1.3168 | 1.1577 |
| hf_BigBird | 2 | 0.9911 | 0.9187 | 0.0 | 0.0 | 1.1435 |
| timm_vovnet | 32 | 0.9224 | 0.8875 | 0.0 | 1.1275 | 1.1074 |
| tts_angular | 64 | 1.0135 | 0.9582 | 1.0002 | 0.9789 | 1.0026 |
| demucs | 4 | 1.0019 | 0.9992 | 0.9995 | 0.9981 | 0.9998 |
| nvidia_deeprecommender | 256 | 0.9989 | 0.9958 | 0.6966 | 0.9783 | 0.9901 |
| resnet50 | 32 | 1.0016 | 1.0097 | 0.0 | 1.3632 | 0.9717 |
| moco | 32 | 0.9956 | 0.0 | 0.0 | 0.0 | 0.9496 |
| mobilenet_v2 | 96 | 0.9989 | 0.9866 | 0.0 | 0.9244 | 0.8705 |
| timm_regnet | 32 | 0.9775 | 0.9387 | 0.0 | 1.1858 | 0.8539 |
| yolov3 | 16 | 0.9991 | 0.988 | 0.0 | 0.9136 | 0.0 |
| hf_Longformer | 2 | 0.9636 | 0.877 | 0.8882 | 0.0 | 0.0 |
| dlrm | 2048 | 0.0 | 1.173 | 0.0 | 0.0 | 0.0 |
| hf_GPT2_large | 4 | 0.9995 | 0.9901 | 0.0 | 0.0 | 0.0 |
| tacotron2 | 64 | 0.98 | 0.762 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
~~~

Accuracy

~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| alexnet | 2 | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | fail_to_run | pass | pass |
| LearningToPaint | 2 | pass | pass | fail_to_run | pass | pass |
| densenet121 | 2 | pass | pass | fail_to_run | pass | pass |
| drq | 1 | pass | pass | fail_to_run | pass | pass |
| functorch_dp_cifar10 | 2 | pass | pass | fail_to_run | pass | pass |
| mobilenet_v2 | 2 | pass | pass | fail_to_run | pass | pass |
| pytorch_unet | 2 | pass | pass | fail_to_run | pass | pass |
| resnet18 | 2 | pass | pass | fail_to_run | pass | pass |
| resnet50 | 2 | pass | pass | fail_to_run | pass | pass |
| resnext50_32x4d | 2 | pass | pass | fail_to_run | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | fail_to_run | pass | pass |
| timm_efficientnet | 2 | pass | pass | fail_to_run | pass | pass |
| timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass |
| timm_regnet | 2 | pass | pass | fail_to_run | pass | pass |
| timm_resnest | 2 | pass | pass | fail_to_run | pass | pass |
| timm_vision_transformer | 2 | pass | pass | fail_to_run | pass | pass |
| timm_vovnet | 2 | pass | pass | fail_to_run | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass |
| BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| Super_SloMo | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| fastNLP_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Albert | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_BigBird | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_DistilBert | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_GPT2 | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run |
| tacotron2 | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| mobilenet_v3_large | 2 | pass | pass | fail_to_run | pass | fail_accuracy |
| tts_angular | 2 | pass | pass | pass | pass | 0.0000 |
| yolov3 | 2 | pass | pass | fail_to_run | fail_to_run | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
| timm_efficientdet | 1 | 53.0184 | 79.3008 | nan | nan | 1803.9091 |
| hf_T5_large | 2 | 36.9772 | 75.6557 | nan | nan | 1747.5354 |
| densenet121 | 4 | 13.5869 | 29.3271 | nan | 139.711 | 1688.4482 |
| mobilenet_v3_large | 32 | 3.7848 | 9.2923 | nan | 75.4499 | 896.1598 |
| mnasnet1_0 | 32 | 3.4537 | 8.6408 | 43.7596 | 46.4028 | 854.0232 |
| moco | 32 | 11.75 | nan | nan | nan | 725.5689 |
| mobilenet_v2 | 96 | 3.3075 | 8.4906 | nan | 43.1604 | 646.6585 |
| resnext50_32x4d | 8 | 3.6179 | 9.2283 | nan | 39.0386 | 614.9828 |
| timm_efficientnet | 32 | 5.9682 | 12.6852 | nan | 73.4068 | 568.4268 |
| shufflenet_v2_x1_0 | 128 | 3.7705 | 9.7518 | nan | 41.3993 | 446.0734 |
| timm_nfnet | 128 | 6.8229 | 13.575 | nan | 42.2533 | 420.5468 |
| squeezenet1_1 | 32 | 0.676 | 1.7982 | 8.4293 | 6.8649 | 371.0668 |
| timm_resnest | 32 | 1.4485 | 4.3952 | nan | 43.3534 | 364.8723 |
| timm_regnet | 32 | 8.4026 | 17.2587 | nan | 66.4884 | 343.2572 |
| attention_is_all_you_need_pytorch | 256 | 4.3975 | 13.0252 | nan | nan | 277.6015 |
| timm_vovnet | 32 | 3.0983 | 7.3442 | nan | 32.2979 | 246.8202 |
| speech_transformer | 32 | 7.5464 | 17.1829 | nan | nan | 246.5735 |
| timm_vision_transformer_large | 8 | 23.1783 | 40.2448 | nan | 58.887 | 209.5997 |
| resnet18 | 16 | 1.03 | 3.1392 | nan | 23.6353 | 207.0568 |
| functorch_dp_cifar10 | 64 | 0.8539 | 2.5937 | nan | 6.4768 | 198.2407 |
| timm_vision_transformer | 8 | 3.2 | 8.1832 | nan | 16.3986 | 197.8043 |
| LearningToPaint | 96 | 1.0574 | 3.1713 | nan | 31.0316 | 196.8972 |
| BERT_pytorch | 16 | 5.0826 | 13.8418 | nan | nan | 177.679 |
| hf_T5 | 8 | 3.9598 | 12.752 | nan | nan | 163.3629 |
| Background_Matting | 4 | 4.0825 | 9.3277 | nan | 45.7685 | 157.1204 |
| resnet50 | 32 | 3.4998 | 9.0749 | nan | 43.7996 | 150.7682 |
| hf_Bart | 4 | 7.5111 | 17.2897 | nan | nan | 149.8316 |
| fastNLP_Bert | 6 | 5.3619 | 12.7524 | nan | nan | 148.7053 |
| hf_GPT2 | 4 | 3.5663 | 10.0245 | nan | nan | 134.6332 |
| pytorch_stargan | 16 | 0.856 | 3.2566 | 11.5768 | 7.5483 | 130.0983 |
| pytorch_struct | 200 | 0.4439 | 1.2864 | 1.8954 | 5.4421 | 123.7179 |
| Super_SloMo | 6 | 2.3215 | 7.1108 | nan | nan | 114.2637 |
| hf_Albert | 8 | 1.5003 | 8.7612 | nan | nan | 90.632 |
| hf_Reformer | 4 | 3.125 | 5.8245 | 14.0523 | nan | 81.3988 |
| hf_Bert | 4 | 5.2086 | 12.6031 | nan | nan | 78.8477 |
| hf_BigBird | 2 | 12.0533 | 20.4705 | nan | nan | 70.7791 |
| pytorch_unet | 1 | 1.1442 | 3.4413 | nan | 26.7315 | 67.999 |
| hf_DistilBert | 8 | 1.8271 | 5.3273 | nan | nan | 55.4364 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.841 | 3.214 | 12.2729 | 5.1571 | 38.082 |
| vgg16 | 64 | 0.3688 | 1.1116 | 4.3001 | 3.7105 | 37.4739 |
| alexnet | 128 | 0.2897 | 0.6979 | 2.0104 | 3.2319 | 27.3354 |
| drq | 1 | 0.2848 | 0.757 | nan | 4.4794 | 24.9175 |
| dcgan | 32 | 0.2625 | 0.6388 | 1.9613 | 4.319 | 19.5418 |
| nvidia_deeprecommender | 256 | 0.2826 | 0.6825 | 1.0449 | 3.0789 | 15.0583 |
| soft_actor_critic | 256 | 0.2699 | 0.4887 | 0.799 | 2.1142 | 14.7981 |
| lennard_jones | 1000 | 0.2467 | 0.5134 | 0.7056 | 1.5718 | 7.8075 |
| tts_angular | 64 | 0.3394 | 0.392 | 0.5831 | 1.1356 | 4.0722 |
| demucs | 4 | 0.9187 | 0.9051 | 0.889 | 0.9058 | 0.8242 |
| yolov3 | 16 | 7.5657 | 15.5034 | nan | 46.3505 | nan |
| hf_Longformer | 2 | 11.7639 | 21.6948 | 92.072 | nan | nan |
| hf_GPT2_large | 4 | 21.5811 | 42.1326 | nan | nan | nan |
| tacotron2 | 64 | 14.4366 | 30.0122 | nan | nan | nan |
| dlrm | 2048 | nan | 1.1963 | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| hf_Albert | 8 | 0.9814 | 0.936 | nan | nan | 1.1576 |
| Super_SloMo | 6 | 1.0024 | 0.9697 | nan | nan | 1.1385 |
| timm_nfnet | 128 | 0.9761 | 0.9043 | nan | 0.9504 | 1.0243 |
| tts_angular | 64 | 1.0015 | 1.0015 | 0.9866 | 1.0015 | 0.9908 |
| attention_is_all_you_need_pytorch | 256 | 0.9976 | 0.9403 | nan | nan | 0.9875 |
| demucs | 4 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 |
| timm_efficientdet | 1 | 1.0316 | 0.8425 | nan | nan | 0.9858 |
| BERT_pytorch | 16 | 0.9998 | 0.8818 | nan | nan | 0.9728 |
| timm_efficientnet | 32 | 0.9982 | 0.7762 | nan | 0.7936 | 0.9689 |
| hf_GPT2 | 4 | 0.971 | 0.8627 | nan | nan | 0.9645 |
| Background_Matting | 4 | 1.0201 | 0.9679 | nan | 0.987 | 0.9244 |
| speech_transformer | 32 | 1.0015 | 0.9177 | nan | nan | 0.9066 |
| mobilenet_v2 | 96 | 1.0001 | 0.7725 | nan | 0.9235 | 0.8856 |
| pytorch_unet | 1 | 0.9968 | 0.8677 | nan | 0.8518 | 0.8681 |
| fastNLP_Bert | 6 | 1.0013 | 0.8966 | nan | nan | 0.8661 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8624 | 0.2638 | 0.8441 | 0.8602 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8535 |
| hf_DistilBert | 8 | 0.9505 | 0.8806 | nan | nan | 0.8387 |
| hf_Bert | 4 | 0.9844 | 0.8677 | nan | nan | 0.8383 |
| timm_regnet | 32 | 0.9999 | 0.8483 | nan | 0.85 | 0.8361 |
| hf_Bart | 4 | 0.9099 | 0.8321 | nan | nan | 0.8151 |
| hf_BigBird | 2 | 0.9852 | 0.9787 | nan | nan | 0.81 |
| timm_vovnet | 32 | 0.9903 | 0.7754 | nan | 0.7817 | 0.7861 |
| moco | 32 | 0.9667 | nan | nan | nan | 0.7819 |
| shufflenet_v2_x1_0 | 128 | 1.0002 | 0.874 | nan | 0.8652 | 0.7813 |
| pytorch_stargan | 16 | 0.9929 | 0.9799 | 0.2149 | 0.8882 | 0.7783 |
| resnet50 | 32 | 1.0004 | 0.8678 | nan | 0.8041 | 0.7745 |
| dcgan | 32 | 1.0 | 0.7949 | 0.343 | 0.7073 | 0.7527 |
| vgg16 | 64 | 0.9998 | 0.7378 | 0.2978 | 0.7172 | 0.7491 |
| timm_vision_transformer_large | 8 | 0.9987 | 0.8365 | nan | 0.8491 | 0.7487 |
| alexnet | 128 | 1.0003 | 0.8082 | 0.4354 | 0.805 | 0.7352 |
| hf_T5 | 8 | 0.9678 | 0.9371 | nan | nan | 0.7266 |
| timm_resnest | 32 | 0.9868 | 0.8809 | nan | 0.8726 | 0.722 |
| timm_vision_transformer | 8 | 1.0001 | 0.8868 | nan | 0.8871 | 0.7151 |
| mnasnet1_0 | 32 | 0.9994 | 0.8793 | 0.173 | 0.8217 | 0.6596 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.2952 | 0.7589 | 0.6595 |
| mobilenet_v3_large | 32 | 0.999 | 0.8661 | nan | 0.874 | 0.6573 |
| resnext50_32x4d | 8 | 1.0 | 0.8591 | nan | 0.823 | 0.6515 |
| drq | 1 | 0.9125 | 0.8399 | nan | 0.8395 | 0.6406 |
| soft_actor_critic | 256 | 0.964 | 0.9151 | 0.4737 | 0.9151 | 0.6279 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | nan | 0.71 | 0.605 |
| densenet121 | 4 | 1.0 | 0.8696 | nan | 0.8376 | 0.5739 |
| resnet18 | 16 | 0.9782 | 0.7852 | nan | 0.7268 | 0.5644 |
| lennard_jones | 1000 | 1.0 | 1.0002 | 0.3735 | 1.0967 | 0.564 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5262 | 0.5596 | 0.5596 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8131 | nan | 0.846 | 0.4465 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5082 | 0.4235 |
| hf_Reformer | 4 | 0.3764 | 0.9993 | 0.2539 | nan | 0.3629 |
| yolov3 | 16 | 1.0054 | 0.8488 | nan | 0.8244 | nan |
| hf_Longformer | 2 | 0.9734 | 0.967 | 0.3379 | nan | nan |
| hf_GPT2_large | 4 | 0.9586 | 0.8649 | nan | nan | nan |
| dlrm | 2048 | nan | 0.7282 | nan | nan | nan |
| tacotron2 | 64 | 0.9879 | 0.4059 | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
~~~

huggingface suite with amp precision

Performance speedup

~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| MT5ForConditionalGeneration | 2 | 1.0217 | 0.8664 | 0.0 | 0.0 | 6.0266 |
| MobileBertForMaskedLM | 16 | 1.0165 | 0.8257 | 0.0 | 0.0 | 5.6755 |
| ElectraForCausalLM | 1 | 1.0352 | 0.8536 | 0.0 | 0.0 | 5.5645 |
| MobileBertForQuestionAnswering | 32 | 1.0175 | 0.8249 | 0.0 | 0.0 | 5.2401 |
| YituTechConvBert | 1 | 1.0261 | 0.8468 | 0.0 | 0.0 | 5.0492 |
| RobertaForCausalLM | 4 | 1.0398 | 0.8465 | 0.0 | 0.0 | 4.5969 |
| MegatronBertForCausalLM | 2 | 1.0374 | 0.8485 | 0.0 | 0.0 | 4.0218 |
| OPTForCausalLM | 4 | 1.0159 | 0.8276 | 0.0 | 0.0 | 3.9227 |
| M2M100ForConditionalGeneration | 2 | 1.0129 | 0.8218 | 0.0 | 0.0 | 3.6354 |
| CamemBert | 1 | 1.0388 | 0.859 | 0.0 | 0.0 | 3.5143 |
| PegasusForConditionalGeneration | 4 | 1.0118 | 0.8263 | 0.0 | 0.0 | 3.1923 |
| XGLMForCausalLM | 1 | 1.014 | 0.8144 | 0.0 | 0.0 | 3.1413 |
| PLBartForConditionalGeneration | 8 | 1.0194 | 0.8247 | 0.0 | 0.0 | 2.7305 |
| MegatronBertForQuestionAnswering | 8 | 1.0396 | 0.8582 | 0.0 | 0.0 | 2.7135 |
| DistillGPT2 | 1 | 1.0314 | 0.8704 | 0.0 | 0.0 | 2.619 |
| MBartForConditionalGeneration | 8 | 1.0167 | 0.8336 | 0.0 | 0.0 | 2.3299 |
| GPT2ForSequenceClassification | 4 | 0.9989 | 0.9767 | 0.0 | 0.0 | 2.1375 |
| Speech2Text2ForCausalLM | 64 | 1.0086 | 0.8555 | 0.0 | 0.0 | 2.108 |
| ElectraForQuestionAnswering | 64 | 0.9994 | 0.9793 | 0.0 | 0.0 | 1.9642 |
| TrOCRForCausalLM | 8 | 1.0149 | 0.8298 | 0.0 | 0.0 | 1.8799 |
| DistilBertForMaskedLM | 16 | 1.0299 | 0.8516 | 0.0 | 0.0 | 1.8406 |
| PegasusForCausalLM | 8 | 1.0109 | 0.826 | 0.0 | 0.0 | 1.8182 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0087 | 0.8891 | 0.0 | 0.0 | 1.7899 |
| BartForConditionalGeneration | 1 | 1.0133 | 0.885 | 0.0 | 0.0 | 1.7522 |
| DistilBertForQuestionAnswering | 32 | 1.0305 | 0.8437 | 0.0 | 0.0 | 1.7502 |
| LayoutLMForSequenceClassification | 16 | 0.9983 | 0.9785 | 0.0 | 0.0 | 1.7243 |
| T5ForConditionalGeneration | 4 | 0.9926 | 0.9361 | 0.0 | 0.0 | 1.6977 |
| AlbertForQuestionAnswering | 2 | 1.0007 | 0.8082 | 0.0 | 0.0 | 1.6669 |
| AlbertForMaskedLM | 2 | 1.0004 | 0.8087 | 0.0 | 0.0 | 1.6562 |
| T5Small | 1 | 1.0258 | 0.8963 | 0.0 | 0.0 | 1.593 |
| XLNetLMHeadModel | 4 | 1.0008 | 0.9632 | 0.0 | 0.0 | 1.5916 |
| LayoutLMForMaskedLM | 16 | 0.9981 | 0.9701 | 0.0 | 0.0 | 1.5762 |
| PLBartForCausalLM | 16 | 1.0127 | 0.9448 | 0.0 | 0.0 | 1.5065 |
| BartForCausalLM | 2 | 1.0008 | 0.9636 | 0.0 | 0.0 | 1.4564 |
| RobertaForQuestionAnswering | 64 | 0.9977 | 0.9494 | 0.0 | 0.0 | 1.4507 |
| BertForQuestionAnswering | 64 | 0.9971 | 0.9668 | 0.0 | 0.0 | 1.4373 |
| MBartForCausalLM | 16 | 1.0105 | 0.9317 | 0.0 | 0.0 | 1.3877 |
| BertForMaskedLM | 64 | 0.9972 | 0.9548 | 0.0 | 0.0 | 1.3316 |
| BlenderbotSmallForCausalLM | 64 | 1.0012 | 0.9233 | 0.0 | 0.0 | 1.3041 |
| DebertaForQuestionAnswering | 4 | 0.9317 | 0.7286 | 0.9211 | 0.0 | 1.2886 |
| BigBird | 1 | 0.9945 | 0.9116 | 0.0 | 0.0 | 1.1342 |
| DebertaForMaskedLM | 4 | 0.9325 | 0.7359 | 0.7806 | 0.0 | 1.1239 |
| AllenaiLongformerBase | 1 | 0.9529 | 0.7382 | 0.8569 | 0.0 | 0.0 |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy

~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BigBird | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| CamemBert | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistilBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistillGPT2 | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| ElectraForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MegatronBertForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PLBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PegasusForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| RobertaForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| M2M100ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | 0.0000 |
| XGLMForCausalLM | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
| XLNetLMHeadModel | 4 | 18.4058 | 40.7914 | nan | nan | 324.9091 |
| MobileBertForMaskedLM | 16 | 135.1179 | 174.0517 | nan | nan | 310.7379 |
| MobileBertForQuestionAnswering | 32 | 132.1921 | 174.2548 | nan | nan | 288.521 |
| T5ForConditionalGeneration | 4 | 4.1395 | 12.6828 | nan | nan | 247.4364 |
| M2M100ForConditionalGeneration | 2 | 26.4684 | 45.8915 | nan | nan | 214.2353 |
| MT5ForConditionalGeneration | 2 | 6.5618 | 21.1144 | nan | nan | 202.0306 |
| YituTechConvBert | 1 | 9.4912 | 20.8653 | nan | nan | 187.9041 |
| XGLMForCausalLM | 1 | 15.5894 | 30.5654 | nan | nan | 170.4506 |
| MBartForConditionalGeneration | 8 | 26.8296 | 47.6996 | nan | nan | 170.0447 |
| PegasusForConditionalGeneration | 4 | 26.2079 | 45.6003 | nan | nan | 167.4475 |
| DebertaForMaskedLM | 4 | 7.3099 | 14.5402 | 53.1345 | nan | 163.0197 |
| MegatronBertForQuestionAnswering | 8 | 17.0641 | 31.1438 | nan | nan | 161.2878 |
| BartForConditionalGeneration | 1 | 26.4428 | 45.956 | nan | nan | 152.167 |
| MegatronBertForCausalLM | 2 | 16.4797 | 31.8313 | nan | nan | 144.6966 |
| T5Small | 1 | 3.9891 | 12.5222 | nan | nan | 144.5762 |
| PLBartForConditionalGeneration | 8 | 7.4848 | 17.1203 | nan | nan | 130.1847 |
| BlenderbotSmallForConditionalGeneration | 32 | 12.1662 | 25.0348 | nan | nan | 124.0423 |
| DebertaForQuestionAnswering | 4 | 7.3726 | 14.8011 | 53.6215 | nan | 120.8276 |
| RobertaForCausalLM | 4 | 5.3202 | 12.7009 | nan | nan | 107.3699 |
| LayoutLMForSequenceClassification | 16 | 5.6057 | 12.9613 | nan | nan | 92.3443 |
| PegasusForCausalLM | 8 | 9.9424 | 16.9838 | nan | nan | 90.8327 |
| OPTForCausalLM | 4 | 4.8978 | 12.0564 | nan | nan | 86.7957 |
| BartForCausalLM | 2 | 10.3112 | 17.1423 | nan | nan | 83.3249 |
| MBartForCausalLM | 16 | 10.0524 | 17.3672 | nan | nan | 83.1659 |
| ElectraForQuestionAnswering | 64 | 5.2311 | 12.7727 | nan | nan | 82.9515 |
| BertForMaskedLM | 64 | 5.2735 | 12.6394 | nan | nan | 82.4964 |
| LayoutLMForMaskedLM | 16 | 5.5665 | 13.2244 | nan | nan | 80.9633 |
| GPT2ForSequenceClassification | 4 | 3.6189 | 10.3118 | nan | nan | 76.5278 |
| ElectraForCausalLM | 1 | 5.3414 | 12.6918 | nan | nan | 71.6347 |
| TrOCRForCausalLM | 8 | 10.3861 | 17.3261 | nan | nan | 71.0615 |
| BigBird | 1 | 11.5557 | 20.1771 | nan | nan | 69.8588 |
| DistilBertForQuestionAnswering | 32 | 1.9056 | 5.4488 | nan | nan | 66.5726 |
| CamemBert | 1 | 5.3005 | 12.5095 | nan | nan | 65.4941 |
| AlbertForMaskedLM | 2 | 1.5705 | 8.8811 | nan | nan | 65.1659 |
| BlenderbotSmallForCausalLM | 64 | 4.9754 | 9.5922 | nan | nan | 63.5161 |
| PLBartForCausalLM | 16 | 3.2396 | 6.8243 | nan | nan | 62.7263 |
| RobertaForQuestionAnswering | 64 | 5.1814 | 12.7826 | nan | nan | 61.2356 |
| BertForQuestionAnswering | 64 | 5.1795 | 12.5753 | nan | nan | 60.1564 |
| Speech2Text2ForCausalLM | 64 | 3.3817 | 6.9074 | nan | nan | 59.4116 |
| DistillGPT2 | 1 | 1.583 | 4.7509 | nan | nan | 58.5183 |
| DistilBertForMaskedLM | 16 | 2.0107 | 5.6044 | nan | nan | 50.1556 |
| AlbertForQuestionAnswering | 2 | 1.5747 | 8.7694 | nan | nan | 44.0509 |
| AllenaiLongformerBase | 1 | 12.4196 | 22.7413 | 92.7516 | nan | nan |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| GPT2ForSequenceClassification | 4 | 0.9675 | 0.9163 | nan | nan | 1.0699 |
| XLNetLMHeadModel | 4 | 0.9912 | 0.8791 | nan | nan | 1.0109 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9539 | nan | nan | 1.0002 |
| T5Small | 1 | 1.0 | 0.9124 | nan | nan | 0.9876 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | nan | nan | 0.9871 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | nan | nan | 0.9811 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | nan | nan | 0.9712 |
| BlenderbotSmallForConditionalGeneration | 32 | 0.9998 | 0.8996 | nan | nan | 0.9557 |
| BartForCausalLM | 2 | 1.0 | 0.8769 | nan | nan | 0.9545 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9594 | nan | nan | 0.9525 |
| Speech2Text2ForCausalLM | 64 | 0.9954 | 0.8265 | nan | nan | 0.9452 |
| PLBartForCausalLM | 16 | 1.0006 | 0.8667 | nan | nan | 0.9395 |
| BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | nan | nan | 0.9269 |
| BertForQuestionAnswering | 64 | 0.9995 | 0.9315 | nan | nan | 0.9256 |
| RobertaForQuestionAnswering | 64 | 0.9996 | 0.9315 | nan | nan | 0.9254 |
| DistilBertForMaskedLM | 16 | 0.9991 | 0.8698 | nan | nan | 0.9167 |
| BartForConditionalGeneration | 1 | 1.0 | 0.8619 | nan | nan | 0.881 |
| AlbertForQuestionAnswering | 2 | 1.0 | 0.6451 | nan | nan | 0.8636 |
| MBartForCausalLM | 16 | 1.0 | 0.8398 | nan | nan | 0.8565 |
| AlbertForMaskedLM | 2 | 1.0 | 0.6364 | nan | nan | 0.8515 |
| BigBird | 1 | 1.0024 | 0.9513 | nan | nan | 0.8349 |
| DistilBertForQuestionAnswering | 32 | 0.9987 | 0.8967 | nan | nan | 0.834 |
| PLBartForConditionalGeneration | 8 | 0.9999 | 0.8304 | nan | nan | 0.8252 |
| DistillGPT2 | 1 | 1.0006 | 0.7548 | nan | nan | 0.812 |
| MBartForConditionalGeneration | 8 | 0.9999 | 0.8187 | nan | nan | 0.7699 |
| TrOCRForCausalLM | 8 | 1.0 | 0.7955 | nan | nan | 0.7566 |
| CamemBert | 1 | 0.9989 | 0.7872 | nan | nan | 0.7482 |
| OPTForCausalLM | 4 | 0.9975 | 0.7501 | nan | nan | 0.7473 |
| YituTechConvBert | 1 | 0.9718 | 0.7819 | nan | nan | 0.7407 |
| PegasusForCausalLM | 8 | 0.999 | 0.9444 | nan | nan | 0.7324 |
| RobertaForCausalLM | 4 | 0.9237 | 0.7741 | nan | nan | 0.7309 |
| XGLMForCausalLM | 1 | 0.9999 | 0.9992 | nan | nan | 0.7214 |
| MegatronBertForQuestionAnswering | 8 | 0.9051 | 0.8218 | nan | nan | 0.7107 |
| MobileBertForMaskedLM | 16 | 0.9985 | 0.8983 | nan | nan | 0.6948 |
| PegasusForConditionalGeneration | 4 | 0.9996 | 0.9196 | nan | nan | 0.6769 |
| ElectraForCausalLM | 1 | 0.9993 | 0.8955 | nan | nan | 0.6701 |
| MegatronBertForCausalLM | 2 | 0.7726 | 0.7726 | nan | nan | 0.6697 |
| M2M100ForConditionalGeneration | 2 | 0.9999 | 0.9497 | nan | nan | 0.6569 |
| MobileBertForQuestionAnswering | 32 | 1.0142 | 0.9796 | nan | nan | 0.6265 |
| MT5ForConditionalGeneration | 2 | 0.6019 | 0.6019 | nan | nan | 0.6019 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9826 | 0.3599 | nan | 0.4498 |
| DebertaForQuestionAnswering | 4 | 0.979 | 1.0568 | 0.3578 | nan | 0.3761 |
| AllenaiLongformerBase | 1 | 0.9996 | 0.9477 | 0.3752 | nan | nan |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| hrnet_w18                       | 2   | 1.0028 | 0.9644    | 0.0            | 1.3794      | 4.8666   |
| res2net50_14w_8s                | 2   | 0.9994 | 0.9247    | 0.0            | 1.3968      | 4.7346   |
| res2next50                      | 2   | 1.0037 | 0.9304    | 0.0            | 1.362       | 4.6397   |
| twins_pcpvt_base                | 32  | 1.0024 | 0.8988    | 0.0            | 1.36        | 2.5347   |
| xcit_large_24_p8_224            | 5   | 1.0003 | 0.0       | 0.0            | 0.0         | 2.1071   |
| cait_m36_384                    | 2   | 1.0023 | 0.8557    | 0.0            | 1.3541      | 2.0791   |
| tnt_s_patch16_224               | 64  | 0.9994 | 0.9944    | 0.0            | 1.8326      | 1.9956   |
| ghostnet_100                    | 128 | 1.0031 | 1.0008    | 0.0            | 1.5591      | 1.893    |
| crossvit_9_240                  | 64  | 1.0051 | 0.9639    | 0.0            | 1.1374      | 1.7206   |
| gmixer_24_224                   | 64  | 0.9987 | 0.8853    | 0.0            | 1.0128      | 1.6752   |
| volo_d1_224                     | 64  | 0.9994 | 0.9941    | 0.0            | 1.1497      | 1.6642   |
| lcnet_050                       | 128 | 0.9678 | 0.9515    | 0.0            | 1.6064      | 1.6229   |
| nfnet_l0                        | 64  | 1.006  | 0.839     | 0.0            | 1.193       | 1.5908   |
| regnety_002                     | 128 | 0.981  | 0.933     | 0.0            | 1.3813      | 1.5766   |
| swin_base_patch4_window7_224    | 64  | 0.9992 | 0.9578    | 0.0            | 1.0465      | 1.5415   |
| coat_lite_mini                  | 128 | 1.0    | 0.9957    | 0.0            | 1.2651      | 1.4983   |
| resmlp_12_224                   | 128 | 1.0002 | 0.9982    | 0.7823         | 0.0         | 1.4718   |
| jx_nest_base                    | 32  | 0.9992 | 0.9917    | 0.0            | 1.2314      | 1.46     |
| resnest101e                     | 32  | 1.0043 | 0.9905    | 0.0            | 1.4192      | 1.4201   |
| gmlp_s16_224                    | 64  | 0.9989 | 0.983     | 0.0            | 1.0513      | 1.4139   |
| convit_base                     | 32  | 0.9994 | 0.9914    | 0.0            | 0.0         | 1.3895   |
| pit_b_224                       | 64  | 0.9995 | 0.9939    | 0.0            | 1.0686      | 1.3627   |
| dm_nfnet_f0                     | 128 | 0.9992 | 0.9992    | 0.0            | 1.1759      | 1.3014   |
| mixer_b16_224                   | 64  | 0.9992 | 0.9904    | 0.716          | 0.9657      | 1.2967   |
| beit_base_patch16_224           | 64  | 0.9996 | 0.9776    | 0.0            | 1.0503      | 1.2906   |
| deit_base_distilled_patch16_224 | 64  | 0.9996 | 0.9913    | 0.0            | 1.0703      | 1.2895   |
| adv_inception_v3                | 128 | 1.0    | 0.9952    | 0.0            | 1.1927      | 1.2253   |
| gluon_inception_v3              | 128 | 1.0    | 0.9946    | 0.0            | 1.194       | 1.2168   |
| inception_v3                    | 128 | 1.0    | 0.9952    | 0.0            | 1.1935      | 1.2139   |
| poolformer_m36                  | 64  | 0.9991 | 0.9974    | 0.0            | 0.0         | 1.2087   |
| vit_base_patch16_224            | 64  | 0.9997 | 0.9933    | 0.0            | 0.9995      | 1.1961   |
| tf_mixnet_l                     | 64  | 0.9832 | 0.8984    | 0.0            | 1.1168      | 1.1412   |
| mobilevit_s                     | 32  | 0.9752 | 0.7969    | 0.0            | 1.2175      | 1.1277   |
| mixnet_l                        | 64  | 0.9802 | 0.889     | 0.0            | 1.1177      | 1.0927   |
| visformer_small                 | 128 | 1.0003 | 1.0006    | 0.0            | 1.0867      | 1.0534   |
| pnasnet5large                   | 16  | 1.0052 | 1.0238    | 0.0            | 1.1323      | 1.0315   |
| dla102                          | 64  | 0.9994 | 1.0099    | 0.0            | 1.3742      | 1.0293   |
| fbnetv3_b                       | 128 | 0.9685 | 0.9577    | 0.0            | 1.2758      | 0.9577   |
| mnasnet_100                     | 128 | 0.9535 | 0.9394    | 0.6673         | 1.3679      | 0.9231   |
| repvgg_a2                       | 128 | 0.9416 | 0.9342    | 0.0            | 1.1287      | 0.9156   |
| selecsls42b                     | 128 | 0.9995 | 0.9942    | 0.0            | 1.356       | 0.8981   |
| tinynet_a                       | 128 | 0.9605 | 0.8048    | 0.0            | 1.0887      | 0.8876   |
| convmixer_768_32                | 32  | 0.9997 | 0.9979    | 0.0            | 1.0523      | 0.8863   |
| dpn107                          | 32  | 0.9485 | 0.9127    | 0.0            | 0.9813      | 0.8856   |
| cspdarknet53                    | 64  | 0.9432 | 0.935     | 0.0            | 0.9008      | 0.8791   |
| convnext_base                   | 32  | 1.0058 | 0.9438    | 0.0            | 1.3613      | 0.8489   |
| res2net101_26w_4s               | 64  | 1.0025 | 0.996     | 0.0            | 1.3914      | 0.8471   |
| mobilenetv3_large_100           | 128 | 0.9552 | 0.9437    | 0.0            | 1.3446      | 0.8334   |
| spnasnet_100                    | 128 | 0.9462 | 0.9369    | 0.6574         | 1.3183      | 0.8288   |
| gernet_l                        | 128 | 0.9466 | 0.9359    | 0.0            | 1.1389      | 0.7974   |
| fbnetc_100                      | 128 | 0.9525 | 0.9432    | 0.6733         | 1.3758      | 0.7479   |
| eca_halonext26ts                | 64  | 0.9639 | 0.8063    | 0.0            | 1.1003      | 0.7363   |
| sebotnet33ts_256                | 64  | 0.9669 | 0.8367    | 0.0            | 1.116       | 0.7274   |
| tf_efficientnet_b0              | 128 | 0.9642 | 0.8073    | 0.0            | 1.0953      | 0.7162   |
| eca_botnext26ts_256             | 64  | 0.9627 | 0.8009    | 0.0            | 1.1043      | 0.703    |
| botnet26t_256                   | 128 | 0.9783 | 0.9756    | 0.0            | 1.3439      | 0.6823   |
| mobilenetv2_100                 | 128 | 0.9498 | 0.9402    | 0.0            | 0.8656      | 0.6635   |
| ese_vovnet19b_dw                | 128 | 0.9693 | 0.965     | 0.0            | 1.2431      | 0.6551   |
| rexnet_100                      | 128 | 0.9775 | 0.8495    | 0.0            | 1.0358      | 0.6527   |
| swsl_resnext101_32x16d          | 32  | 0.9995 | 0.9796    | 0.0            | 1.0735      | 0.6428   |
| gluon_xception65                | 32  | 0.998  | 0.9783    | 0.0            | 1.0628      | 0.5736   |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| name                            | bs | eager | aot_eager     | aot_cudagraphs | aot_nvfuser   | inductor      |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| fbnetc_100                      | 2  | pass  | pass          | pass           | pass          | pass          |
| mnasnet_100                     | 2  | pass  | pass          | pass           | pass          | pass          |
| repvgg_a2                       | 2  | pass  | pass          | pass           | pass          | pass          |
| adv_inception_v3                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| beit_base_patch16_224           | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| botnet26t_256                   | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| convmixer_768_32                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| convnext_base                   | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| crossvit_9_240                  | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| cspdarknet53                    | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| deit_base_distilled_patch16_224 | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| dla102                          | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| dm_nfnet_f0                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| dpn107                          | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| eca_botnext26ts_256             | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| eca_halonext26ts                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| gernet_l                        | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| ghostnet_100                    | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| gluon_inception_v3              | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| inception_v3                    | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| lcnet_050                       | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| mixnet_l                        | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| mobilenetv2_100                 | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| mobilenetv3_large_100           | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| mobilevit_s                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| nfnet_l0                        | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| pnasnet5large                   | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| regnety_002                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| res2net101_26w_4s               | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| res2net50_14w_8s                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| res2next50                      | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| rexnet_100                      | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| sebotnet33ts_256                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| selecsls42b                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| swin_base_patch4_window7_224    | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| swsl_resnext101_32x16d          | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| tf_efficientnet_b0              | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| tf_mixnet_l                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| tinynet_a                       | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| tnt_s_patch16_224               | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| visformer_small                 | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| vit_base_patch16_224            | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| volo_d1_224                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| resmlp_12_224                   | 2  | pass  | pass          | pass           | fail_to_run   | pass          |
| convit_base                     | 2  | pass  | pass          | fail_to_run    | fail_to_run   | pass          |
| xcit_large_24_p8_224            | 2  | pass  | fail_to_run   | fail_to_run    | fail_to_run   | pass          |
| gmixer_24_224                   | 2  | pass  | pass          | pass           | fail_accuracy | pass          |
| gmlp_s16_224                    | 2  | pass  | pass          | pass           | fail_accuracy | pass          |
| mixer_b16_224                   | 2  | pass  | pass          | pass           | fail_accuracy | pass          |
| poolformer_m36                  | 2  | pass  | pass          | fail_to_run    | fail_accuracy | pass          |
| resnest101e                     | 2  | pass  | pass          | fail_to_run    | fail_accuracy | pass          |
| coat_lite_mini                  | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | pass          |
| jx_nest_base                    | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | pass          |
| pit_b_224                       | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | pass          |
| twins_pcpvt_base                | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | pass          |
| ese_vovnet19b_dw                | 2  | pass  | pass          | fail_to_run    | pass          | fail_accuracy |
| gluon_xception65                | 2  | pass  | pass          | fail_to_run    | pass          | fail_accuracy |
| hrnet_w18                       | 2  | pass  | pass          | fail_to_run    | pass          | fail_accuracy |
| spnasnet_100                    | 2  | pass  | pass          | pass           | fail_accuracy | fail_accuracy |
| fbnetv3_b                       | 2  | pass  | pass          | fail_to_run    | fail_accuracy | fail_accuracy |
| cait_m36_384                    | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
| name                            | bs  | eager   | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
| hrnet_w18                       | 2   | 97.6966 | 141.0268  | nan            | 477.271     | 1428.6487 |
| pnasnet5large                   | 16  | 59.9965 | 89.713    | nan            | 251.7262    | 1281.2421 |
| dpn107                          | 32  | 13.8456 | 28.5265   | nan            | 112.7519    | 1259.4166 |
| rexnet_100                      | 128 | 6.6675  | 14.2855   | nan            | 122.0738    | 1152.8284 |
| res2net50_14w_8s                | 2   | 20.0355 | 38.8849   | nan            | 123.4919    | 956.6994  |
| mobilevit_s                     | 32  | 6.1429  | 13.5761   | nan            | 62.1746     | 912.5899  |
| mixnet_l                        | 64  | 13.5148 | 22.8111   | nan            | 89.7195     | 839.4074  |
| eca_botnext26ts_256             | 64  | 2.6024  | 7.2121    | nan            | 64.9197     | 837.0591  |
| twins_pcpvt_base                | 32  | 26.7089 | 45.6423   | nan            | 99.6458     | 834.726   |
| ghostnet_100                    | 128 | 9.3711  | 18.8885   | nan            | 98.4747     | 771.2625  |
| tinynet_a                       | 128 | 7.7367  | 15.6026   | nan            | 84.8686     | 727.8874  |
| fbnetv3_b                       | 128 | 13.3606 | 24.3739   | nan            | 111.7964    | 698.534   |
| coat_lite_mini                  | 128 | 3.3191  | 9.0908    | nan            | 34.6481     | 679.6933  |
| resnest101e                     | 32  | 26.9489 | 47.6734   | nan            | 126.5945    | 658.9299  |
| dla102                          | 64  | 10.6743 | 22.7225   | nan            | 97.4741     | 630.0407  |
| fbnetc_100                      | 128 | 5.671   | 12.3828   | 89.3023        | 64.1864     | 627.2809  |
| sebotnet33ts_256                | 64  | 3.9369  | 9.979     | nan            | 70.2461     | 608.7544  |
| botnet26t_256                   | 128 | 2.5158  | 6.7876    | nan            | 51.1621     | 591.8039  |
| tf_mixnet_l                     | 64  | 14.0096 | 23.5027   | nan            | 89.8185     | 550.0596  |
| cspdarknet53                    | 64  | 6.2996  | 13.5792   | nan            | 45.3241     | 535.2317  |
| eca_halonext26ts                | 64  | 2.7292  | 7.5233    | nan            | 67.9176     | 531.045   |
| res2next50                      | 2   | 7.6466  | 17.1314   | nan            | 65.4278     | 518.6104  |
| tf_efficientnet_b0              | 128 | 6.0641  | 13.1288   | nan            | 83.8699     | 508.1979  |
| adv_inception_v3                | 128 | 8.67    | 18.8374   | nan            | 106.7887    | 469.8298  |
| mnasnet_100                     | 128 | 4.1978  | 9.7492    | 60.5217        | 53.9337     | 462.3385  |
| res2net101_26w_4s               | 64  | 25.9614 | 47.214    | nan            | 144.0171    | 451.5853  |
| swin_base_patch4_window7_224    | 64  | 12.4    | 26.9354   | nan            | 82.8153     | 424.9892  |
| regnety_002                     | 128 | 4.9105  | 10.8595   | nan            | 60.7642     | 413.4684  |
| nfnet_l0                        | 64  | 6.122   | 13.0335   | nan            | 40.1358     | 407.7362  |
| mobilenetv2_100                 | 128 | 4.2405  | 9.2914    | nan            | 43.8073     | 400.1731  |
| convnext_base                   | 32  | 12.0387 | 19.3516   | nan            | 47.4608     | 400.0663  |
| ese_vovnet19b_dw                | 128 | 2.0251  | 5.1077    | nan            | 40.0498     | 397.4036  |
| visformer_small                 | 128 | 2.3605  | 6.7356    | nan            | 32.1076     | 379.8813  |
| xcit_large_24_p8_224            | 5   | 37.1179 | nan       | nan            | nan         | 363.892   |
| mobilenetv3_large_100           | 128 | 4.5824  | 10.1168   | nan            | 86.4595     | 363.6031  |
| gluon_xception65                | 32  | 15.4767 | 29.189    | nan            | 78.4504     | 353.0086  |
| jx_nest_base                    | 32  | 9.7785  | 19.7435   | nan            | 59.9364     | 327.2779  |
| cait_m36_384                    | 2   | 48.1901 | 71.6923   | nan            | 107.6152    | 308.271   |
| poolformer_m36                  | 64  | 13.1268 | 21.8151   | nan            | nan         | 304.8976  |
| crossvit_9_240                  | 64  | 7.7826  | 17.0441   | nan            | 42.6455     | 293.8263  |
| gernet_l                        | 128 | 4.9774  | 11.7724   | nan            | 48.1579     | 285.8593  |
| selecsls42b                     | 128 | 2.4734  | 6.9182    | nan            | 52.4839     | 275.3577  |
| spnasnet_100                    | 128 | 5.6856  | 12.2802   | 81.6643        | 61.8244     | 262.9874  |
| lcnet_050                       | 128 | 2.0093  | 5.267     | nan            | 39.738      | 252.48    |
| gluon_inception_v3              | 128 | 8.4342  | 18.8111   | nan            | 107.1782    | 234.7551  |
| inception_v3                    | 128 | 8.4807  | 18.9464   | nan            | 107.6097    | 223.2148  |
| swsl_resnext101_32x16d          | 32  | 10.3929 | 22.1383   | nan            | 63.0214     | 217.9656  |
| volo_d1_224                     | 64  | 6.874   | 16.0245   | nan            | 45.317      | 210.5924  |
| convit_base                     | 32  | 4.1162  | 10.6181   | nan            | nan         | 190.8993  |
| pit_b_224                       | 64  | 3.9964  | 9.565     | nan            | 27.7656     | 183.1499  |
| tnt_s_patch16_224               | 64  | 12.6558 | 24.9605   | nan            | 49.1967     | 166.85    |
| gmlp_s16_224                    | 64  | 9.7371  | 17.5417   | nan            | 30.1711     | 149.1056  |
| repvgg_a2                       | 128 | 4.9392  | 10.6779   | nan            | 65.977      | 139.9949  |
| gmixer_24_224                   | 64  | 8.6395  | 17.5991   | nan            | 35.0491     | 131.8514  |
| dm_nfnet_f0                     | 128 | 6.6834  | 13.7387   | nan            | 42.7846     | 128.7499  |
| resmlp_12_224                   | 128 | 2.834   | 6.1399    | 9.9394         | nan         | 102.5117  |
| mixer_b16_224                   | 64  | 2.8878  | 7.212     | 16.3638        | 18.1333     | 100.1476  |
| convmixer_768_32                | 32  | 7.064   | 14.717    | nan            | 24.0237     | 85.642    |
| beit_base_patch16_224           | 64  | 4.6764  | 10.6156   | nan            | 22.3841     | 83.8798   |
| deit_base_distilled_patch16_224 | 64  | 3.1291  | 8.244     | nan            | 16.9322     | 80.9149   |
| vit_base_patch16_224            | 64  | 3.0647  | 7.9743    | nan            | 16.4046     | 70.3833   |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| gmixer_24_224                   | 64  | 1.0001 | 0.9563    | nan            | 0.8998      | 1.2577   |
| gmlp_s16_224                    | 64  | 1.0    | 0.9679    | nan            | 0.92        | 1.2405   |
| tinynet_a                       | 128 | 1.0001 | 0.7955    | nan            | 0.7958      | 1.1632   |
| pnasnet5large                   | 16  | 1.0583 | 0.9923    | nan            | 1.1741      | 1.1266   |
| eca_halonext26ts                | 64  | 0.999  | 0.7814    | nan            | 0.786       | 1.0889   |
| dm_nfnet_f0                     | 128 | 0.9758 | 0.9039    | nan            | 0.95        | 1.0616   |
| tnt_s_patch16_224               | 64  | 1.0    | 0.9718    | nan            | 0.9431      | 1.0587   |
| volo_d1_224                     | 64  | 1.0015 | 0.9518    | nan            | 0.8587      | 1.0378   |
| convit_base                     | 32  | 0.9991 | 0.86      | nan            | nan         | 1.0309   |
| beit_base_patch16_224           | 64  | 0.9999 | 0.9367    | nan            | 0.9298      | 1.0097   |
| mobilevit_s                     | 32  | 1.0    | 0.7722    | nan            | 0.787       | 1.0078   |
| rexnet_100                      | 128 | 0.9988 | 0.7919    | nan            | 0.8648      | 1.001    |
| dla102                          | 64  | 0.9998 | 0.9549    | nan            | 0.9751      | 0.9969   |
| pit_b_224                       | 64  | 1.0021 | 0.8074    | nan            | 0.8179      | 0.9856   |
| poolformer_m36                  | 64  | 1.0015 | 0.9462    | nan            | nan         | 0.9797   |
| convnext_base                   | 32  | 1.0065 | 0.908     | nan            | 0.7521      | 0.9564   |
| twins_pcpvt_base                | 32  | 0.9963 | 0.9079    | nan            | 0.8007      | 0.9553   |
| convmixer_768_32                | 32  | 0.9992 | 0.9807    | nan            | 0.9715      | 0.9513   |
| visformer_small                 | 128 | 0.9899 | 0.9353    | nan            | 0.8884      | 0.9341   |
| resnest101e                     | 32  | 1.0002 | 0.9762    | nan            | 0.9535      | 0.9292   |
| tf_mixnet_l                     | 64  | 0.9995 | 0.8624    | nan            | 0.8426      | 0.9291   |
| mixer_b16_224                   | 64  | 0.9929 | 0.9425    | 0.2532         | 0.7726      | 0.9225   |
| tf_efficientnet_b0              | 128 | 1.0006 | 0.7769    | nan            | 0.846       | 0.9189   |
| nfnet_l0                        | 64  | 0.9993 | 0.824     | nan            | 0.8257      | 0.9132   |
| mobilenetv2_100                 | 128 | 0.9992 | 0.7716    | nan            | 0.9249      | 0.8963   |
| vit_base_patch16_224            | 64  | 0.9955 | 0.9384    | nan            | 0.8801      | 0.8916   |
| deit_base_distilled_patch16_224 | 64  | 0.9944 | 0.9376    | nan            | 0.8794      | 0.8911   |
| mobilenetv3_large_100           | 128 | 0.9987 | 0.8562    | nan            | 0.8673      | 0.8885   |
| adv_inception_v3                | 128 | 1.0003 | 0.8759    | nan            | 0.8538      | 0.8829   |
| gluon_inception_v3              | 128 | 1.0003 | 0.8759    | nan            | 0.8538      | 0.8829   |
| inception_v3                    | 128 | 1.0003 | 0.8759    | nan            | 0.8538      | 0.8829   |
| gluon_xception65                | 32  | 1.0    | 0.8895    | nan            | 0.8854      | 0.8712   |
| dpn107                          | 32  | 0.9981 | 0.9115    | nan            | 0.8834      | 0.87     |
| selecsls42b                     | 128 | 0.9789 | 0.8913    | nan            | 0.8811      | 0.866    |
| fbnetv3_b                       | 128 | 1.0003 | 0.7918    | nan            | 0.7903      | 0.8647   |
| mixnet_l                        | 64  | 0.9989 | 0.8507    | nan            | 0.7796      | 0.8601   |
| spnasnet_100                    | 128 | 0.9988 | 0.8961    | 0.1651         | 0.8371      | 0.8599   |
| eca_botnext26ts_256             | 64  | 0.9998 | 0.7776    | nan            | 0.7813      | 0.8533   |
| swsl_resnext101_32x16d          | 32  | 1.0009 | 0.8805    | nan            | 0.8487      | 0.8523   |
| xcit_large_24_p8_224            | 5   | 0.9987 | nan       | nan            | nan         | 0.8489   |
| resmlp_12_224                   | 128 | 0.9827 | 0.9667    | 0.2637         | nan         | 0.845    |
| ghostnet_100                    | 128 | 1.0013 | 0.8903    | nan            | 0.9244      | 0.833    |
| coat_lite_mini                  | 128 | 1.0338 | 0.929     | nan            | 0.6593      | 0.8328   |
| ese_vovnet19b_dw                | 128 | 1.0    | 0.867     | nan            | 0.9146      | 0.8269   |
| cspdarknet53                    | 64  | 1.0    | 0.8467    | nan            | 0.7906      | 0.813    |
| cait_m36_384                    | 2   | 0.9998 | 0.8806    | nan            | 0.9023      | 0.8081   |
| jx_nest_base                    | 32  | 1.0    | 0.8945    | nan            | 0.86        | 0.8      |
| crossvit_9_240                  | 64  | 1.0008 | 0.8801    | nan            | 0.8854      | 0.7934   |
| res2net101_26w_4s               | 64  | 0.9999 | 0.9202    | nan            | 0.8569      | 0.7834   |
| mnasnet_100                     | 128 | 0.9993 | 0.8882    | 0.1669         | 0.8253      | 0.773    |
| swin_base_patch4_window7_224    | 64  | 0.9998 | 0.9234    | nan            | 0.8451      | 0.7676   |
| sebotnet33ts_256                | 64  | 0.9999 | 0.7108    | nan            | 0.7354      | 0.7449   |
| gernet_l                        | 128 | 0.9998 | 0.8655    | nan            | 0.83        | 0.7238   |
| fbnetc_100                      | 128 | 0.9984 | 0.8631    | 0.1626         | 0.7352      | 0.7104   |
| lcnet_050                       | 128 | 0.9992 | 0.7927    | nan            | 0.7885      | 0.705    |
| regnety_002                     | 128 | 0.9994 | 0.8284    | nan            | 0.7819      | 0.6971   |
| botnet26t_256                   | 128 | 1.0    | 0.8755    | nan            | 0.78        | 0.6615   |
| res2next50                      | 2   | 1.0    | 0.8301    | nan            | 0.8198      | 0.6012   |
| res2net50_14w_8s                | 2   | 1.0    | 0.8275    | nan            | 0.8169      | 0.5927   |
| hrnet_w18                       | 2   | 1.0    | 0.8383    | nan            | 0.8363      | 0.5746   |
| repvgg_a2                       | 128 | 1.0003 | 0.7971    | nan            | 0.6902      | 0.5572   |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Performance graphs

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/ZamDtc9.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/hBfg6RI.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/thgEpwZ.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs, and each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring performance, compilation latency and memory footprint reduction, we exclude the models that fail accuracy checks.
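The exclusion step can be sketched as below. This is a minimal illustration, not the dashboard's actual code: the function name `filter_passing` and the dictionary layout are hypothetical, and treating `pass_due_to_skip` as a passing status is an assumption. The two sample entries are taken from the torchbench float32 tables in this comment (inductor column).

```python
def filter_passing(metrics, accuracy):
    """Keep only metric entries whose model passed the accuracy check.

    metrics:  {model_name: measured value, e.g. speedup}
    accuracy: {model_name: status string from the accuracy table}
    Assumption: "pass_due_to_skip" is treated the same as "pass".
    """
    passing_statuses = {"pass", "pass_due_to_skip"}
    return {
        name: value
        for name, value in metrics.items()
        if accuracy.get(name) in passing_statuses
    }


# Inductor speedups and accuracy statuses for two torchbench models.
inductor_speedup = {"densenet121": 4.6393, "Background_Matting": 1.1164}
inductor_accuracy = {"densenet121": "pass", "Background_Matting": "fail_to_run"}

kept = filter_passing(inductor_speedup, inductor_accuracy)
print(kept)
```

With this filter, `Background_Matting` is dropped from the inductor aggregates even though a speedup number was recorded for it.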

Passrate

+----------------+-------------+-------------+-------------+
|    Compiler    | torchbench  | huggingface | timm_models |
+----------------+-------------+-------------+-------------+
|     eager      | 100%, 55/55 | 93%, 41/44  | 100%, 61/61 |
|   aot_eager    | 98%, 54/55  | 93%, 41/44  | 90%, 55/61  |
| aot_cudagraphs | 29%, 16/55  |  0%, 0/44   |  0%, 0/61   |
|  aot_nvfuser   | 62%, 34/55  |  2%, 1/44   | 82%, 50/61  |
|    inductor    | 87%, 48/55  | 77%, 34/44  | 74%, 45/61  |
+----------------+-------------+-------------+-------------+
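Each pass-rate cell above has the form `percent, passed/total`. A minimal sketch of how such a cell could be derived from a list of per-model accuracy statuses follows; the function name `pass_rate` is hypothetical, and counting `pass_due_to_skip` as a pass is an assumption rather than something stated in this comment.

```python
def pass_rate(statuses):
    """Format a pass-rate cell such as '93%, 41/44' from accuracy statuses.

    Assumption: both "pass" and "pass_due_to_skip" count as passing.
    """
    passing_statuses = {"pass", "pass_due_to_skip"}
    total = len(statuses)
    passed = sum(1 for status in statuses if status in passing_statuses)
    percent = 100.0 * passed / total if total else 0.0
    return f"{percent:.0f}%, {passed}/{total}"


# 41 passing models out of 44, as in the huggingface eager cell.
print(pass_rate(["pass"] * 41 + ["fail_to_run"] * 3))
```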

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.02x    |    0.0x     |    0.0x     |
|  aot_nvfuser   |   1.12x    |    1.13x    |    1.12x    |
|    inductor    |   1.37x    |    1.61x    |    1.24x    |
+----------------+------------+-------------+-------------+
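For reference, a geometric mean over per-model speedups can be computed as in the sketch below. The function name `geomean_speedup` is hypothetical. Skipping `0.0` and `nan` entries (which appear in the per-model tables for runs that did not produce a number) and returning `0.0` when nothing remains are assumptions chosen to mirror the `0.0x` cells above, not a description of the actual script.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups.

    Assumption: entries that are 0.0 or NaN mark failed runs and are skipped;
    if no valid entry remains, 0.0 is returned.
    """
    valid = [s for s in speedups if not math.isnan(s) and s > 0]
    if not valid:
        return 0.0
    # exp(mean(log(x))) avoids overflow from multiplying many values.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


print(f"{geomean_speedup([2.0, 8.0, 0.0]):.2f}x")
```

The geometric mean is the usual choice for averaging ratios, because a 2x speedup and a 0.5x slowdown then cancel to 1x.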

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    5.70    |    13.73    |    11.39    |
|   aot_eager    |   10.34    |    20.46    |    17.09    |
| aot_cudagraphs |    4.54    |     0.0     |     0.0     |
|  aot_nvfuser   |   21.31    |    10.74    |    57.51    |
|    inductor    |   265.33   |   111.78    |   417.22    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.87x    |    0.88x    |    0.88x    |
| aot_cudagraphs |   0.48x    |    0.0x     |    0.0x     |
|  aot_nvfuser   |   0.84x    |    1.08x    |    0.85x    |
|    inductor    |   0.79x    |    0.74x    |    0.89x    |
+----------------+------------+-------------+-------------+
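Since higher is better, the ratio is presumably the baseline peak memory divided by the backend's peak memory, so that values above 1 mean the backend used less memory than native PyTorch. The sketch below assumes that definition; the function name and the byte counts are hypothetical. In a real run, the peaks could be collected with something like `torch.cuda.max_memory_allocated()`.

```python
def compression_ratio(baseline_peak_bytes, backend_peak_bytes):
    """Peak memory compression ratio, assuming ratio = baseline / backend.

    A value above 1.0 means the backend's peak memory is lower than the
    baseline's; a value below 1.0 means it is higher.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return baseline_peak_bytes / backend_peak_bytes


# Hypothetical example: the backend peaks at 1.25x the baseline footprint.
print(f"{compression_ratio(8_000_000_000, 10_000_000_000):.2f}x")
```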

torchbench suite with float32 precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| name                              | bs   | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| densenet121                       | 4    | 1.0021 | 1.0072    | 0.0            | 1.4515      | 4.6393   |
| timm_efficientdet                 | 1    | 0.9831 | 0.8908    | 0.0            | 0.0         | 3.8674   |
| functorch_dp_cifar10              | 64   | 1.0019 | 0.9777    | 0.0            | 1.1919      | 3.6153   |
| timm_vision_transformer           | 8    | 1.003  | 0.923     | 0.0            | 1.3434      | 2.5786   |
| drq                               | 1    | 0.9972 | 0.8497    | 0.0            | 1.0702      | 2.4508   |
| BERT_pytorch                      | 16   | 1.0091 | 0.8721    | 0.0            | 0.0         | 1.855    |
| resnet18                          | 16   | 1.003  | 1.1147    | 0.0            | 1.4051      | 1.7636   |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.9993 | 0.938     | 1.1197         | 1.1919      | 1.729    |
| pytorch_struct                    | 200  | 0.9961 | 0.7502    | 0.8973         | 0.884       | 1.7059   |
| lennard_jones                     | 1000 | 0.9674 | 0.8486    | 1.0724         | 1.0278      | 1.667    |
| hf_Albert                         | 8    | 1.0012 | 0.995     | 0.0            | 0.0         | 1.6645   |
| squeezenet1_1                     | 32   | 0.9972 | 1.0037    | 0.9904         | 1.1563      | 1.6496   |
| dcgan                             | 32   | 0.9915 | 1.0198    | 1.109          | 1.1794      | 1.6235   |
| speech_transformer                | 32   | 1.0078 | 0.9013    | 0.0            | 0.0         | 1.4912   |
| timm_nfnet                        | 128  | 0.9995 | 1.0004    | 0.0            | 1.2113      | 1.4741   |
| hf_GPT2                           | 4    | 1.0129 | 0.9793    | 0.0            | 0.0         | 1.4269   |
| hf_T5_large                       | 2    | 1.0232 | 0.9244    | 0.0            | 0.0         | 1.4038   |
| resnext50_32x4d                   | 8    | 1.0017 | 1.0845    | 0.0            | 1.3674      | 1.4019   |
| fastNLP_Bert                      | 6    | 0.9991 | 0.9746    | 0.0            | 0.0         | 1.3537   |
| mobilenet_v3_large                | 32   | 1.0051 | 1.1141    | 0.0            | 1.3888      | 1.343    |
| soft_actor_critic                 | 256  | 0.9997 | 0.7922    | 1.0271         | 1.0222      | 1.2641   |
| LearningToPaint                   | 96   | 1.0027 | 1.0327    | 0.0            | 1.2377      | 1.262    |
| pytorch_unet                      | 1    | 0.9997 | 0.9987    | 0.0            | 1.0754      | 1.203    |
| hf_Bart                           | 4    | 1.0137 | 0.9696    | 0.0            | 0.0         | 1.1822   |
| vgg16                             | 64   | 0.9999 | 0.9984    | 0.7922         | 0.9965      | 1.1723   |
| Super_SloMo                       | 6    | 1.0001 | 0.9977    | 0.0            | 0.0         | 1.1704   |
| alexnet                           | 128  | 0.9993 | 0.9977    | 0.7784         | 1.0005      | 1.1646   |
| hf_Bert                           | 4    | 1.0249 | 1.0019    | 0.0            | 0.0         | 1.1577   |
| hf_DistilBert                     | 8    | 1.0009 | 0.9543    | 0.0            | 0.0         | 1.1516   |
| shufflenet_v2_x1_0                | 128  | 1.0001 | 1.0777    | 0.0            | 1.2258      | 1.1504   |
| mnasnet1_0                        | 32   | 1.0009 | 1.123     | 0.748          | 1.3056      | 1.1302   |
| pytorch_stargan                   | 16   | 0.9995 | 0.9825    | 0.7291         | 0.9891      | 1.1176   |
| Background_Matting                | 4    | 0.9996 | 1.0224    | 0.0            | 1.0822      | 1.1164   |
| hf_Reformer                       | 4    | 0.9965 | 0.0       | 0.894          | 0.0         | 1.1094   |
| timm_efficientnet                 | 32   | 0.9572 | 0.818     | 0.0            | 1.0643      | 1.095    |
| hf_BigBird                        | 2    | 0.9932 | 0.9458    | 0.0            | 0.0         | 1.0781   |
| timm_vision_transformer_large     | 8    | 0.9994 | 0.994     | 0.0            | 0.9828      | 1.052    |
| attention_is_all_you_need_pytorch | 256  | 0.997  | 0.9694    | 0.0            | 0.0         | 1.0474   |
| timm_resnest                      | 32   | 0.9994 | 1.002     | 0.0            | 1.1837      | 1.0351   |
| demucs                            | 4    | 0.9998 | 0.9992    | 1.0002         | 0.9996      | 0.9995   |
| mobilenet_v2_quantized_qat        | 96   | 0.9993 | 0.9991    | 0.9986         | 0.9989      | 0.9984   |
| resnet50_quantized_qat            | 32   | 0.9972 | 0.998     | 0.9985         | 0.998       | 0.998    |
| tts_angular                       | 64   | 0.9963 | 0.96      | 0.9962         | 0.9982      | 0.9919   |
| dlrm                              | 2048 | 1.0936 | 0.932     | 0.0            | 0.0         | 0.9396   |
| timm_vovnet                       | 32   | 0.9057 | 0.9046    | 0.0            | 0.9795      | 0.9172   |
| nvidia_deeprecommender            | 256  | 0.9994 | 0.9628    | 0.5849         | 0.9423      | 0.9044   |
| mobilenet_v2                      | 96   | 0.9996 | 0.9984    | 0.0            | 1.0439      | 0.865    |
| moco                              | 32   | 0.9926 | 1.045     | 0.0            | 0.0         | 0.8381   |
| resnet50                          | 32   | 0.9984 | 0.9932    | 0.0            | 1.1621      | 0.7785   |
| timm_regnet                       | 32   | 0.9649 | 0.9625    | 0.0            | 1.0943      | 0.7707   |
| yolov3                            | 16   | 0.9995 | 0.9943    | 0.0            | 1.1829      | 0.0      |
| hf_Longformer                     | 2    | 0.9693 | 0.901     | 0.8158         | 0.0         | 0.0      |
| hf_T5                             | 8    | 1.0007 | 0.9899    | 0.0            | 0.0         | 0.0      |
| hf_GPT2_large                     | 4    | 0.9996 | 0.9801    | 0.0            | 0.0         | 0.0      |
| tacotron2                         | 64   | 0.9808 | 0.8586    | 0.0            | 0.0         | 0.0      |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
| name                              | bs  | eager            | aot_eager        | aot_cudagraphs   | aot_nvfuser      | inductor         |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
| hf_GPT2_large                     | 2   | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large                       | 2   | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large     | 2   | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| alexnet                           | 2   | pass             | pass             | pass             | pass             | pass             |
| dcgan                             | 2   | pass             | pass             | pass             | pass             | pass             |
| demucs                            | 4   | pass             | pass             | pass             | pass             | pass             |
| lennard_jones                     | 2   | pass             | pass             | pass             | pass             | pass             |
| mnasnet1_0                        | 2   | pass             | pass             | pass             | pass             | pass             |
| mobilenet_v2_quantized_qat        | 2   | pass             | pass             | pass             | pass             | pass             |
| nvidia_deeprecommender            | 2   | pass             | pass             | pass             | pass             | pass             |
| pytorch_CycleGAN_and_pix2pix      | 1   | pass             | pass             | pass             | pass             | pass             |
| pytorch_stargan                   | 16  | pass             | pass             | pass             | pass             | pass             |
| pytorch_struct                    | 200 | pass             | pass             | pass             | pass             | pass             |
| resnet50_quantized_qat            | 2   | pass             | pass             | pass             | pass             | pass             |
| soft_actor_critic                 | 256 | pass             | pass             | pass             | pass             | pass             |
| squeezenet1_1                     | 2   | pass             | pass             | pass             | pass             | pass             |
| tts_angular                       | 2   | pass             | pass             | pass             | pass             | pass             |
| vgg16                             | 2   | pass             | pass             | pass             | pass             | pass             |
| LearningToPaint                   | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| densenet121                       | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| drq                               | 1   | pass             | pass             | fail_to_run      | pass             | pass             |
| functorch_dp_cifar10              | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| mobilenet_v2                      | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| mobilenet_v3_large                | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| pytorch_unet                      | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| resnet18                          | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| resnet50                          | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| resnext50_32x4d                   | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| shufflenet_v2_x1_0                | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| timm_efficientnet                 | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| timm_nfnet                        | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| timm_regnet                       | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| timm_resnest                      | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| timm_vision_transformer           | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| timm_vovnet                       | 2   | pass             | pass             | fail_to_run      | pass             | pass             |
| hf_Reformer                       | 2   | pass             | pass             | pass             | fail_to_run      | pass             |
| BERT_pytorch                      | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| Super_SloMo                       | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| attention_is_all_you_need_pytorch | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| dlrm                              | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| fastNLP_Bert                      | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| hf_Albert                         | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| hf_Bart                           | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| hf_Bert                           | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| hf_BigBird                        | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| hf_DistilBert                     | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| hf_GPT2                           | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| hf_T5                             | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| speech_transformer                | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| timm_efficientdet                 | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             |
| Background_Matting                | 4   | pass             | pass             | fail_to_run      | pass             | fail_to_run      |
| hf_Longformer                     | 2   | pass             | pass             | fail_to_run      | fail_to_run      | fail_to_run      |
| hf_T5_base                        | 2   | pass             | pass             | fail_to_run      | fail_to_run      | fail_to_run      |
| moco                              | 2   | pass             | pass             | fail_to_run      | fail_to_run      | fail_to_run      |
| tacotron2                         | 2   | pass             | pass             | fail_to_run      | fail_to_run      | fail_to_run      |
| vision_maskrcnn                   | 2   | pass             | pass             | fail_to_run      | fail_to_run      | fail_to_run      |
| yolov3                            | 2   | pass             | pass             | fail_to_run      | fail_to_run      | 0.0000           |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
| name                              | bs   | eager   | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
| timm_efficientdet                 | 1    | 51.5766 | 70.6045   | nan            | nan         | 1764.6785 |
| densenet121                       | 4    | 13.3012 | 25.1495   | nan            | 99.4384     | 1532.0131 |
| hf_T5_large                       | 2    | 35.6569 | 66.2515   | nan            | nan         | 1068.5147 |
| mnasnet1_0                        | 32   | 3.1714  | 6.9425    | 24.1489        | 33.4974     | 843.7609  |
| mobilenet_v3_large                | 32   | 3.5883  | 7.421     | nan            | 55.4373     | 787.0648  |
| moco                              | 32   | 11.1404 | 16.7881   | nan            | nan         | 677.8514  |
| mobilenet_v2                      | 96   | 3.0986  | 6.6705    | nan            | 39.0118     | 623.2404  |
| resnext50_32x4d                   | 8    | 3.3002  | 7.4339    | nan            | 30.9222     | 591.0255  |
| timm_efficientnet                 | 32   | 5.7511  | 10.4823   | nan            | 56.0236     | 539.9275  |
| shufflenet_v2_x1_0                | 128  | 3.5859  | 8.0994    | nan            | 29.641      | 449.5994  |
| squeezenet1_1                     | 32   | 0.6202  | 1.3239    | 3.539          | 4.885       | 366.1201  |
| timm_resnest                      | 32   | 1.3364  | 3.5203    | nan            | 35.8046     | 348.1886  |
| timm_regnet                       | 32   | 8.1136  | 14.0954   | nan            | 53.1497     | 317.6042  |
| timm_vovnet                       | 32   | 2.9071  | 6.1334    | nan            | 24.786      | 265.4777  |
| attention_is_all_you_need_pytorch | 256  | 4.266   | 10.1758   | nan            | nan         | 261.9771  |
| speech_transformer                | 32   | 7.2245  | 13.6655   | nan            | nan         | 251.9521  |
| functorch_dp_cifar10              | 64   | 0.7908  | 2.0897    | nan            | 5.4668      | 204.5091  |
| timm_vision_transformer           | 8    | 2.9851  | 6.2629    | nan            | 11.3289     | 196.1347  |
| LearningToPaint                   | 96   | 0.9587  | 2.4854    | nan            | 24.429      | 189.1747  |
| resnet18                          | 16   | 0.9185  | 2.4438    | nan            | 17.9014     | 185.4883  |
| timm_vision_transformer_large     | 8    | 22.2284 | 34.3611   | nan            | 44.8166     | 174.6423  |
| BERT_pytorch                      | 16   | 4.836   | 10.8222   | nan            | nan         | 174.2309  |
| hf_Bart                           | 4    | 7.2937  | 13.3922   | nan            | nan         | 150.9699  |
| resnet50                          | 32   | 3.2836  | 7.3932    | nan            | 34.4205     | 145.403   |
| pytorch_stargan                   | 16   | 0.7907  | 2.763     | 9.5307         | 4.3293      | 145.2698  |
| fastNLP_Bert                      | 6    | 4.9808  | 10.0575   | nan            | nan         | 142.9017  |
| Background_Matting                | 4    | 3.6956  | 7.4423    | nan            | 32.1955     | 141.231   |
| hf_GPT2                           | 4    | 3.5631  | 8.387     | nan            | nan         | 139.3171  |
| timm_nfnet                        | 128  | 6.4912  | 11.9484   | nan            | 34.2804     | 136.1473  |
| pytorch_struct                    | 200  | 0.4001  | 0.9359    | 1.4509         | 4.2146      | 103.788   |
| Super_SloMo                       | 6    | 2.116   | 5.8313    | nan            | nan         | 86.5013   |
| hf_Albert                         | 8    | 1.0841  | 5.7737    | nan            | nan         | 79.1676   |
| hf_Bert                           | 4    | 4.9073  | 9.6611    | nan            | nan         | 76.0375   |
| hf_Reformer                       | 4    | 3.011   | nan       | 13.0912        | nan         | 73.2447   |
| hf_BigBird                        | 2    | 10.8878 | 16.7952   | nan            | nan         | 58.7916   |
| pytorch_unet                      | 1    | 1.0526  | 2.7433    | nan            | 20.291      | 56.5606   |
| hf_DistilBert                     | 8    | 1.6504  | 3.9743    | nan            | nan         | 49.8976   |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.7386  | 2.59      | 7.9453         | 4.1358      | 31.9732   |
| vgg16                             | 64   | 0.3239  | 0.7723    | 2.3694         | 2.6377      | 19.7724   |
| drq                               | 1    | 0.2568  | 0.5426    | nan            | 3.49        | 19.6268   |
| dlrm                              | 2048 | 0.5936  | 0.9576    | nan            | nan         | 17.1468   |
| alexnet                           | 128  | 0.2564  | 0.5024    | 1.1934         | 2.4487      | 15.6905   |
| dcgan                             | 32   | 0.2503  | 0.5086    | 1.2065         | 3.791       | 15.4419   |
| nvidia_deeprecommender            | 256  | 0.255   | 0.4785    | 0.7806         | 2.4561      | 11.5989   |
| soft_actor_critic                 | 256  | 0.2525  | 0.3811    | 0.6593         | 1.5779      | 10.3899   |
| lennard_jones                     | 1000 | 0.2231  | 0.362     | 0.5064         | 1.1272      | 5.2309    |
| tts_angular                       | 64   | 0.3078  | 0.363     | 0.4981         | 1.0814      | 4.2127    |
| resnet50_quantized_qat            | 32   | 2.4789  | 2.5093    | 2.5295         | 2.4749      | 2.4968    |
| mobilenet_v2_quantized_qat        | 96   | 2.3837  | 2.3536    | 2.377          | 2.3057      | 2.2628    |
| demucs                            | 4    | 0.802   | 0.8072    | 0.8072         | 0.7996      | 0.7216    |
| yolov3                            | 16   | 7.2552  | 13.1212   | nan            | 47.2727     | nan       |
| hf_Longformer                     | 2    | 11.3734 | 19.0144   | 90.6872        | nan         | nan       |
| hf_GPT2_large                     | 4    | 21.1646 | 35.4272   | nan            | nan         | nan       |
| tacotron2                         | 64   | 14.0298 | 26.6327   | nan            | nan         | nan       |
| hf_T5                             | 8    | 3.8362  | 10.6544   | nan            | nan         | nan       |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| name                              | bs   | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| Super_SloMo                       | 6    | 1.0024 | 0.956     | nan            | nan         | 1.1857   |
| timm_efficientnet                 | 32   | 0.9998 | 0.7704    | nan            | 0.7845      | 1.0652   |
| timm_nfnet                        | 128  | 0.9393 | 0.897     | nan            | 0.9515      | 1.022    |
| timm_efficientdet                 | 1    | 1.0142 | 0.8251    | nan            | nan         | 1.0218   |
| resnet50_quantized_qat            | 32   | 0.9967 | 0.9967    | 0.9967         | 0.9967      | 1.0001   |
| mobilenet_v2_quantized_qat        | 96   | 0.9957 | 0.9957    | 0.9957         | 0.9957      | 0.9992   |
| mobilenet_v2                      | 96   | 0.9993 | 0.7661    | nan            | 0.7676      | 0.9975   |
| demucs                            | 4    | 0.9886 | 0.9886    | 0.9886         | 0.9886      | 0.9886   |
| tts_angular                       | 64   | 0.9884 | 0.9884    | 0.984          | 0.9884      | 0.9842   |
| hf_GPT2                           | 4    | 0.9548 | 0.887     | nan            | nan         | 0.9505   |
| Background_Matting                | 4    | 1.0026 | 0.952     | nan            | 0.9773      | 0.9139   |
| pytorch_stargan                   | 16   | 0.9975 | 1.019     | 0.2027         | 1.0085      | 0.9023   |
| speech_transformer                | 32   | 0.9988 | 0.9152    | nan            | nan         | 0.896    |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.9986 | 0.9194    | 0.2326         | 0.9141      | 0.8941   |
| hf_Albert                         | 8    | 0.9333 | 0.9333    | nan            | nan         | 0.8804   |
| pytorch_unet                      | 1    | 0.9985 | 0.8536    | nan            | 0.851       | 0.859    |
| hf_Bart                           | 4    | 0.9617 | 0.8786    | nan            | nan         | 0.853    |
| hf_Bert                           | 4    | 0.9683 | 0.8952    | nan            | nan         | 0.8517   |
| timm_regnet                       | 32   | 1.0013
| 0.8634 | nan | 0.8806 | 0.8481 | | shufflenet_v2_x1_0 | 128 | 1.0 | 0.9163 | nan | 0.8868 | 0.8447 | | fastNLP_Bert | 6 | 1.0012 | 0.9152 | nan | nan | 0.8343 | | attention_is_all_you_need_pytorch | 256 | 0.9481 | 0.9241 | nan | nan | 0.8264 | | timm_vovnet | 32 | 0.9933 | 0.7644 | nan | 0.7778 | 0.8252 | | BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 | | hf_T5_large | 2 | 0.922 | 0.8722 | nan | nan | 0.8237 | | hf_BigBird | 2 | 0.9609 | 0.9609 | nan | nan | 0.8205 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.2781 | 0.9742 | 0.8159 | | hf_DistilBert | 8 | 0.9212 | 0.9053 | nan | nan | 0.7841 | | dcgan | 32 | 1.0 | 0.7784 | 0.3321 | 0.7784 | 0.767 | | moco | 32 | 1.0067 | 0.9701 | nan | nan | 0.767 | | alexnet | 128 | 0.9998 | 0.7731 | 0.3805 | 0.7736 | 0.743 | | mnasnet1_0 | 32 | 0.9988 | 0.9087 | 0.1627 | 0.8348 | 0.7268 | | resnet50 | 32 | 1.0002 | 0.8763 | nan | 0.8011 | 0.7255 | | timm_vision_transformer_large | 8 | 1.0022 | 0.8433 | nan | 0.8015 | 0.7222 | | timm_vision_transformer | 8 | 1.0 | 0.8883 | nan | 0.8108 | 0.712 | | mobilenet_v3_large | 32 | 0.9958 | 0.8655 | nan | 0.8773 | 0.7041 | | dlrm | 2048 | 0.7282 | 0.7283 | nan | nan | 0.6973 | | timm_resnest | 32 | 0.9935 | 0.8869 | nan | 0.8075 | 0.6861 | | densenet121 | 4 | 1.0 | 0.8812 | nan | 0.8571 | 0.6617 | | resnext50_32x4d | 8 | 0.9994 | 0.8687 | nan | 0.8223 | 0.6614 | | vgg16 | 64 | 1.0 | 0.6663 | 0.2532 | 0.6664 | 0.6471 | | LearningToPaint | 96 | 0.9442 | 0.7168 | nan | 0.6504 | 0.6444 | | soft_actor_critic | 256 | 0.964 | 0.964 | 0.4356 | 0.9555 | 0.6428 | | drq | 1 | 0.8541 | 0.8541 | nan | 0.8541 | 0.6427 | | resnet18 | 16 | 0.9846 | 0.7907 | nan | 0.7038 | 0.6163 | | lennard_jones | 1000 | 1.0 | 1.0 | 0.3712 | 1.0947 | 0.5646 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4734 | 0.5598 | 0.5598 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | | functorch_dp_cifar10 | 64 | 0.9626 | 0.8251 | nan | 0.8254 | 0.4037 | | hf_Reformer | 4 | 0.3011 | nan | 
0.1803 | nan | 0.299 | | yolov3 | 16 | 1.0072 | 0.8533 | nan | 0.8915 | nan | | hf_Longformer | 2 | 0.9603 | 0.9603 | 0.288 | nan | nan | | tacotron2 | 64 | 0.9922 | 1.1046 | nan | nan | nan | | hf_T5 | 8 | 0.9527 | 0.9446 | nan | nan | nan | | hf_GPT2_large | 4 | 0.936 | 0.8771 | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ ~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name                                    | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| MT5ForConditionalGeneration             | 2  | 1.027  | 0.9168    | 0.0            | 0.0         | 4.3687   |
| ElectraForCausalLM                      | 1  | 1.0453 | 0.9369    | 0.0            | 0.0         | 4.1923   |
| YituTechConvBert                        | 1  | 1.0289 | 0.9299    | 0.0            | 0.0         | 3.4016   |
| MegatronBertForCausalLM                 | 2  | 1.0372 | 0.9357    | 0.0            | 0.0         | 2.8899   |
| M2M100ForConditionalGeneration          | 2  | 1.0114 | 0.9048    | 0.0            | 0.0         | 2.8587   |
| MobileBertForQuestionAnswering          | 32 | 1.0194 | 0.9122    | 0.0            | 0.0         | 2.7823   |
| MobileBertForMaskedLM                   | 16 | 1.0195 | 0.903     | 0.0            | 0.0         | 2.6125   |
| OPTForCausalLM                          | 4  | 1.0186 | 0.897     | 0.0            | 0.0         | 2.5828   |
| RobertaForCausalLM                      | 4  | 1.0437 | 0.9334    | 0.0            | 0.0         | 2.5069   |
| XGLMForCausalLM                         | 1  | 1.0146 | 0.8742    | 0.0            | 0.0         | 2.4941   |
| CamemBert                               | 1  | 1.0435 | 0.9498    | 0.0            | 0.0         | 2.2953   |
| PegasusForConditionalGeneration         | 4  | 1.0124 | 0.8918    | 0.0            | 0.0         | 2.0816   |
| DistillGPT2                             | 1  | 1.0299 | 0.9446    | 0.0            | 0.0         | 1.9655   |
| GoogleFnet                              | 1  | 1.0046 | 0.8137    | 0.0            | 1.1324      | 1.8265   |
| MegatronBertForQuestionAnswering        | 8  | 1.0398 | 0.9391    | 0.0            | 0.0         | 1.7417   |
| PLBartForConditionalGeneration          | 8  | 1.0168 | 0.9089    | 0.0            | 0.0         | 1.7175   |
| GPT2ForSequenceClassification           | 4  | 0.9988 | 0.9775    | 0.0            | 0.0         | 1.6644   |
| MBartForConditionalGeneration           | 8  | 1.0163 | 0.9134    | 0.0            | 0.0         | 1.4676   |
| XLNetLMHeadModel                        | 4  | 0.9998 | 0.9649    | 0.0            | 0.0         | 1.4274   |
| T5ForConditionalGeneration              | 4  | 0.9982 | 0.9723    | 0.0            | 0.0         | 1.3487   |
| TrOCRForCausalLM                        | 8  | 1.0117 | 0.9445    | 0.0            | 0.0         | 1.3447   |
| AlbertForQuestionAnswering              | 2  | 1.0    | 1.0001    | 0.0            | 0.0         | 1.3067   |
| AlbertForMaskedLM                       | 2  | 1.0006 | 0.9979    | 0.0            | 0.0         | 1.3      |
| DebertaForQuestionAnswering             | 4  | 0.9388 | 0.7464    | 0.794          | 0.0         | 1.2795   |
| LayoutLMForSequenceClassification       | 16 | 0.9994 | 0.9881    | 0.0            | 0.0         | 1.2534   |
| Speech2Text2ForCausalLM                 | 64 | 1.0101 | 0.9381    | 0.0            | 0.0         | 1.2338   |
| T5Small                                 | 1  | 1.022  | 0.9544    | 0.0            | 0.0         | 1.2217   |
| PegasusForCausalLM                      | 8  | 1.0118 | 0.92      | 0.0            | 0.0         | 1.2173   |
| BartForConditionalGeneration            | 1  | 1.0142 | 0.9898    | 0.0            | 0.0         | 1.2117   |
| DistilBertForQuestionAnswering          | 32 | 1.0293 | 0.9825    | 0.0            | 0.0         | 1.1948   |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0107 | 0.9413    | 0.0            | 0.0         | 1.1946   |
| DistilBertForMaskedLM                   | 16 | 1.0288 | 0.978     | 0.0            | 0.0         | 1.1572   |
| PLBartForCausalLM                       | 16 | 1.0098 | 0.9437    | 0.0            | 0.0         | 1.1312   |
| BartForCausalLM                         | 2  | 0.9998 | 0.9666    | 0.0            | 0.0         | 1.1055   |
| RobertaForQuestionAnswering             | 64 | 0.9986 | 0.9822    | 0.0            | 0.0         | 1.0941   |
| MBartForCausalLM                        | 16 | 1.01   | 0.9621    | 0.0            | 0.0         | 1.0884   |
| BigBird                                 | 1  | 0.9892 | 0.9347    | 0.0            | 0.0         | 1.0879   |
| BertForQuestionAnswering                | 64 | 0.9987 | 0.981     | 0.0            | 0.0         | 1.0865   |
| BertForMaskedLM                         | 64 | 0.9988 | 0.9623    | 0.0            | 0.0         | 1.0409   |
| DebertaForMaskedLM                      | 4  | 0.9388 | 0.8149    | 0.7231         | 0.0         | 1.0161   |
| BlenderbotSmallForCausalLM              | 64 | 1.001  | 0.9085    | 0.0            | 0.0         | 1.008    |
| AllenaiLongformerBase                   | 1  | 0.9551 | 0.8695    | 0.7833         | 0.0         | 0.0      |
| ElectraForQuestionAnswering             | 64 | 0.999  | 0.9853    | 0.0            | 0.0         | 0.0      |
| LayoutLMForMaskedLM                     | 16 | 0.9991 | 0.9699    | 0.0            | 0.0         | 0.0      |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
| name                                    | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
| GoogleFnet                              | 1  | pass   | pass      | fail_to_run    | pass        | pass        |
| BartForCausalLM                         | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| BertForMaskedLM                         | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| BertForQuestionAnswering                | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| BigBird                                 | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| BlenderbotSmallForCausalLM              | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| BlenderbotSmallForConditionalGeneration | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| CamemBert                               | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| DebertaForMaskedLM                      | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| DebertaForQuestionAnswering             | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| DistilBertForMaskedLM                   | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| DistilBertForQuestionAnswering          | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| DistillGPT2                             | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| ElectraForCausalLM                      | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| ElectraForQuestionAnswering             | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| GPT2ForSequenceClassification           | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| LayoutLMForMaskedLM                     | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| LayoutLMForSequenceClassification       | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| MBartForCausalLM                        | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| MT5ForConditionalGeneration             | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| MegatronBertForCausalLM                 | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| MegatronBertForQuestionAnswering        | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| MobileBertForMaskedLM                   | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| MobileBertForQuestionAnswering          | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| OPTForCausalLM                          | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| PLBartForCausalLM                       | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| PegasusForCausalLM                      | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| PegasusForConditionalGeneration         | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| RobertaForCausalLM                      | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| RobertaForQuestionAnswering             | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| Speech2Text2ForCausalLM                 | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| T5ForConditionalGeneration              | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| T5Small                                 | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| TrOCRForCausalLM                        | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| XLNetLMHeadModel                        | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| YituTechConvBert                        | 1  | pass   | pass      | fail_to_run    | fail_to_run | pass        |
| AlbertForMaskedLM                       | 1  | pass   | pass      | fail_to_run    | fail_to_run | fail_to_run |
| AlbertForQuestionAnswering              | 1  | pass   | pass      | fail_to_run    | fail_to_run | fail_to_run |
| AllenaiLongformerBase                   | 1  | pass   | pass      | fail_to_run    | fail_to_run | fail_to_run |
| MBartForConditionalGeneration           | 1  | pass   | pass      | fail_to_run    | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration          | 1  | pass   | pass      | fail_to_run    | fail_to_run | fail_to_run |
| BartForConditionalGeneration            | 0  | 0.0000 | 0.0000    | 0.0000         | 0.0000      | 0.0000      |
| M2M100ForConditionalGeneration          | 0  | 0.0000 | 0.0000    | 0.0000         | 0.0000      | 0.0000      |
| XGLMForCausalLM                         | 0  | 0.0000 | 0.0000    | 0.0000         | 0.0000      | 0.0000      |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
| name                                    | bs | eager    | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
| XLNetLMHeadModel                        | 4  | 17.8864  | 36.3251   | nan            | nan         | 317.1367 |
| MobileBertForMaskedLM                   | 16 | 135.2893 | 155.4058  | nan            | nan         | 271.7268 |
| MobileBertForQuestionAnswering          | 32 | 133.5434 | 156.8102  | nan            | nan         | 252.473  |
| M2M100ForConditionalGeneration          | 2  | 25.5586  | 37.8759   | nan            | nan         | 222.3239 |
| MT5ForConditionalGeneration             | 2  | 6.4161   | 16.6703   | nan            | nan         | 179.1136 |
| YituTechConvBert                        | 1  | 8.9448   | 16.5143   | nan            | nan         | 176.3369 |
| T5ForConditionalGeneration              | 4  | 3.7734   | 10.937    | nan            | nan         | 175.8095 |
| XGLMForCausalLM                         | 1  | 15.1297  | 24.758    | nan            | nan         | 168.856  |
| MBartForConditionalGeneration           | 8  | 26.0665  | 38.8293   | nan            | nan         | 168.5349 |
| PegasusForConditionalGeneration         | 4  | 25.6046  | 38.5024   | nan            | nan         | 157.4908 |
| DebertaForMaskedLM                      | 4  | 7.1369   | 13.2473   | 49.7312        | nan         | 149.1406 |
| BartForConditionalGeneration            | 1  | 25.5661  | 37.9984   | nan            | nan         | 148.0126 |
| MegatronBertForQuestionAnswering        | 8  | 16.2073  | 25.7688   | nan            | nan         | 137.4447 |
| MegatronBertForCausalLM                 | 2  | 16.236   | 26.2057   | nan            | nan         | 136.8413 |
| BlenderbotSmallForConditionalGeneration | 32 | 11.9456  | 19.9424   | nan            | nan         | 134.1319 |
| T5Small                                 | 1  | 3.7531   | 10.71     | nan            | nan         | 133.5697 |
| PLBartForConditionalGeneration          | 8  | 7.3286   | 13.74     | nan            | nan         | 132.5687 |
| DebertaForQuestionAnswering             | 4  | 6.9868   | 12.9985   | 50.6366        | nan         | 114.6722 |
| RobertaForCausalLM                      | 4  | 5.2682   | 9.8593    | nan            | nan         | 100.9032 |
| LayoutLMForSequenceClassification       | 16 | 5.1824   | 9.9437    | nan            | nan         | 92.2545  |
| PegasusForCausalLM                      | 8  | 9.8456   | 14.438    | nan            | nan         | 88.2178  |
| MBartForCausalLM                        | 16 | 9.8451   | 14.2179   | nan            | nan         | 85.4066  |
| OPTForCausalLM                          | 4  | 4.6586   | 9.5188    | nan            | nan         | 77.5511  |
| BertForMaskedLM                         | 64 | 4.9281   | 9.7456    | nan            | nan         | 77.007   |
| GPT2ForSequenceClassification           | 4  | 3.4782   | 8.0937    | nan            | nan         | 76.4033  |
| BartForCausalLM                         | 2  | 9.6334   | 14.23     | nan            | nan         | 76.2828  |
| ElectraForCausalLM                      | 1  | 5.0797   | 9.7233    | nan            | nan         | 72.6091  |
| TrOCRForCausalLM                        | 8  | 10.038   | 14.4735   | nan            | nan         | 70.3343  |
| BlenderbotSmallForCausalLM              | 64 | 4.7331   | 7.8131    | nan            | nan         | 68.4415  |
| Speech2Text2ForCausalLM                 | 64 | 3.1545   | 5.4563    | nan            | nan         | 65.9358  |
| DistillGPT2                             | 1  | 1.4438   | 3.7992    | nan            | nan         | 63.1728  |
| PLBartForCausalLM                       | 16 | 3.2604   | 5.7169    | nan            | nan         | 61.9116  |
| BertForQuestionAnswering                | 64 | 4.8664   | 9.6553    | nan            | nan         | 60.3864  |
| DistilBertForQuestionAnswering          | 32 | 1.7088   | 4.0654    | nan            | nan         | 60.3565  |
| CamemBert                               | 1  | 5.0927   | 9.6565    | nan            | nan         | 59.6166  |
| RobertaForQuestionAnswering             | 64 | 4.8469   | 9.7659    | nan            | nan         | 59.427   |
| BigBird                                 | 1  | 10.8289  | 16.7412   | nan            | nan         | 58.8768  |
| AlbertForMaskedLM                       | 2  | 1.2433   | 5.8391    | nan            | nan         | 56.5995  |
| AlbertForQuestionAnswering              | 2  | 1.2235   | 5.7785    | nan            | nan         | 47.9866  |
| DistilBertForMaskedLM                   | 16 | 1.7344   | 4.111     | nan            | nan         | 46.6191  |
| GoogleFnet                              | 1  | 1.9789   | 4.2376    | nan            | 10.744      | 42.907   |
| AllenaiLongformerBase                   | 1  | 11.4511  | 19.2509   | 86.117         | nan         | nan      |
| LayoutLMForMaskedLM                     | 16 | 5.5414   | 10.3348   | nan            | nan         | nan      |
| ElectraForQuestionAnswering             | 64 | 4.8934   | 9.6669    | nan            | nan         | nan      |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name                                    | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| GPT2ForSequenceClassification           | 4  | 0.9342 | 0.9091    | nan            | nan         | 1.0318   |
| XLNetLMHeadModel                        | 4  | 1.0001 | 0.8976    | nan            | nan         | 0.9717   |
| LayoutLMForSequenceClassification       | 16 | 1.0    | 0.9348    | nan            | nan         | 0.9339   |
| BertForQuestionAnswering                | 64 | 1.0    | 0.9467    | nan            | nan         | 0.9145   |
| RobertaForQuestionAnswering             | 64 | 1.0    | 0.9467    | nan            | nan         | 0.9145   |
| T5Small                                 | 1  | 1.0    | 0.9325    | nan            | nan         | 0.8445   |
| DistilBertForQuestionAnswering          | 32 | 1.0    | 0.9046    | nan            | nan         | 0.8394   |
| BertForMaskedLM                         | 64 | 1.0    | 0.9219    | nan            | nan         | 0.8321   |
| BartForCausalLM                         | 2  | 1.0    | 0.8847    | nan            | nan         | 0.8303   |
| BigBird                                 | 1  | 1.0001 | 0.9549    | nan            | nan         | 0.8224   |
| DistilBertForMaskedLM                   | 16 | 0.9998 | 0.9138    | nan            | nan         | 0.8055   |
| PLBartForCausalLM                       | 16 | 0.9997 | 0.8802    | nan            | nan         | 0.8028   |
| MBartForCausalLM                        | 16 | 1.0    | 0.8629    | nan            | nan         | 0.8005   |
| DistillGPT2                             | 1  | 1.0003 | 0.7721    | nan            | nan         | 0.7997   |
| Speech2Text2ForCausalLM                 | 64 | 1.0    | 0.88      | nan            | nan         | 0.7767   |
| T5ForConditionalGeneration              | 4  | 1.0    | 0.9597    | nan            | nan         | 0.7754   |
| XGLMForCausalLM                         | 1  | 0.9999 | 0.9999    | nan            | nan         | 0.7728   |
| BartForConditionalGeneration            | 1  | 1.0    | 0.8465    | nan            | nan         | 0.7708   |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0    | 0.9036    | nan            | nan         | 0.7612   |
| PLBartForConditionalGeneration          | 8  | 0.9997 | 0.8222    | nan            | nan         | 0.7547   |
| CamemBert                               | 1  | 0.998  | 0.7977    | nan            | nan         | 0.7369   |
| YituTechConvBert                        | 1  | 0.9858 | 0.7923    | nan            | nan         | 0.7298   |
| TrOCRForCausalLM                        | 8  | 1.0    | 0.8048    | nan            | nan         | 0.7284   |
| BlenderbotSmallForCausalLM              | 64 | 1.0    | 0.8401    | nan            | nan         | 0.7277   |
| MBartForConditionalGeneration           | 8  | 1.0    | 0.8137    | nan            | nan         | 0.727    |
| OPTForCausalLM                          | 4  | 0.9979 | 0.75      | nan            | nan         | 0.714    |
| RobertaForCausalLM                      | 4  | 0.9058 | 0.7778    | nan            | nan         | 0.7099   |
| PegasusForCausalLM                      | 8  | 1.0    | 0.9323    | nan            | nan         | 0.7012   |
| MegatronBertForQuestionAnswering        | 8  | 0.923  | 0.8265    | nan            | nan         | 0.6997   |
| GoogleFnet                              | 1  | 1.0003 | 0.9447    | nan            | 1.0813      | 0.6953   |
| M2M100ForConditionalGeneration          | 2  | 0.9795 | 0.979     | nan            | nan         | 0.6702   |
| MegatronBertForCausalLM                 | 2  | 0.7066 | 0.7066    | nan            | nan         | 0.6453   |
| PegasusForConditionalGeneration         | 4  | 0.9721 | 0.9004    | nan            | nan         | 0.642    |
| MT5ForConditionalGeneration             | 2  | 0.6173 | 0.6173    | nan            | nan         | 0.6173   |
| AlbertForQuestionAnswering              | 2  | 1.0    | 0.9369    | nan            | nan         | 0.6126   |
| ElectraForCausalLM                      | 1  | 1.0    | 0.9107    | nan            | nan         | 0.6123   |
| AlbertForMaskedLM                       | 2  | 0.9999 | 0.9172    | nan            | nan         | 0.6027   |
| MobileBertForMaskedLM                   | 16 | 0.9997 | 0.9179    | nan            | nan         | 0.5861   |
| MobileBertForQuestionAnswering          | 32 | 1.0    | 0.9716    | nan            | nan         | 0.4668   |
| DebertaForMaskedLM                      | 4  | 1.0    | 0.9851    | 0.352          | nan         | 0.4265   |
| DebertaForQuestionAnswering             | 4  | 0.9845 | 1.0525    | 0.3277         | nan         | 0.3569   |
| AllenaiLongformerBase                   | 1  | 0.9988 | 0.9515    | 0.3144         | nan         | nan      |
| ElectraForQuestionAnswering             | 64 | 1.0    | 0.9524    | nan            | nan         | nan      |
| LayoutLMForMaskedLM                     | 16 | 1.0    | 0.9409    | nan            | nan         | nan      |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| res2net50_14w_8s                | 2   | 0.9983 | 1.027     | 0.0            | 1.4439      | 4.7917   |
| hrnet_w18                       | 2   | 1.0076 | 1.0877    | 0.0            | 1.4906      | 4.6235   |
| res2next50                      | 2   | 1.0034 | 1.0445    | 0.0            | 1.3722      | 4.1476   |
| coat_lite_mini                  | 128 | 1.0    | 0.9994    | 0.0            | 1.0739      | 1.7094   |
| ghostnet_100                    | 128 | 0.9985 | 0.9939    | 0.0            | 1.249       | 1.5956   |
| tnt_s_patch16_224               | 64  | 0.9997 | 0.9961    | 0.0            | 1.5683      | 1.5095   |
| twins_pcpvt_base                | 32  | 1.0037 | 0.9738    | 0.0            | 1.3525      | 1.4376   |
| xcit_large_24_p8_224            | 5   | 1.0006 | 0.9883    | 0.0            | 0.0         | 1.4149   |
| crossvit_9_240                  | 64  | 1.0049 | 0.9992    | 0.0            | 1.0961      | 1.405    |
| volo_d1_224                     | 64  | 0.9995 | 0.9952    | 0.0            | 1.1385      | 1.3979   |
| nfnet_l0                        | 64  | 0.9996 | 0.7979    | 0.0            | 1.0535      | 1.3819   |
| gmixer_24_224                   | 64  | 0.999  | 0.8428    | 0.0            | 0.9942      | 1.3536   |
| jx_nest_base                    | 32  | 0.9995 | 0.9942    | 0.0            | 1.2243      | 1.2913   |
| lcnet_050                       | 128 | 0.9564 | 0.9466    | 0.0            | 1.5001      | 1.2739   |
| convit_base                     | 32  | 0.9992 | 0.9931    | 0.0            | 1.1944      | 1.2661   |
| convnext_base                   | 32  | 0.9994 | 0.994     | 0.0            | 1.0411      | 1.2019   |
| cait_m36_384                    | 2   | 0.9981 | 0.9894    | 0.0            | 0.9966      | 1.196    |
| gmlp_s16_224                    | 64  | 0.9989 | 0.9964    | 0.0            | 0.9982      | 1.1454   |
| beit_base_patch16_224           | 64  | 0.9998 | 0.9743    | 0.0            | 0.9541      | 1.1235   |
| deit_base_distilled_patch16_224 | 64  | 0.9997 | 0.998     | 0.0            | 1.0189      | 1.1047   |
| regnety_002                     | 128 | 0.9778 | 0.9883    | 0.0            | 1.3588      | 1.101    |
| vit_base_patch16_224            | 64  | 0.9998 | 0.9982    | 0.0            | 0.9778      | 1.0942   |
| mixer_b16_224                   | 64  | 0.9997 | 0.9973    | 0.0            | 0.9836      | 1.0789   |
| tf_mixnet_l                     | 64  | 0.9714 | 0.8744    | 0.0            | 1.0062      | 1.0438   |
| resmlp_12_224                   | 128 | 0.9998 | 0.9997    | 0.0            | 0.0         | 1.0094   |
| mixnet_l                        | 64  | 0.9707 | 0.8727    | 0.0            | 1.0055      | 1.0017   |
| dpn107                          | 32  | 0.9584 | 0.9514    | 0.0            | 1.029       | 0.9988   |
| dla102                          | 64  | 0.9992 | 0.9967    | 0.0            | 1.2857      | 0.9897   |
| gernet_l                        | 128 | 0.9739 | 0.969     | 0.0            | 1.0979      | 0.9142   |
| resnest101e                     | 32  | 1.0011 | 1.018     | 0.0            | 1.204       | 0.9009   |
| repvgg_a2                       | 128 | 0.9634 | 0.9621    | 0.0            | 1.1211      | 0.8987   |
| mobilevit_s                     | 32  | 0.9749 | 0.7654    | 0.0            | 0.9566      | 0.8956   |
| visformer_small                 | 128 | 1.0001 | 1.0006    | 0.0            | 1.0204      | 0.8732   |
| selecsls42b                     | 128 | 0.9998 | 0.9983    | 0.0            | 1.2088      | 0.8727   |
| cspdarknet53                    | 64  | 0.9586 | 0.9504    | 0.0            | 1.1831      | 0.8635   |
| mnasnet_100                     | 128 | 0.9646 | 0.9634    | 0.0            | 1.1533      | 0.8582   |
| fbnetv3_b                       | 128 | 0.9648 | 0.9584    | 0.0            | 1.1334      | 0.8559   |
| sebotnet33ts_256                | 64  | 0.9761 | 0.8072    | 0.0            | 1.0537      | 0.8532   |
| tinynet_a                       | 128 | 0.9662 | 0.7755    | 0.0            | 0.9712      | 0.8438   |
| mobilenetv3_large_100           | 128 | 0.9659 | 0.9624    | 0.0            | 1.1625      | 0.793    |
| res2net101_26w_4s               | 64  | 0.9987 | 0.9969    | 0.0            | 1.1757      | 0.7829   |
| tf_efficientnet_b0              | 128 | 0.9763 | 0.7833    | 0.0            | 0.9849      | 0.7726   |
| spnasnet_100                    | 128 | 0.961  | 0.9581    | 0.0            | 1.1386      | 0.7679   |
| eca_halonext26ts                | 64  | 0.9745 | 0.7769    | 0.0            | 1.0166      | 0.7612   |
| fbnetc_100                      | 128 | 0.9657 | 0.9619    | 0.0            | 1.1839      | 0.7582   |
| mobilenetv2_100                 | 128 | 0.9666 | 0.9604    | 0.0            | 1.0141      | 0.699    |
| eca_botnext26ts_256             | 64  | 0.9736 | 0.7695    | 0.0            | 1.0172      | 0.6956   |
| rexnet_100                      | 128 | 0.9729 | 0.8138    | 0.0            | 0.983       | 0.6949   |
| ese_vovnet19b_dw                | 128 | 0.9788 | 0.9775    | 0.0            | 1.1442      | 0.6341   |
| botnet26t_256                   | 128 | 0.9849 | 0.985     | 0.0            | 1.2249      | 0.0      |
| dm_nfnet_f0                     | 128 | 0.9998 | 0.9994    | 0.0            | 1.2112      | 0.0      |
| adv_inception_v3                | 128 | 1.0    | 0.9987    | 0.0            | 1.1247      | 0.0      |
| inception_v3                    | 128 | 1.0    | 0.9982    | 0.0            | 1.1244      | 0.0      |
| gluon_inception_v3              | 128 | 0.9999 | 0.9986    | 0.0            | 1.1222      | 0.0      |
| swsl_resnext101_32x16d          | 32  | 0.9994 | 0.9989    | 0.0            | 1.1076      | 0.0      |
| pnasnet5large                   | 16  | 0.9988 | 0.9982    | 0.0            | 1.0821      | 0.0      |
| convmixer_768_32                | 32  | 0.9998 | 0.9999    | 0.0            | 1.061       | 0.0      |
| pit_b_224                       | 64  | 0.9998 | 0.9976    | 0.0            | 1.0601      | 0.0      |
| gluon_xception65                | 32  | 0.9992 | 0.9976    | 0.0            | 1.0409      | 0.0      |
| poolformer_m36                  | 64  | 0.9994 | 0.9985    | 0.0            | 1.0061      | 0.0      |
| swin_base_patch4_window7_224    | 64  | 0.9998 | 0.9787    | 0.0            | 0.9982      | 0.0      |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| name                            | bs | eager | aot_eager     | aot_cudagraphs | aot_nvfuser   | inductor      |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| convnext_base                   | 2  | pass  | pass          | pass           | pass          | pass          |
| gmixer_24_224                   | 2  | pass  | pass          | pass           | pass          | pass          |
| gmlp_s16_224                    | 2  | pass  | pass          | pass           | pass          | pass          |
| mixer_b16_224                   | 2  | pass  | pass          | pass           | pass          | pass          |
| mnasnet_100                     | 2  | pass  | pass          | pass           | pass          | pass          |
| repvgg_a2                       | 2  | pass  | pass          | pass           | pass          | pass          |
| spnasnet_100                    | 2  | pass  | pass          | pass           | pass          | pass          |
| adv_inception_v3                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| beit_base_patch16_224           | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| botnet26t_256                   | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| convmixer_768_32                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| crossvit_9_240                  | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| cspdarknet53                    | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| deit_base_distilled_patch16_224 | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| dla102                          | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| dm_nfnet_f0                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| dpn107                          | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| eca_botnext26ts_256             | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| eca_halonext26ts                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| ese_vovnet19b_dw                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| gernet_l                        | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| ghostnet_100                    | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| gluon_inception_v3              | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| hrnet_w18                       | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| inception_v3                    | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| lcnet_050                       | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| mixnet_l                        | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| mobilenetv2_100                 | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| mobilenetv3_large_100           | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| mobilevit_s                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| nfnet_l0                        | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| pnasnet5large                   | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| regnety_002                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| res2net101_26w_4s               | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| res2net50_14w_8s                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| res2next50                      | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| rexnet_100                      | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| sebotnet33ts_256                | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| selecsls42b                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| swin_base_patch4_window7_224    | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| swsl_resnext101_32x16d          | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| tf_efficientnet_b0              | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| tf_mixnet_l                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| tinynet_a                       | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| tnt_s_patch16_224               | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| visformer_small                 | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| vit_base_patch16_224            | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| volo_d1_224                     | 2  | pass  | pass          | fail_to_run    | pass          | pass          |
| resmlp_12_224                   | 2  | pass  | pass          | pass           | fail_to_run   | pass          |
| convit_base                     | 2  | pass  | pass          | fail_to_run    | fail_to_run   | pass          |
| xcit_large_24_p8_224            | 2  | pass  | fail_accuracy | fail_to_run    | fail_to_run   | pass          |
| gluon_xception65                | 2  | pass  | pass          | fail_to_run    | fail_accuracy | pass          |
| poolformer_m36                  | 2  | pass  | pass          | fail_to_run    | fail_accuracy | pass          |
| coat_lite_mini                  | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | pass          |
| jx_nest_base                    | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | pass          |
| pit_b_224                       | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | pass          |
| twins_pcpvt_base                | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | pass          |
| fbnetc_100                      | 2  | pass  | pass          | pass           | pass          | fail_accuracy |
| fbnetv3_b                       | 2  | pass  | pass          | fail_to_run    | pass          | fail_accuracy |
| resnest101e                     | 2  | pass  | pass          | fail_to_run    | fail_accuracy | fail_accuracy |
| cait_m36_384                    | 2  | pass  | fail_accuracy | fail_to_run    | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
| name                            | bs  | eager   | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
| hrnet_w18                       | 2   | 97.691  | 128.2568  | nan            | 297.4634    | 1326.4957 |
| dpn107                          | 32  | 13.3213 | 24.7932   | nan            | 87.0993     | 1248.9361 |
| rexnet_100                      | 128 | 6.4621  | 12.2169   | nan            | 106.2673    | 954.6179  |
| res2net50_14w_8s                | 2   | 19.923  | 34.3528   | nan            | 87.1183     | 931.7786  |
| mobilevit_s                     | 32  | 5.6473  | 11.2171   | nan            | 45.1615     | 830.5169  |
| mixnet_l                        | 64  | 13.4325 | 20.6526   | nan            | 69.3248     | 755.6167  |
| eca_botnext26ts_256             | 64  | 2.4512  | 6.3653    | nan            | 49.7973     | 739.881   |
| ghostnet_100                    | 128 | 8.9586  | 16.0244   | nan            | 65.523      | 667.2918  |
| tinynet_a                       | 128 | 7.716   | 13.3784   | nan            | 67.6541     | 645.5261  |
| fbnetc_100                      | 128 | 5.434   | 10.4927   | nan            | 50.2826     | 612.6355  |
| resnest101e                     | 32  | 26.5974 | 40.6878   | nan            | 100.07      | 606.1691  |
| twins_pcpvt_base                | 32  | 26.1593 | 36.7425   | nan            | 69.8539     | 604.2703  |
| fbnetv3_b                       | 128 | 12.668  | 20.6589   | nan            | 85.5758     | 582.8326  |
| coat_lite_mini                  | 128 | 3.1411  | 7.0172    | nan            | 16.5631     | 571.6242  |
| res2next50                      | 2   | 7.3453  | 14.7597   | nan            | 47.873      | 543.6359  |
| dla102                          | 64  | 10.6899 | 18.9579   | nan            | 71.5604     | 512.4648  |
| mnasnet_100                     | 128 | 4.0287  | 7.8445    | nan            | 40.2794     | 476.7122  |
| tf_mixnet_l                     | 64  | 13.6793 | 21.1891   | nan            | 69.8174     | 473.6279  |
| sebotnet33ts_256                | 64  | 3.7753  | 8.4122    | nan            | 53.6861     | 472.7106  |
| eca_halonext26ts                | 64  | 2.5729  | 6.504     | nan            | 51.8466     | 455.4176  |
| cspdarknet53                    | 64  | 5.8183  | 11.2112   | nan            | 52.2796     | 454.9286  |
| res2net101_26w_4s               | 64  | 25.839  | 41.9511   | nan            | 106.9702    | 414.5372  |
| tf_efficientnet_b0              | 128 | 5.8682  | 10.6433   | nan            | 65.7641     | 404.0946  |
| ese_vovnet19b_dw                | 128 | 1.8725  | 4.1691    | nan            | 31.8386     | 401.7781  |
| mobilenetv2_100                 | 128 | 4.1951  | 8.0722    | nan            | 39.9188     | 346.6298  |
| convnext_base                   | 32  | 11.3503 | 16.231    | nan            | 31.7707     | 334.6469  |
| regnety_002                     | 128 | 4.7306  | 9.0104    | nan            | 49.7585     | 326.9695  |
| xcit_large_24_p8_224            | 5   | 36.843  | 52.7637   | nan            | nan         | 322.862   |
| jx_nest_base                    | 32  | 9.6403  | 17.4674   | nan            | 66.114      | 322.1634  |
| mobilenetv3_large_100           | 128 | 4.3189  | 8.215     | nan            | 67.2751     | 296.0159  |
| visformer_small                 | 128 | 2.2803  | 5.4314    | nan            | 25.6553     | 293.7596  |
| cait_m36_384                    | 2   | 48.6937 | 65.4215   | nan            | 92.2057     | 279.3734  |
| gernet_l                        | 128 | 4.7024  | 9.9219    | nan            | 39.0823     | 252.524   |
| crossvit_9_240                  | 64  | 7.4244  | 13.9238   | nan            | 32.9177     | 251.2102  |
| selecsls42b                     | 128 | 2.3137  | 5.5553    | nan            | 40.3432     | 243.8417  |
| spnasnet_100                    | 128 | 5.3442  | 10.3948   | nan            | 46.8295     | 227.5039  |
| lcnet_050                       | 128 | 1.9178  | 4.1662    | nan            | 31.8492     | 219.0054  |
| volo_d1_224                     | 64  | 6.695   | 12.6315   | nan            | 32.6511     | 192.8301  |
| convit_base                     | 32  | 3.8807  | 8.8518    | nan            | 21.3229     | 187.4577  |
| gmlp_s16_224                    | 64  | 9.0829  | 14.1574   | nan            | 21.2325     | 149.2961  |
| tnt_s_patch16_224               | 64  | 11.8226 | 21.1815   | nan            | 34.8234     | 140.3073  |
| gmixer_24_224                   | 64  | 8.2047  | 14.0553   | nan            | 23.6592     | 132.0265  |
| repvgg_a2                       | 128 | 4.598   | 8.933     | nan            | 46.5715     | 124.4128  |
| resmlp_12_224                   | 128 | 2.6661  | 4.8475    | nan            | nan         | 98.1064   |
| nfnet_l0                        | 64  | 5.9174  | 11.4931   | nan            | 30.9432     | 96.3515   |
| mixer_b16_224                   | 64  | 2.6958  | 5.1905    | nan            | 12.7396     | 94.3682   |
| deit_base_distilled_patch16_224 | 64  | 3.0897  | 6.374     | nan            | 12.9275     | 84.8878   |
| beit_base_patch16_224           | 64  | 4.6591  | 9.1219    | nan            | 17.496      | 83.1654   |
| vit_base_patch16_224            | 64  | 2.8722  | 6.2339    | nan            | 11.5018     | 68.0847   |
| pnasnet5large                   | 16  | 59.4832 | 80.4982   | nan            | 183.5509    | nan       |
| adv_inception_v3                | 128 | 8.161   | 15.6215   | nan            | 74.7227     | nan       |
| gluon_inception_v3              | 128 | 8.2187  | 15.8574   | nan            | 74.6038     | nan       |
| inception_v3                    | 128 | 8.1272  | 15.7713   | nan            | 74.2407     | nan       |
| swin_base_patch4_window7_224    | 64  | 11.989  | 21.809    | nan            | 68.8397     | nan       |
| gluon_xception65                | 32  | 14.9902 | 24.9327   | nan            | 55.4597     | nan       |
| swsl_resnext101_32x16d          | 32  | 10.0119 | 18.3546   | nan            | 49.483      | nan       |
| botnet26t_256                   | 128 | 2.4012  | 5.6863    | nan            | 42.0424     | nan       |
| dm_nfnet_f0                     | 128 | 6.5043  | 11.8243   | nan            | 34.6682     | nan       |
| poolformer_m36                  | 64  | 13.1099 | 19.2828   | nan            | 34.655      | nan       |
| convmixer_768_32                | 32  | 6.8749  | 11.8401   | nan            | 20.2715     | nan       |
| pit_b_224                       | 64  | 3.6016  | 7.4214    | nan            | 15.3574     | nan       |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| gmixer_24_224                   | 64  | 0.9992 | 0.9684    | nan            | 0.9825      | 1.3808   |
| nfnet_l0                        | 64  | 1.0008 | 0.8298    | nan            | 0.813       | 1.2558   |
| tinynet_a                       | 128 | 1.0    | 0.7831    | nan            | 0.7845      | 1.1735   |
| rexnet_100                      | 128 | 0.9992 | 0.7879    | nan            | 0.871       | 1.1072   |
| convit_base                     | 32  | 1.0001 | 0.8879    | nan            | 0.9506      | 1.068    |
| mobilenetv2_100                 | 128 | 0.9998 | 0.7664    | nan            | 0.7679      | 1.0051   |
| mobilevit_s                     | 32  | 0.9999 | 0.7692    | nan            | 0.7431      | 1.0011   |
| dla102                          | 64  | 0.9881 | 0.9181    | nan            | 0.9541      | 1.001    |
| eca_halonext26ts                | 64  | 0.9999 | 0.7717    | nan            | 0.7731      | 0.9711   |
| eca_botnext26ts_256             | 64  | 1.0    | 0.7705    | nan            | 0.7679      | 0.9703   |
| tf_mixnet_l                     | 64  | 1.0001 | 0.861     | nan            | 0.8605      | 0.9698   |
| cait_m36_384                    | 2   | 1.0001 | 0.9024    | nan            | 0.9202      | 0.9451   |
| tf_efficientnet_b0              | 128 | 0.9998 | 0.7727    | nan            | 0.8426      | 0.9413   |
| mixer_b16_224                   | 64  | 0.9956 | 0.9615    | nan            | 0.8644      | 0.9357   |
| beit_base_patch16_224           | 64  | 1.0    | 0.9575    | nan            | 0.8606      | 0.9272   |
| gmlp_s16_224                    | 64  | 1.0    | 0.9766    | nan            | 0.966       | 0.9267   |
| vit_base_patch16_224            | 64  | 0.9963 | 0.9469    | nan            | 0.8229      | 0.915    |
| tnt_s_patch16_224               | 64  | 1.0001 | 0.9752    | nan            | 0.8518      | 0.9131   |
| volo_d1_224                     | 64  | 0.9999 | 0.9247    | nan            | 0.7472      | 0.9124   |
| deit_base_distilled_patch16_224 | 64  | 0.9964 | 0.9476    | nan            | 0.8242      | 0.9095   |
| spnasnet_100                    | 128 | 1.0005 | 0.9207    | nan            | 0.8496      | 0.9024   |
| selecsls42b                     | 128 | 0.9883 | 0.8982    | nan            | 0.9039      | 0.8999   |
| mixnet_l                        | 64  | 0.9995 | 0.8486    | nan            | 0.7938      | 0.8993   |
| mobilenetv3_large_100           | 128 | 1.0002 | 0.8686    | nan            | 0.8819      | 0.8982   |
| xcit_large_24_p8_224            | 5   | 0.9999 | 0.9206    | nan            | nan         | 0.8952   |
| resnest101e                     | 32  | 1.0    | 0.9458    | nan            | 0.9449      | 0.8922   |
| ghostnet_100                    | 128 | 0.9998 | 0.8872    | nan            | 0.947       | 0.8888   |
| visformer_small                 | 128 | 0.9943 | 0.9442    | nan            | 0.9475      | 0.8883   |
| fbnetv3_b                       | 128 | 0.9995 | 0.7866    | nan            | 0.7861      | 0.8837   |
| dpn107                          | 32  | 0.9997 | 0.9285    | nan            | 0.8949      | 0.8763   |
| convnext_base                   | 32  | 1.0001 | 0.9077    | nan            | 0.7678      | 0.8762   |
| twins_pcpvt_base                | 32  | 1.0002 | 0.9127    | nan            | 0.8351      | 0.8723   |
| cspdarknet53                    | 64  | 1.0    | 0.8562    | nan            | 0.8797      | 0.8624   |
| jx_nest_base                    | 32  | 1.0017 | 0.898     | nan            | 0.7112      | 0.8574   |
| ese_vovnet19b_dw                | 128 | 0.9999 | 0.8938    | nan            | 0.9369      | 0.8467   |
| sebotnet33ts_256                | 64  | 1.0    | 0.7109    | nan            | 0.6852      | 0.841    |
| resmlp_12_224                   | 128 | 0.9893 | 0.9525    | nan            | nan         | 0.8169   |
| res2net101_26w_4s               | 64  | 1.0001 | 0.9307    | nan            | 0.8959      | 0.8167   |
| crossvit_9_240                  | 64  | 1.0001 | 0.8721    | nan            | 0.729       | 0.8108   |
| mnasnet_100                     | 128 | 1.0003 | 0.9126    | nan            | 0.8368      | 0.7984   |
| coat_lite_mini                  | 128 | 1.0049 | 0.8826    | nan            | 0.7873      | 0.79     |
| lcnet_050                       | 128 | 1.0005 | 0.7721    | nan            | 0.7722      | 0.7579   |
| regnety_002                     | 128 | 0.9981 | 0.829     | nan            | 0.7759      | 0.7465   |
| gernet_l                        | 128 | 1.0    | 0.7965    | nan            | 0.8012      | 0.727    |
| fbnetc_100                      | 128 | 0.9998 | 0.8597    | nan            | 0.7507      | 0.7246   |
| hrnet_w18                       | 2   | 0.9986 | 0.8792    | nan            | 0.8869      | 0.6089   |
| res2next50                      | 2   | 1.0    | 0.8353    | nan            | 0.8404      | 0.606    |
| res2net50_14w_8s                | 2   | 1.0    | 0.8387    | nan            | 0.8474      | 0.5877   |
| repvgg_a2                       | 128 | 1.0003 | 0.8145    | nan            | 0.6633      | 0.536    |
| pnasnet5large                   | 16  | 1.069  | 1.011     | nan            | 1.2062      | nan      |
| convmixer_768_32                | 32  | 1.0    | 0.9868    | nan            | 0.9807      | nan      |
| dm_nfnet_f0                     | 128 | 0.9393 | 0.897     | nan            | 0.9515      | nan      |
| poolformer_m36                  | 64  | 1.0003 | 0.9533    | nan            | 0.9368      | nan      |
| gluon_xception65                | 32  | 0.9999 | 0.9384    | nan            | 0.9001      | nan      |
| adv_inception_v3                | 128 | 1.0002 | 0.8694    | nan            | 0.88        | nan      |
| gluon_inception_v3              | 128 | 1.0002 | 0.8694    | nan            | 0.88        | nan      |
| inception_v3                    | 128 | 1.0002 | 0.8694    | nan            | 0.88        | nan      |
| swsl_resnext101_32x16d          | 32  | 1.0003 | 0.8983    | nan            | 0.8684      | nan      |
| swin_base_patch4_window7_224    | 64  | 0.9999 | 0.9309    | nan            | 0.83        | nan      |
| botnet26t_256                   | 128 | 1.0    | 0.8494    | nan            | 0.7497      | nan      |
| pit_b_224                       | 64  | 0.9992 | 0.7962    | nan            | 0.6417      | nan      |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/UK3S63s.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/QktiM8v.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/fpj6SsV.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and the gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint compression ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When computing speedup, compilation latency and memory footprint numbers, we exclude the models that fail the accuracy checks.
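As a rough illustration of "normalizing against native PyTorch", speedup can be read as baseline latency divided by backend latency. The sketch below is not the benchmark harness itself: `time_calls` and `speedup` are hypothetical helper names, and the use of a median over several repeats is an assumption made here for stability.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Return wall-clock latencies (seconds) for `repeats` calls of `fn`.

    Assumption: `fn` runs one full forward and backward pass. For GPU work,
    `fn` should synchronize the device before returning, otherwise only the
    kernel launch is timed.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def speedup(baseline_times, candidate_times):
    """Speedup of the candidate over the baseline (values > 1 mean faster)."""
    return statistics.median(baseline_times) / statistics.median(candidate_times)


# Pre-measured latencies: the candidate takes half as long as the baseline.
print(speedup([2.0, 2.1, 1.9], [1.0, 1.0, 1.1]))
```

Under this reading, a speedup below 1.00x (as in many `aot_eager` rows) means the backend is slower than native PyTorch.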

Passrate

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager | 93%, 41/44  |
| inductor  | 64%, 28/44  |
+-----------+-------------+
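
The cells above pair a rounded percentage with the raw pass count. A minimal sketch of that formatting follows; the function name `passrate` and the choice of which status labels count as passing are assumptions, not taken from the benchmark scripts.

```python
def passrate(statuses):
    """Format a pass rate as '<percent>%, <passed>/<total>'."""
    # Assumption: only these two labels from the accuracy tables count as passing.
    passing = {"pass", "pass_due_to_skip"}
    total = len(statuses)
    if total == 0:
        return "0%, 0/0"
    passed = sum(1 for status in statuses if status in passing)
    return f"{passed / total:.0%}, {passed}/{total}"


# 41 passing models out of 44 gives the same cell as the aot_eager row.
print(passrate(["pass"] * 41 + ["fail_to_run"] * 3))  # 93%, 41/44
```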

Geometric mean speedup

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    1.00x    |
| inductor  |    1.76x    |
+-----------+-------------+
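
For reference, a geometric mean over per-model speedups can be sketched as below. Treating `0.0` and `nan` entries as failed models and dropping them is an assumption based on the note that failing models are excluded; `geomean_speedup` is a hypothetical name.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive, finite speedups.

    Assumption: entries that are 0.0 or nan mark models that failed and are
    excluded. Returns 0.0 when no model is left.
    """
    valid = [s for s in speedups if not math.isnan(s) and s > 0.0]
    if not valid:
        return 0.0
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# The failed model (0.0) is ignored; geomean of 2.0 and 8.0 is 4.0.
print(f"{geomean_speedup([2.0, 8.0, 0.0]):.2f}x")  # 4.00x
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown cancel out to 1x, which an arithmetic mean would not give.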

Mean compilation time (seconds)

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    20.82    |
| inductor  |    80.93    |
+-----------+-------------+

Peak memory footprint compression ratio (higher is better)

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    0.88x    |
| inductor  |    0.74x    |
+-----------+-------------+
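
Since higher is better, the ratio is presumably the eager peak memory divided by the backend's peak memory; that definition is an assumption here, as are the helper name and the byte counts in the example.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Peak memory compression ratio (assumed: eager peak / backend peak).

    Values above 1.0 mean the backend uses less peak memory than eager;
    values below 1.0 mean it uses more.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend_peak_bytes must be positive")
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical numbers: eager peaks at 8 GiB, the backend at 10 GiB.
GIB = 1024 ** 3
print(f"{compression_ratio(8 * GIB, 10 * GIB):.2f}x")  # 0.80x
```

Read this way, the 0.74x for inductor would mean its peak memory is higher than eager's on average.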

Metrics over time

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/zXAGdnx.png)

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/akx9SYz.png)

huggingface suite with float32 precision

Performance speedup ~~~ +-----------------------------------------+----+-----------+----------+ | name | bs | aot_eager | inductor | +-----------------------------------------+----+-----------+----------+ | MT5ForConditionalGeneration | 2 | 0.912 | 4.6277 | | ElectraForCausalLM | 1 | 0.9372 | 4.1844 | | YituTechConvBert | 1 | 0.9314 | 3.7368 | | MegatronBertForCausalLM | 2 | 0.9425 | 3.3657 | | OPTForCausalLM | 4 | 0.9837 | 2.9643 | | MobileBertForMaskedLM | 16 | 0.9291 | 2.9327 | | RobertaForCausalLM | 4 | 0.9599 | 2.5863 | | M2M100ForConditionalGeneration | 2 | 0.9524 | 2.544 | | XGLMForCausalLM | 1 | 0.8789 | 2.4742 | | PegasusForConditionalGeneration | 4 | 0.8936 | 2.4345 | | MobileBertForQuestionAnswering | 32 | 0.9097 | 2.3948 | | CamemBert | 1 | 0.9434 | 2.2449 | | GoogleFnet | 1 | 0.8119 | 2.0603 | | DistillGPT2 | 1 | 0.934 | 1.9454 | | MegatronBertForQuestionAnswering | 8 | 0.932 | 1.8596 | | PLBartForConditionalGeneration | 8 | 0.9042 | 1.6688 | | MBartForConditionalGeneration | 8 | 0.8875 | 1.4768 | | XLNetLMHeadModel | 4 | 0.9655 | 1.427 | | T5Small | 1 | 0.9592 | 1.358 | | Speech2Text2ForCausalLM | 64 | 0.9438 | 1.2946 | | DistilBertForQuestionAnswering | 32 | 0.9767 | 1.2753 | | TrOCRForCausalLM | 8 | 0.9338 | 1.2341 | | PegasusForCausalLM | 8 | 0.9351 | 1.2218 | | BartForConditionalGeneration | 1 | 0.9916 | 1.2055 | | BlenderbotSmallForConditionalGeneration | 32 | 0.9314 | 1.1764 | | DebertaForQuestionAnswering | 4 | 0.7412 | 1.1722 | | DistilBertForMaskedLM | 16 | 0.98 | 1.163 | | PLBartForCausalLM | 16 | 0.9466 | 1.1229 | | BartForCausalLM | 2 | 0.9662 | 1.1018 | | RobertaForQuestionAnswering | 64 | 0.9825 | 1.0993 | | BigBird | 1 | 0.9386 | 1.0925 | | BertForQuestionAnswering | 64 | 0.9818 | 1.0919 | | MBartForCausalLM | 16 | 0.9638 | 1.0433 | | AlbertForQuestionAnswering | 2 | 0.9998 | 0.0 | | AlbertForMaskedLM | 2 | 0.9979 | 0.0 | | LayoutLMForSequenceClassification | 16 | 0.9875 | 0.0 | | ElectraForQuestionAnswering | 64 | 0.984 | 0.0 | | 
GPT2ForSequenceClassification | 4 | 0.9756 | 0.0 | | T5ForConditionalGeneration | 4 | 0.9709 | 0.0 | | LayoutLMForMaskedLM | 16 | 0.9701 | 0.0 | | BertForMaskedLM | 64 | 0.9612 | 0.0 | | BlenderbotSmallForCausalLM | 64 | 0.9085 | 0.0 | | AllenaiLongformerBase | 1 | 0.8731 | 0.0 | | DebertaForMaskedLM | 4 | 0.8027 | 0.0 | +-----------------------------------------+----+-----------+----------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-----------+-------------+ | name | bs | aot_eager | inductor | +-----------------------------------------+----+-----------+-------------+ | BartForCausalLM | 1 | pass | pass | | BertForMaskedLM | 1 | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | | BigBird | 1 | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | | CamemBert | 1 | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | | DistillGPT2 | 1 | pass | pass | | ElectraForCausalLM | 1 | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | | GoogleFnet | 1 | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | | MBartForCausalLM | 1 | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | | OPTForCausalLM | 1 | pass | pass | | PLBartForCausalLM | 1 | pass | pass | | PegasusForCausalLM | 1 | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | | RobertaForCausalLM | 1 | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | | 
T5ForConditionalGeneration | 1 | pass | pass | | T5Small | 1 | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | | YituTechConvBert | 1 | pass | pass | | AlbertForMaskedLM | 1 | pass | fail_to_run | | AlbertForQuestionAnswering | 1 | pass | fail_to_run | | AllenaiLongformerBase | 1 | pass | fail_to_run | | MBartForConditionalGeneration | 1 | pass | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | fail_to_run | | BartForConditionalGeneration | 0 | 0.0000 | 0.0000 | | M2M100ForConditionalGeneration | 0 | 0.0000 | 0.0000 | | XGLMForCausalLM | 0 | 0.0000 | 0.0000 | +-----------------------------------------+----+-----------+-------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+----+-----------+----------+ | name | bs | aot_eager | inductor | +-----------------------------------------+----+-----------+----------+ | MobileBertForMaskedLM | 16 | 161.9975 | 230.3942 | | MobileBertForQuestionAnswering | 32 | 156.5321 | 229.1364 | | M2M100ForConditionalGeneration | 2 | 36.8031 | 169.1486 | | XLNetLMHeadModel | 4 | 36.3955 | 140.9739 | | MBartForConditionalGeneration | 8 | 39.8095 | 135.2674 | | XGLMForCausalLM | 1 | 25.104 | 133.6482 | | BartForConditionalGeneration | 1 | 38.6818 | 127.4739 | | PegasusForConditionalGeneration | 4 | 38.2538 | 119.5229 | | MT5ForConditionalGeneration | 2 | 16.9923 | 111.013 | | DebertaForQuestionAnswering | 4 | 13.4467 | 110.5823 | | MegatronBertForCausalLM | 2 | 26.3919 | 108.5788 | | MegatronBertForQuestionAnswering | 8 | 26.2115 | 106.7605 | | PLBartForConditionalGeneration | 8 | 13.4489 | 91.7812 | | BlenderbotSmallForConditionalGeneration | 32 | 20.3056 | 86.0149 | | T5Small | 1 | 11.1881 | 80.8945 | | YituTechConvBert | 1 | 16.8696 | 78.8642 | | TrOCRForCausalLM | 8 | 14.528 | 68.5267 | | OPTForCausalLM | 4 | 9.7842 | 64.4264 | | MBartForCausalLM | 16 | 14.6384 | 60.153 | | PegasusForCausalLM | 8 | 14.4482 | 60.0701 | | BartForCausalLM | 2 | 
14.2281 | 57.4828 | | RobertaForCausalLM | 4 | 10.1821 | 57.4763 | | ElectraForCausalLM | 1 | 9.8272 | 56.1576 | | RobertaForQuestionAnswering | 64 | 9.9816 | 56.0056 | | BertForQuestionAnswering | 64 | 9.7034 | 55.3096 | | CamemBert | 1 | 9.9611 | 52.6892 | | BigBird | 1 | 17.3435 | 52.4819 | | Speech2Text2ForCausalLM | 64 | 5.4863 | 44.3714 | | PLBartForCausalLM | 16 | 5.4581 | 41.5006 | | DistilBertForMaskedLM | 16 | 4.3538 | 37.1118 | | DistilBertForQuestionAnswering | 32 | 4.2162 | 34.2192 | | GoogleFnet | 1 | 4.3239 | 33.2724 | | DistillGPT2 | 1 | 3.8532 | 32.1718 | | AllenaiLongformerBase | 1 | 19.6187 | nan | | DebertaForMaskedLM | 4 | 13.4496 | nan | | T5ForConditionalGeneration | 4 | 11.1033 | nan | | LayoutLMForSequenceClassification | 16 | 10.3767 | nan | | LayoutLMForMaskedLM | 16 | 10.3325 | nan | | BertForMaskedLM | 64 | 9.9551 | nan | | ElectraForQuestionAnswering | 64 | 9.8843 | nan | | GPT2ForSequenceClassification | 4 | 8.4326 | nan | | BlenderbotSmallForCausalLM | 64 | 8.0904 | nan | | AlbertForMaskedLM | 2 | 6.3343 | nan | | AlbertForQuestionAnswering | 2 | 5.8626 | nan | +-----------------------------------------+----+-----------+----------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+----+-----------+----------+ | name | bs | aot_eager | inductor | +-----------------------------------------+----+-----------+----------+ | XLNetLMHeadModel | 4 | 0.8976 | 0.9807 | | BertForQuestionAnswering | 64 | 0.9467 | 0.9145 | | RobertaForQuestionAnswering | 64 | 0.9467 | 0.9145 | | T5Small | 1 | 0.9325 | 0.8445 | | DistilBertForQuestionAnswering | 32 | 0.9046 | 0.8405 | | DistilBertForMaskedLM | 16 | 0.9138 | 0.8391 | | BartForCausalLM | 2 | 0.8847 | 0.8303 | | ElectraForCausalLM | 1 | 0.9107 | 0.827 | | BigBird | 1 | 0.9549 | 0.8224 | | PLBartForCausalLM | 16 | 0.8802 | 0.8028 | | MBartForCausalLM | 16 | 0.8629 | 0.8005 | | DistillGPT2 | 1 | 0.7721 | 0.7997 | | Speech2Text2ForCausalLM | 64 | 0.88 | 0.7767 | | 
PLBartForConditionalGeneration | 8 | 0.8222 | 0.7744 | | XGLMForCausalLM | 1 | 0.9999 | 0.7728 | | BartForConditionalGeneration | 1 | 0.8465 | 0.7708 | | BlenderbotSmallForConditionalGeneration | 32 | 0.9036 | 0.7612 | | CamemBert | 1 | 0.7977 | 0.7369 | | YituTechConvBert | 1 | 0.7923 | 0.7298 | | TrOCRForCausalLM | 8 | 0.8048 | 0.7284 | | MBartForConditionalGeneration | 8 | 0.8137 | 0.727 | | OPTForCausalLM | 4 | 0.75 | 0.714 | | RobertaForCausalLM | 4 | 0.7778 | 0.7099 | | PegasusForCausalLM | 8 | 0.9323 | 0.7012 | | MegatronBertForQuestionAnswering | 8 | 0.8265 | 0.6997 | | GoogleFnet | 1 | 0.9447 | 0.6953 | | M2M100ForConditionalGeneration | 2 | 0.9801 | 0.6643 | | MegatronBertForCausalLM | 2 | 0.7066 | 0.6453 | | PegasusForConditionalGeneration | 4 | 0.9004 | 0.642 | | MT5ForConditionalGeneration | 2 | 0.6173 | 0.6173 | | MobileBertForMaskedLM | 16 | 0.9179 | 0.5861 | | MobileBertForQuestionAnswering | 32 | 0.9716 | 0.4668 | | DebertaForQuestionAnswering | 4 | 1.0525 | 0.3569 | | DebertaForMaskedLM | 4 | 0.9851 | nan | | T5ForConditionalGeneration | 4 | 0.9597 | nan | | ElectraForQuestionAnswering | 64 | 0.9524 | nan | | AllenaiLongformerBase | 1 | 0.9515 | nan | | LayoutLMForMaskedLM | 16 | 0.9409 | nan | | AlbertForQuestionAnswering | 2 | 0.9369 | nan | | LayoutLMForSequenceClassification | 16 | 0.9348 | nan | | BertForMaskedLM | 64 | 0.9219 | nan | | AlbertForMaskedLM | 2 | 0.9172 | nan | | GPT2ForSequenceClassification | 4 | 0.9091 | nan | | BlenderbotSmallForCausalLM | 64 | 0.8401 | nan | +-----------------------------------------+----+-----------+----------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/qis9laD.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and the gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint compression ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When computing speedup, compilation latency and memory footprint numbers, we exclude the models that fail the accuracy checks.

Passrate

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      | 98%, 52/53 | 98%, 42/43  | 100%, 61/61 |
|   aot_eager    | 98%, 52/53 | 98%, 42/43  | 90%, 55/61  |
| aot_cudagraphs | 28%, 15/53 |  2%, 1/43   |  10%, 6/61  |
|  aot_nvfuser   | 60%, 32/53 |  0%, 0/43   | 75%, 46/61  |
|    inductor    | 81%, 43/53 | 86%, 37/43  | 90%, 55/61  |
+----------------+------------+-------------+-------------+

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.09x    |    1.00x    |    1.00x    |
|  aot_nvfuser   |   1.16x    |    0.0x     |    1.20x    |
|    inductor    |   1.68x    |    2.20x    |    1.31x    |
+----------------+------------+-------------+-------------+

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    6.15    |    14.88    |    11.73    |
|   aot_eager    |   12.44    |    25.70    |    19.93    |
| aot_cudagraphs |   12.80    |    93.53    |    51.65    |
|  aot_nvfuser   |   29.54    |     0.0     |    79.13    |
|    inductor    |   258.47   |   118.80    |   452.93    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.85x    |    0.86x    |    0.88x    |
| aot_cudagraphs |   0.43x    |    0.38x    |    0.19x    |
|  aot_nvfuser   |   0.83x    |    0.0x     |    0.85x    |
|    inductor    |   0.77x    |    0.82x    |    0.89x    |
+----------------+------------+-------------+-------------+

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | densenet121 | 4 | 1.0002 | 0.9102 | 0.0 | 1.397 | 5.0623 | | functorch_dp_cifar10 | 64 | 1.0015 | 0.9112 | 0.0 | 1.1939 | 4.737 | | timm_efficientdet | 1 | 0.9848 | 0.8085 | 0.0 | 0.0 | 4.2687 | | BERT_pytorch | 16 | 1.0107 | 0.8304 | 0.0 | 0.0 | 3.1041 | | timm_vision_transformer | 8 | 1.0006 | 0.846 | 0.0 | 1.3541 | 3.0679 | | drq | 1 | 1.0024 | 0.8093 | 0.0 | 1.106 | 2.9813 | | resnet18 | 16 | 1.0009 | 0.989 | 0.0 | 1.3483 | 2.6731 | | dcgan | 32 | 0.9772 | 0.9046 | 1.1443 | 0.7307 | 2.6188 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.998 | 0.9303 | 1.4873 | 1.2113 | 2.5972 | | hf_Albert | 8 | 1.0011 | 0.9552 | 0.0 | 0.0 | 2.3953 | | squeezenet1_1 | 32 | 0.9933 | 0.9562 | 1.337 | 1.1937 | 2.3116 | | resnext50_32x4d | 8 | 1.0027 | 0.9499 | 0.0 | 1.3374 | 2.1943 | | mobilenet_v3_large | 32 | 1.0042 | 1.0057 | 0.0 | 1.411 | 2.1611 | | hf_T5 | 8 | 0.9984 | 0.9446 | 0.0 | 0.0 | 2.1382 | | hf_T5_large | 2 | 1.0172 | 0.8568 | 0.0 | 0.0 | 2.122 | | pytorch_struct | 200 | 1.0012 | 0.7441 | 1.0266 | 0.9964 | 2.0323 | | hf_Bert | 4 | 1.0336 | 0.8486 | 0.0 | 0.0 | 1.8828 | | hf_GPT2 | 4 | 1.017 | 0.9879 | 0.0 | 0.0 | 1.8574 | | mnasnet1_0 | 32 | 0.9986 | 1.0159 | 0.9193 | 1.4046 | 1.7708 | | LearningToPaint | 96 | 1.0045 | 1.0023 | 0.0 | 1.3491 | 1.7422 | | hf_Bart | 4 | 1.0155 | 0.8359 | 0.0 | 0.0 | 1.7211 | | lennard_jones | 1000 | 0.9786 | 0.7278 | 1.2952 | 1.0447 | 1.5978 | | timm_efficientnet | 32 | 0.9608 | 0.8133 | 0.0 | 1.1851 | 1.5685 | | attention_is_all_you_need_pytorch | 256 | 1.0029 | 0.9032 | 0.0 | 0.0 | 1.5158 | | soft_actor_critic | 256 | 1.011 | 0.707 | 1.2513 | 1.0703 | 1.4902 | | hf_DistilBert | 8 | 1.0015 | 0.969 | 0.0 | 0.0 | 1.4765 | | 
fastNLP_Bert | 6 | 1.0004 | 0.8861 | 0.0 | 0.0 | 1.4585 | | shufflenet_v2_x1_0 | 128 | 1.0011 | 1.0157 | 0.0 | 1.3391 | 1.3717 | | pytorch_unet | 1 | 0.9999 | 0.9926 | 0.0 | 1.1552 | 1.3528 | | timm_nfnet | 128 | 0.9997 | 0.9985 | 0.0 | 1.1712 | 1.3388 | | pytorch_stargan | 16 | 0.9984 | 1.0165 | 0.8265 | 1.1173 | 1.3192 | | Super_SloMo | 6 | 0.9997 | 0.996 | 0.0 | 0.0 | 1.2905 | | vgg16 | 64 | 0.9997 | 0.9978 | 0.7975 | 0.9952 | 1.2744 | | Background_Matting | 4 | 0.9993 | 1.0175 | 0.0 | 1.1152 | 1.2167 | | alexnet | 128 | 0.9993 | 0.9971 | 0.788 | 1.0029 | 1.2085 | | timm_resnest | 32 | 0.9995 | 1.0217 | 0.0 | 1.3245 | 1.2011 | | hf_Reformer | 4 | 0.9924 | 0.9996 | 0.9192 | 0.0 | 1.1589 | | timm_vision_transformer_large | 8 | 0.9991 | 0.9895 | 0.0 | 0.9926 | 1.1581 | | hf_BigBird | 2 | 0.9986 | 0.9103 | 0.0 | 0.0 | 1.1491 | | timm_vovnet | 32 | 0.9212 | 0.8868 | 0.0 | 1.1273 | 1.1101 | | moco | 32 | 0.9968 | 0.0 | 0.0 | 0.0 | 1.0487 | | tts_angular | 64 | 0.9963 | 0.9382 | 0.9949 | 0.9984 | 1.0118 | | demucs | 4 | 0.9985 | 1.0008 | 0.9996 | 0.9991 | 1.0012 | | nvidia_deeprecommender | 256 | 0.9985 | 0.9955 | 0.6966 | 0.9787 | 0.9905 | | mobilenet_v2 | 96 | 0.9988 | 0.9875 | 0.0 | 0.9305 | 0.9033 | | resnet50 | 32 | 1.0012 | 1.0086 | 0.0 | 1.3687 | 0.8978 | | timm_regnet | 32 | 0.9812 | 0.9369 | 0.0 | 1.2152 | 0.7564 | | yolov3 | 16 | 0.9986 | 0.9886 | 0.0 | 0.9097 | 0.0 | | hf_Longformer | 2 | 0.9639 | 0.8829 | 0.8871 | 0.0 | 0.0 | | dlrm | 2048 | 0.0 | 1.2025 | 0.0 | 0.0 | 0.0 | | hf_GPT2_large | 4 | 0.9996 | 0.9898 | 0.0 | 0.0 | 0.0 | | speech_transformer | 32 | 1.0047 | 0.8518 | 0.0 | 0.0 | 0.0 | | tacotron2 | 64 | 0.9796 | 0.7578 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | 
aot_nvfuser | inductor | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | alexnet | 2 | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | fail_to_run | pass | pass | | LearningToPaint | 2 | pass | pass | fail_to_run | pass | pass | | densenet121 | 2 | pass | pass | fail_to_run | pass | pass | | drq | 1 | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenet_v2 | 2 | pass | pass | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | fail_to_run | pass | pass | | resnet18 | 2 | pass | pass | fail_to_run | pass | pass | | resnet50 | 2 | pass | pass | fail_to_run | pass | pass | | resnext50_32x4d | 2 | pass | pass | fail_to_run | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_nfnet 
| 2 | pass | pass | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | fail_to_run | pass | pass | | timm_vovnet | 2 | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | Super_SloMo | 2 | pass | pass | fail_to_run | fail_to_run | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | | fastNLP_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Albert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_BigBird | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_DistilBert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_GPT2 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | mobilenet_v3_large | 2 | pass | pass | fail_to_run | pass | fail_accuracy | | tts_angular | 2 | pass | pass | pass | pass | 0.0000 | | yolov3 | 2 | pass | pass | fail_to_run | fail_to_run | 0.0000 | 
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+-----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+---------+-----------+----------------+-------------+-----------+ | timm_efficientdet | 1 | 52.6602 | 79.1055 | nan | nan | 1818.1052 | | hf_T5_large | 2 | 36.3097 | 76.3336 | nan | nan | 1734.768 | | densenet121 | 4 | 13.5051 | 28.8872 | nan | 138.834 | 1386.3113 | | mnasnet1_0 | 32 | 3.3962 | 8.6081 | 43.4979 | 46.1998 | 826.0046 | | mobilenet_v3_large | 32 | 3.7847 | 9.2689 | nan | 75.0785 | 748.7895 | | resnext50_32x4d | 8 | 3.6093 | 9.3411 | nan | 39.3272 | 707.1618 | | moco | 32 | 11.37 | nan | nan | nan | 683.29 | | mobilenet_v2 | 96 | 3.3085 | 8.3878 | nan | 43.4906 | 635.5209 | | timm_efficientnet | 32 | 5.9808 | 12.6187 | nan | 73.2075 | 583.9074 | | shufflenet_v2_x1_0 | 128 | 3.8001 | 9.9697 | nan | 40.5941 | 398.4956 | | squeezenet1_1 | 32 | 0.6697 | 1.7814 | 6.7516 | 6.8886 | 385.1688 | | timm_nfnet | 128 | 6.6658 | 13.5281 | nan | 42.693 | 365.85 | | timm_resnest | 32 | 1.4486 | 4.3564 | nan | 43.3691 | 349.0304 | | timm_regnet | 32 | 8.429 | 16.9052 | nan | 67.531 | 317.4367 | | attention_is_all_you_need_pytorch | 256 | 4.4349 | 12.7302 | nan | nan | 251.7143 | | timm_vovnet | 32 | 3.013 | 7.3585 | nan | 32.2961 | 228.4544 | | timm_vision_transformer_large | 8 | 22.8303 | 40.7575 | nan | 59.3202 | 203.1062 | | LearningToPaint | 96 | 1.0884 | 3.1644 | nan | 30.8535 | 196.1728 | | functorch_dp_cifar10 | 64 | 0.8334 | 2.5709 | nan | 6.5105 | 186.6153 | | timm_vision_transformer | 8 | 3.1965 | 8.1056 | nan | 16.1745 | 185.5642 | | BERT_pytorch | 16 | 5.1553 | 13.6741 | nan | nan | 183.0714 | | resnet18 | 16 | 0.9908 | 3.0796 | nan | 23.6609 | 178.4714 | | 
resnet50 | 32 | 3.4622 | 9.1981 | nan | 44.2773 | 167.5082 |
| fastNLP_Bert | 6 | 5.3662 | 12.8031 | nan | nan | 155.3766 |
| hf_T5 | 8 | 3.9527 | 12.7903 | nan | nan | 152.7908 |
| Background_Matting | 4 | 4.0586 | 9.8569 | nan | 45.5208 | 137.705 |
| pytorch_stargan | 16 | 0.8563 | 3.2896 | 11.618 | 7.5638 | 137.6237 |
| hf_Bart | 4 | 7.5098 | 17.1193 | nan | nan | 136.6578 |
| hf_GPT2 | 4 | 3.6623 | 9.9582 | nan | nan | 128.4494 |
| pytorch_struct | 200 | 0.4445 | 1.2679 | 1.8613 | 5.4827 | 121.6668 |
| Super_SloMo | 6 | 2.3013 | 7.0704 | nan | nan | 91.5593 |
| hf_Albert | 8 | 1.5093 | 8.5484 | nan | nan | 81.4949 |
| hf_Reformer | 4 | 3.183 | 5.8886 | 13.7829 | nan | 80.2307 |
| hf_Bert | 4 | 5.2581 | 12.5739 | nan | nan | 72.6783 |
| hf_BigBird | 2 | 11.8968 | 20.2146 | nan | nan | 66.169 |
| pytorch_unet | 1 | 1.1321 | 3.4759 | nan | 26.7798 | 61.8894 |
| hf_DistilBert | 8 | 1.7875 | 5.408 | nan | nan | 47.2585 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.8066 | 3.255 | 11.9549 | 5.2307 | 33.7224 |
| vgg16 | 64 | 0.3752 | 1.1111 | 4.1704 | 3.7078 | 29.9942 |
| alexnet | 128 | 0.2813 | 0.6904 | 1.9896 | 3.2561 | 29.2225 |
| drq | 1 | 0.2865 | 0.7521 | nan | 4.4815 | 22.5066 |
| dcgan | 32 | 0.268 | 0.6336 | 1.8825 | 4.3205 | 17.1734 |
| nvidia_deeprecommender | 256 | 0.2933 | 0.6746 | 1.0105 | 2.9894 | 15.6936 |
| soft_actor_critic | 256 | 0.2749 | 0.4931 | 0.715 | 2.1025 | 14.6417 |
| lennard_jones | 1000 | 0.2403 | 0.5118 | 0.6931 | 1.5472 | 8.5416 |
| tts_angular | 64 | 0.3366 | 0.3937 | 0.5196 | 1.1651 | 4.0383 |
| demucs | 4 | 0.9022 | 0.8912 | 0.8836 | 0.8907 | 0.789 |
| yolov3 | 16 | 7.4472 | 15.7484 | nan | 45.481 | nan |
| hf_Longformer | 2 | 11.7858 | 21.3262 | 90.6374 | nan | nan |
| hf_GPT2_large | 4 | 21.8976 | 41.7361 | nan | nan | nan |
| tacotron2 | 64 | 13.9023 | 30.239 | nan | nan | nan |
| speech_transformer | 32 | 7.6548 | 17.41 | nan | nan | nan |
| dlrm | 2048 | nan | 1.2125 | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| hf_Albert | 8 | 0.9814 | 0.936 | nan | nan | 1.1576 |
| Super_SloMo | 6 | 1.0024 | 0.9697 | nan | nan | 1.1385 |
| timm_nfnet | 128 | 0.9761 | 0.9043 | nan | 0.9504 | 1.0242 |
| tts_angular | 64 | 1.0015 | 1.0015 | 0.9866 | 1.0015 | 0.9908 |
| attention_is_all_you_need_pytorch | 256 | 0.9976 | 0.9403 | nan | nan | 0.9875 |
| demucs | 4 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 |
| timm_efficientdet | 1 | 1.0316 | 0.8425 | nan | nan | 0.9857 |
| BERT_pytorch | 16 | 0.9998 | 0.8819 | nan | nan | 0.9728 |
| timm_efficientnet | 32 | 0.9982 | 0.7762 | nan | 0.7936 | 0.9689 |
| hf_GPT2 | 4 | 0.971 | 0.8627 | nan | nan | 0.9645 |
| Background_Matting | 4 | 1.0201 | 0.9679 | nan | 0.987 | 0.9244 |
| mobilenet_v2 | 96 | 1.0001 | 0.7725 | nan | 0.9235 | 0.8856 |
| pytorch_unet | 1 | 0.9968 | 0.8677 | nan | 0.8518 | 0.8681 |
| fastNLP_Bert | 6 | 1.0013 | 0.8966 | nan | nan | 0.8661 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8751 | 0.2642 | 0.8432 | 0.8602 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8535 |
| hf_DistilBert | 8 | 0.9505 | 0.8806 | nan | nan | 0.8387 |
| hf_Bert | 4 | 0.9844 | 0.8677 | nan | nan | 0.8383 |
| timm_regnet | 32 | 0.9999 | 0.8483 | nan | 0.85 | 0.8361 |
| hf_Bart | 4 | 0.9099 | 0.8321 | nan | nan | 0.8151 |
| hf_BigBird | 2 | 0.9852 | 0.9787 | nan | nan | 0.81 |
| timm_vovnet | 32 | 0.9903 | 0.7754 | nan | 0.7817 | 0.7861 |
| moco | 32 | 0.9667 | nan | nan | nan | 0.782 |
| shufflenet_v2_x1_0 | 128 | 1.0002 | 0.874 | nan | 0.8652 | 0.7812 |
| pytorch_stargan | 16 | 0.9929 | 0.9799 | 0.2149 | 0.8882 | 0.7783 |
| dcgan | 32 | 1.0 | 0.7949 | 0.343 | 0.7073 | 0.7527 |
| vgg16 | 64 | 0.9998 | 0.7378 | 0.2978 | 0.7172 | 0.7491 |
| timm_vision_transformer_large | 8 | 0.9987 | 0.8365 | nan | 0.8491 | 0.7487 |
| alexnet | 128 | 1.0003 | 0.8082 | 0.4354 | 0.805 | 0.7352 |
| hf_T5 | 8 | 0.9678 | 0.9371 | nan | nan | 0.7266 |
| timm_resnest | 32 | 0.9868 | 0.8809 | nan | 0.8726 | 0.7218 |
| timm_vision_transformer | 8 | 1.0001 | 0.8868 | nan | 0.8871 | 0.7151 |
| resnet50 | 32 | 1.0004 | 0.8678 | nan | 0.8041 | 0.7143 |
| mnasnet1_0 | 32 | 0.9994 | 0.8793 | 0.173 | 0.8217 | 0.6596 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.2951 | 0.7589 | 0.6595 |
| mobilenet_v3_large | 32 | 0.999 | 0.8661 | nan | 0.874 | 0.6573 |
| resnext50_32x4d | 8 | 1.0 | 0.8591 | nan | 0.823 | 0.6514 |
| drq | 1 | 0.9125 | 0.8399 | nan | 0.8395 | 0.6406 |
| soft_actor_critic | 256 | 0.964 | 0.9151 | 0.4737 | 0.9151 | 0.6279 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | nan | 0.71 | 0.605 |
| densenet121 | 4 | 1.0 | 0.8696 | nan | 0.8376 | 0.5739 |
| resnet18 | 16 | 0.9782 | 0.7852 | nan | 0.7268 | 0.5644 |
| lennard_jones | 1000 | 1.0 | 1.0002 | 0.3735 | 1.0967 | 0.564 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5262 | 0.5596 | 0.5596 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8131 | nan | 0.846 | 0.4465 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5082 | 0.4235 |
| hf_Reformer | 4 | 0.3764 | 1.0 | 0.2539 | nan | 0.3629 |
| yolov3 | 16 | 1.0054 | 0.8488 | nan | 0.8244 | nan |
| hf_Longformer | 2 | 0.9734 | 0.967 | 0.3374 | nan | nan |
| speech_transformer | 32 | 1.0015 | 0.9177 | nan | nan | nan |
| hf_GPT2_large | 4 | 0.9586 | 0.8649 | nan | nan | nan |
| dlrm | 2048 | nan | 0.7282 | nan | nan | nan |
| tacotron2 | 64 | 0.9879 | 0.4069 | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| MobileBertForQuestionAnswering | 32 | 1.0156 | 0.8169 | 0.0 | 0.0 | 5.8307 |
| MobileBertForMaskedLM | 16 | 1.0187 | 0.8248 | 0.0 | 0.0 | 5.7089 |
| MT5ForConditionalGeneration | 2 | 1.0224 | 0.8508 | 0.0 | 0.0 | 5.4709 |
| ElectraForCausalLM | 1 | 1.0362 | 0.8465 | 0.0 | 0.0 | 5.4366 |
| YituTechConvBert | 1 | 1.0208 | 0.8384 | 0.0 | 0.0 | 4.6274 |
| MegatronBertForCausalLM | 2 | 1.0325 | 0.8502 | 0.0 | 0.0 | 4.1375 |
| M2M100ForConditionalGeneration | 2 | 1.0103 | 0.8308 | 0.0 | 0.0 | 4.0046 |
| RobertaForCausalLM | 4 | 1.0395 | 0.8381 | 0.0 | 0.0 | 3.9484 |
| OPTForCausalLM | 4 | 1.0159 | 0.8275 | 0.0 | 0.0 | 3.9047 |
| CamemBert | 1 | 1.0396 | 0.8447 | 0.0 | 0.0 | 3.4434 |
| PegasusForConditionalGeneration | 4 | 1.0105 | 0.8149 | 0.0 | 0.0 | 3.2421 |
| XGLMForCausalLM | 1 | 1.0117 | 0.8168 | 0.0 | 0.0 | 3.1117 |
| PLBartForConditionalGeneration | 8 | 1.0154 | 0.8245 | 0.0 | 0.0 | 2.8361 |
| MegatronBertForQuestionAnswering | 8 | 1.0376 | 0.859 | 0.0 | 0.0 | 2.688 |
| DistillGPT2 | 1 | 1.024 | 0.8702 | 0.0 | 0.0 | 2.6104 |
| MBartForConditionalGeneration | 8 | 1.0136 | 0.8357 | 0.0 | 0.0 | 2.3857 |
| Speech2Text2ForCausalLM | 64 | 1.0051 | 0.8348 | 0.0 | 0.0 | 2.2561 |
| GPT2ForSequenceClassification | 4 | 0.9993 | 0.9755 | 0.0 | 0.0 | 2.1462 |
| ElectraForQuestionAnswering | 64 | 0.9999 | 0.9776 | 0.0 | 0.0 | 1.9724 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0098 | 0.8688 | 0.0 | 0.0 | 1.9514 |
| TrOCRForCausalLM | 8 | 1.0113 | 0.829 | 0.0 | 0.0 | 1.9288 |
| PegasusForCausalLM | 8 | 1.0103 | 0.8014 | 0.0 | 0.0 | 1.8377 |
| DistilBertForMaskedLM | 16 | 1.0299 | 0.8455 | 0.0 | 0.0 | 1.8339 |
| BartForConditionalGeneration | 1 | 1.0151 | 0.8364 | 0.0 | 0.0 | 1.7748 |
| DistilBertForQuestionAnswering | 32 | 1.034 | 0.8491 | 0.0 | 0.0 | 1.7693 |
| LayoutLMForSequenceClassification | 16 | 0.9972 | 0.9671 | 0.0 | 0.0 | 1.7319 |
| T5ForConditionalGeneration | 4 | 1.0002 | 0.9362 | 0.0 | 0.0 | 1.7017 |
| AlbertForQuestionAnswering | 2 | 1.0011 | 0.808 | 0.0 | 0.0 | 1.6617 |
| AlbertForMaskedLM | 2 | 1.0004 | 0.808 | 0.0 | 0.0 | 1.6509 |
| PLBartForCausalLM | 16 | 1.0101 | 0.9365 | 0.0 | 0.0 | 1.6438 |
| T5Small | 1 | 1.0281 | 0.8763 | 0.0 | 0.0 | 1.6266 |
| XLNetLMHeadModel | 4 | 1.0006 | 0.9605 | 0.0 | 0.0 | 1.5968 |
| LayoutLMForMaskedLM | 16 | 0.9985 | 0.969 | 0.0 | 0.0 | 1.5917 |
| BartForCausalLM | 2 | 1.0003 | 0.9618 | 0.0 | 0.0 | 1.4597 |
| DebertaForQuestionAnswering | 4 | 0.9344 | 0.7279 | 0.9349 | 0.0 | 1.4504 |
| BertForQuestionAnswering | 64 | 0.9972 | 0.9677 | 0.0 | 0.0 | 1.446 |
| RobertaForQuestionAnswering | 64 | 0.9979 | 0.9686 | 0.0 | 0.0 | 1.4407 |
| DebertaForMaskedLM | 4 | 0.9334 | 0.7268 | 0.7967 | 0.0 | 1.4123 |
| MBartForCausalLM | 16 | 1.0091 | 0.823 | 0.0 | 0.0 | 1.3982 |
| BertForMaskedLM | 64 | 0.9973 | 0.956 | 0.0 | 0.0 | 1.3317 |
| BlenderbotSmallForCausalLM | 64 | 1.0004 | 0.927 | 0.0 | 0.0 | 1.3061 |
| BigBird | 1 | 0.9924 | 0.9078 | 0.0 | 0.0 | 1.1488 |
| AllenaiLongformerBase | 1 | 0.9546 | 0.7324 | 0.854 | 0.0 | 0.0 |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BigBird | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| CamemBert | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistilBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistillGPT2 | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| ElectraForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MegatronBertForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PLBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PegasusForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| RobertaForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| M2M100ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | 0.0000 |
| XGLMForCausalLM | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
| XLNetLMHeadModel | 4 | 17.5696 | 42.9354 | nan | nan | 327.9079 |
| MobileBertForMaskedLM | 16 | 135.8255 | 173.7536 | nan | nan | 308.991 |
| MobileBertForQuestionAnswering | 32 | 132.8536 | 171.7633 | nan | nan | 293.2592 |
| T5ForConditionalGeneration | 4 | 3.922 | 12.943 | nan | nan | 248.2661 |
| M2M100ForConditionalGeneration | 2 | 26.2972 | 44.8597 | nan | nan | 223.5051 |
| MT5ForConditionalGeneration | 2 | 6.6742 | 19.818 | nan | nan | 203.911 |
| YituTechConvBert | 1 | 9.2675 | 20.8203 | nan | nan | 195.7945 |
| MBartForConditionalGeneration | 8 | 26.6741 | 47.2119 | nan | nan | 173.1357 |
| XGLMForCausalLM | 1 | 15.5248 | 30.2576 | nan | nan | 170.6504 |
| PegasusForConditionalGeneration | 4 | 26.1067 | 45.261 | nan | nan | 167.2376 |
| DebertaForMaskedLM | 4 | 7.4344 | 14.5652 | 53.2994 | nan | 164.457 |
| BartForConditionalGeneration | 1 | 26.3695 | 45.964 | nan | nan | 155.1766 |
| MegatronBertForCausalLM | 2 | 16.6556 | 31.6041 | nan | nan | 151.6307 |
| MegatronBertForQuestionAnswering | 8 | 16.8169 | 31.9636 | nan | nan | 149.1209 |
| T5Small | 1 | 3.949 | 12.7818 | nan | nan | 148.9024 |
| PLBartForConditionalGeneration | 8 | 7.4476 | 17.4072 | nan | nan | 135.8023 |
| BlenderbotSmallForConditionalGeneration | 32 | 12.7961 | 25.3891 | nan | nan | 127.313 |
| DebertaForQuestionAnswering | 4 | 7.1998 | 14.5172 | 53.5221 | nan | 122.565 |
| RobertaForCausalLM | 4 | 5.2604 | 13.0452 | nan | nan | 104.3819 |
| LayoutLMForSequenceClassification | 16 | 5.4858 | 12.9984 | nan | nan | 93.8895 |
| PegasusForCausalLM | 8 | 9.9544 | 17.0666 | nan | nan | 92.9066 |
| ElectraForQuestionAnswering | 64 | 5.2746 | 12.8697 | nan | nan | 92.0522 |
| MBartForCausalLM | 16 | 10.398 | 17.1343 | nan | nan | 85.7616 |
| OPTForCausalLM | 4 | 4.9946 | 12.1428 | nan | nan | 84.3122 |
| BertForMaskedLM | 64 | 5.1354 | 12.5731 | nan | nan | 84.208 |
| LayoutLMForMaskedLM | 16 | 5.6794 | 13.827 | nan | nan | 82.2904 |
| BartForCausalLM | 2 | 10.0323 | 17.1562 | nan | nan | 81.4293 |
| GPT2ForSequenceClassification | 4 | 3.6793 | 10.0043 | nan | nan | 78.5398 |
| TrOCRForCausalLM | 8 | 9.9941 | 17.2084 | nan | nan | 73.7194 |
| BlenderbotSmallForCausalLM | 64 | 4.8936 | 9.6694 | nan | nan | 73.5185 |
| ElectraForCausalLM | 1 | 5.385 | 12.7792 | nan | nan | 70.1919 |
| BigBird | 1 | 11.6176 | 20.1569 | nan | nan | 67.9323 |
| DistilBertForQuestionAnswering | 32 | 1.921 | 5.3883 | nan | nan | 67.798 |
| Speech2Text2ForCausalLM | 64 | 3.2046 | 6.8548 | nan | nan | 67.6319 |
| AlbertForMaskedLM | 2 | 1.5751 | 8.7428 | nan | nan | 67.5946 |
| DistillGPT2 | 1 | 1.5417 | 4.7305 | nan | nan | 66.8776 |
| PLBartForCausalLM | 16 | 3.3208 | 7.2321 | nan | nan | 65.4376 |
| CamemBert | 1 | 5.225 | 12.6121 | nan | nan | 64.2281 |
| RobertaForQuestionAnswering | 64 | 5.5016 | 12.5104 | nan | nan | 62.6216 |
| BertForQuestionAnswering | 64 | 5.228 | 12.4546 | nan | nan | 61.755 |
| DistilBertForMaskedLM | 16 | 1.9566 | 5.6066 | nan | nan | 51.4273 |
| AlbertForQuestionAnswering | 2 | 1.6953 | 8.6595 | nan | nan | 45.7502 |
| AllenaiLongformerBase | 1 | 12.2056 | 22.4955 | 93.5262 | nan | nan |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| GPT2ForSequenceClassification | 4 | 0.9675 | 0.9163 | nan | nan | 1.07 |
| XLNetLMHeadModel | 4 | 0.9912 | 0.8791 | nan | nan | 1.0109 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9539 | nan | nan | 1.0002 |
| T5Small | 1 | 1.0 | 0.9124 | nan | nan | 0.9876 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | nan | nan | 0.9871 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | nan | nan | 0.9811 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | nan | nan | 0.9712 |
| BlenderbotSmallForConditionalGeneration | 32 | 0.9998 | 0.8996 | nan | nan | 0.9557 |
| BartForCausalLM | 2 | 1.0 | 0.8769 | nan | nan | 0.9545 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9594 | nan | nan | 0.9525 |
| Speech2Text2ForCausalLM | 64 | 0.9954 | 0.8489 | nan | nan | 0.9452 |
| PLBartForCausalLM | 16 | 1.0006 | 0.8667 | nan | nan | 0.9395 |
| BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | nan | nan | 0.9269 |
| BertForQuestionAnswering | 64 | 0.9995 | 0.9315 | nan | nan | 0.9256 |
| RobertaForQuestionAnswering | 64 | 0.9996 | 0.9315 | nan | nan | 0.9254 |
| DistilBertForMaskedLM | 16 | 0.9991 | 0.8698 | nan | nan | 0.9167 |
| BartForConditionalGeneration | 1 | 1.0 | 0.8619 | nan | nan | 0.881 |
| AlbertForQuestionAnswering | 2 | 1.0 | 0.6451 | nan | nan | 0.8636 |
| MBartForCausalLM | 16 | 1.0 | 0.8398 | nan | nan | 0.8565 |
| AlbertForMaskedLM | 2 | 1.0 | 0.6364 | nan | nan | 0.8515 |
| BigBird | 1 | 1.0024 | 0.9513 | nan | nan | 0.8349 |
| DistilBertForQuestionAnswering | 32 | 0.9987 | 0.8967 | nan | nan | 0.834 |
| PLBartForConditionalGeneration | 8 | 0.9999 | 0.8307 | nan | nan | 0.8252 |
| DistillGPT2 | 1 | 1.0006 | 0.7548 | nan | nan | 0.812 |
| MBartForConditionalGeneration | 8 | 0.9999 | 0.8187 | nan | nan | 0.7699 |
| TrOCRForCausalLM | 8 | 1.0 | 0.7955 | nan | nan | 0.7566 |
| CamemBert | 1 | 0.9989 | 0.7872 | nan | nan | 0.7482 |
| OPTForCausalLM | 4 | 0.9975 | 0.7501 | nan | nan | 0.7473 |
| YituTechConvBert | 1 | 0.9718 | 0.7819 | nan | nan | 0.7407 |
| PegasusForCausalLM | 8 | 0.999 | 0.9444 | nan | nan | 0.7324 |
| RobertaForCausalLM | 4 | 0.9237 | 0.7741 | nan | nan | 0.7309 |
| XGLMForCausalLM | 1 | 0.9999 | 0.9992 | nan | nan | 0.7214 |
| MegatronBertForQuestionAnswering | 8 | 0.9051 | 0.8218 | nan | nan | 0.7107 |
| MobileBertForMaskedLM | 16 | 0.9985 | 0.8983 | nan | nan | 0.6948 |
| PegasusForConditionalGeneration | 4 | 0.9996 | 0.9196 | nan | nan | 0.6769 |
| ElectraForCausalLM | 1 | 0.9993 | 0.8955 | nan | nan | 0.6701 |
| MegatronBertForCausalLM | 2 | 0.7726 | 0.7726 | nan | nan | 0.6697 |
| M2M100ForConditionalGeneration | 2 | 0.9999 | 0.954 | nan | nan | 0.6523 |
| MobileBertForQuestionAnswering | 32 | 1.0142 | 0.9796 | nan | nan | 0.6265 |
| MT5ForConditionalGeneration | 2 | 0.6019 | 0.6019 | nan | nan | 0.6019 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9826 | 0.3599 | nan | 0.4498 |
| DebertaForQuestionAnswering | 4 | 0.979 | 1.0568 | 0.3576 | nan | 0.3761 |
| AllenaiLongformerBase | 1 | 0.9996 | 0.9477 | 0.3752 | nan | nan |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~
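
For anyone who wants a single number per backend out of a speedup table like the huggingface one, here is a minimal sketch that parses the pipe-delimited rows and takes a geometric mean. It assumes that a `0.0` (or `nan`) speedup means the backend did not produce a measurement for that model, which matches the `fail_to_run` entries in the Accuracy table but is not stated by the dashboard itself. The helper names `parse_table` and `geomean_speedup` are made up for this example; the sample rows are copied from the table.

```python
import math

# Excerpt copied from the huggingface "Performance speedup" table.
TABLE = """
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
| MobileBertForQuestionAnswering | 32 | 1.0156 | 0.8169 | 0.0 | 0.0 | 5.8307 |
| DebertaForQuestionAnswering | 4 | 0.9344 | 0.7279 | 0.9349 | 0.0 | 1.4504 |
| BigBird | 1 | 0.9924 | 0.9078 | 0.0 | 0.0 | 1.1488 |
| AllenaiLongformerBase | 1 | 0.9546 | 0.7324 | 0.854 | 0.0 | 0.0 |
"""


def parse_table(text):
    """Parse pipe-delimited rows into dicts keyed by header; +---+ borders are skipped."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(records, backend):
    """Return (geometric mean, count) over models with a usable speedup.

    Assumption: 0.0 and nan both mean "no measurement" and are excluded.
    """
    values = [float(r[backend]) for r in records]
    ok = [v for v in values if v > 0.0]  # nan > 0.0 is False, so nan is dropped too
    if not ok:
        return float("nan"), 0
    return math.exp(sum(math.log(v) for v in ok) / len(ok)), len(ok)


records = parse_table(TABLE)
for backend in ("eager", "aot_eager", "aot_cudagraphs", "aot_nvfuser", "inductor"):
    gm, n = geomean_speedup(records, backend)
    print(f"{backend:>15}: geomean={gm:.4f} over {n}/{len(records)} models")
```

Reporting the count next to the mean matters here: a backend that only runs a handful of models can otherwise look better than one that runs all of them.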

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| res2net50_14w_8s | 2 | 0.9966 | 0.8973 | 0.0 | 1.3904 | 5.5892 |
| hrnet_w18 | 2 | 1.004 | 0.9636 | 0.0 | 1.3727 | 4.9403 |
| res2next50 | 2 | 1.0004 | 0.9702 | 0.0 | 1.363 | 4.678 |
| twins_pcpvt_base | 32 | 1.0036 | 0.8938 | 0.0 | 1.3592 | 2.5448 |
| xcit_large_24_p8_224 | 5 | 1.0012 | 0.0 | 0.0 | 0.0 | 2.0556 |
| cait_m36_384 | 2 | 1.0024 | 0.8465 | 0.0 | 1.3421 | 2.0541 |
| tnt_s_patch16_224 | 64 | 0.9997 | 0.9927 | 0.0 | 1.8446 | 2.0203 |
| ghostnet_100 | 128 | 1.0043 | 0.9984 | 0.0 | 1.5386 | 1.8112 |
| gmixer_24_224 | 64 | 1.0008 | 0.8843 | 0.0 | 1.0368 | 1.6807 |
| volo_d1_224 | 64 | 0.9994 | 0.9943 | 0.0 | 1.1498 | 1.6678 |
| crossvit_9_240 | 64 | 1.0032 | 0.9572 | 0.0 | 1.1315 | 1.5867 |
| nfnet_l0 | 64 | 1.0066 | 0.8388 | 0.0 | 1.1434 | 1.5833 |
| swin_base_patch4_window7_224 | 64 | 0.9993 | 0.961 | 0.0 | 1.0563 | 1.5723 |
| lcnet_050 | 128 | 0.9684 | 0.9499 | 0.0 | 1.5746 | 1.5519 |
| coat_lite_mini | 128 | 1.0002 | 0.9947 | 0.0 | 1.2658 | 1.5316 |
| regnety_002 | 128 | 0.9786 | 0.9364 | 0.0 | 1.3847 | 1.5049 |
| resnest101e | 32 | 1.0032 | 0.9843 | 0.0 | 1.4186 | 1.4787 |
| resmlp_12_224 | 128 | 1.0 | 0.9975 | 0.7819 | 0.0 | 1.4644 |
| jx_nest_base | 32 | 0.9992 | 0.9909 | 0.0 | 1.238 | 1.4634 |
| convit_base | 32 | 0.9995 | 0.9916 | 0.0 | 0.0 | 1.3992 |
| gmlp_s16_224 | 64 | 0.9989 | 0.9827 | 0.0 | 1.051 | 1.3904 |
| pit_b_224 | 64 | 0.9997 | 0.9939 | 0.0 | 1.0687 | 1.3644 |
| dm_nfnet_f0 | 128 | 0.9993 | 0.9976 | 0.0 | 1.1757 | 1.326 |
| mixer_b16_224 | 64 | 0.9992 | 0.9907 | 0.7171 | 0.9682 | 1.3168 |
| deit_base_distilled_patch16_224 | 64 | 0.9994 | 0.9911 | 0.0 | 1.071 | 1.2892 |
| beit_base_patch16_224 | 64 | 0.9997 | 0.9783 | 0.0 | 1.0509 | 1.2862 |
| adv_inception_v3 | 128 | 0.9998 | 0.9953 | 0.0 | 1.1938 | 1.2801 |
| gluon_inception_v3 | 128 | 1.0 | 0.9948 | 0.0 | 1.1944 | 1.2254 |
| poolformer_m36 | 64 | 0.999 | 0.9974 | 0.0 | 0.0 | 1.209 |
| inception_v3 | 128 | 0.9999 | 0.995 | 0.0 | 1.1944 | 1.2078 |
| mobilevit_s | 32 | 0.9736 | 0.7981 | 0.0 | 1.2122 | 1.2009 |
| vit_base_patch16_224 | 64 | 0.9995 | 0.9934 | 0.0 | 1.0006 | 1.1978 |
| mixnet_l | 64 | 0.9791 | 0.8892 | 0.0 | 1.0867 | 1.178 |
| tf_mixnet_l | 64 | 0.9808 | 0.894 | 0.0 | 1.1188 | 1.1177 |
| visformer_small | 128 | 0.9999 | 1.0005 | 0.0 | 1.0857 | 1.0997 |
| pnasnet5large | 16 | 1.0052 | 1.0336 | 0.0 | 1.1349 | 1.052 |
| fbnetv3_b | 128 | 0.9596 | 0.9445 | 0.0 | 1.2915 | 1.0325 |
| dla102 | 64 | 1.0033 | 0.9902 | 0.0 | 1.3766 | 1.0242 |
| dpn107 | 32 | 0.9389 | 0.9299 | 0.0 | 0.9938 | 0.9342 |
| repvgg_a2 | 128 | 0.9422 | 0.9332 | 0.6563 | 1.1301 | 0.9011 |
| fbnetc_100 | 128 | 0.952 | 0.9423 | 0.6644 | 1.3738 | 0.8982 |
| selecsls42b | 128 | 0.9998 | 0.9936 | 0.0 | 1.3554 | 0.8981 |
| cspdarknet53 | 64 | 0.9431 | 0.9323 | 0.0 | 0.9006 | 0.8892 |
| convmixer_768_32 | 32 | 0.9998 | 0.9979 | 0.0 | 1.0527 | 0.8866 |
| tinynet_a | 128 | 0.9575 | 0.8062 | 0.0 | 1.0907 | 0.8775 |
| mnasnet_100 | 128 | 0.9523 | 0.9433 | 0.6613 | 1.3688 | 0.8396 |
| convnext_base | 32 | 1.0041 | 0.9226 | 0.0 | 1.3138 | 0.8371 |
| mobilenetv3_large_100 | 128 | 0.9548 | 0.9437 | 0.0 | 1.3436 | 0.8349 |
| res2net101_26w_4s | 64 | 1.0 | 0.9969 | 0.0 | 1.3864 | 0.8124 |
| gernet_l | 128 | 0.9461 | 0.9361 | 0.0 | 1.1391 | 0.8112 |
| spnasnet_100 | 128 | 0.9468 | 0.9375 | 0.6531 | 1.3174 | 0.7948 |
| mobilenetv2_100 | 128 | 0.9504 | 0.9396 | 0.0 | 0.8657 | 0.7434 |
| sebotnet33ts_256 | 64 | 0.9669 | 0.8365 | 0.0 | 1.1144 | 0.734 |
| tf_efficientnet_b0 | 128 | 0.9647 | 0.8063 | 0.0 | 1.0946 | 0.7245 |
| botnet26t_256 | 128 | 0.9792 | 0.9756 | 0.0 | 1.3411 | 0.7229 |
| eca_halonext26ts | 64 | 0.9636 | 0.8061 | 0.0 | 1.0992 | 0.705 |
| eca_botnext26ts_256 | 64 | 0.9616 | 0.8005 | 0.0 | 1.1086 | 0.6749 |
| rexnet_100 | 128 | 0.9646 | 0.8483 | 0.0 | 1.0366 | 0.6448 |
| ese_vovnet19b_dw | 128 | 0.9691 | 0.9642 | 0.0 | 1.2435 | 0.6419 |
| swsl_resnext101_32x16d | 32 | 0.9989 | 0.9801 | 0.0 | 1.0755 | 0.6057 |
| gluon_xception65 | 32 | 0.9985 | 0.9876 | 0.0 | 1.0635 | 0.5872 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| fbnetc_100 | 2 | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass |
| adv_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| botnet26t_256 | 2 | pass | pass | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | fail_to_run | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | pass | pass |
| crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass |
| cspdarknet53 | 2 | pass | pass | fail_to_run | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| dla102 | 2 | pass | pass | fail_to_run | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | fail_to_run | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | fail_to_run | pass | pass |
| eca_halonext26ts | 2 | pass | pass | fail_to_run | pass | pass |
| gernet_l | 2 | pass | pass | fail_to_run | pass | pass |
| ghostnet_100 | 2 | pass | pass | fail_to_run | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass |
| inception_v3 | 2 | pass | pass | fail_to_run | pass | pass |
| lcnet_050 | 2 | pass | pass | fail_to_run | pass | pass |
| mixnet_l | 2 | pass | pass | fail_to_run | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | fail_to_run | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | fail_to_run | pass | pass |
| nfnet_l0 | 2 | pass | pass | fail_to_run | pass | pass |
| pnasnet5large | 2 | pass | pass | fail_to_run | pass | pass |
| regnety_002 | 2 | pass | pass | fail_to_run | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | fail_to_run | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | fail_to_run | pass | pass |
| res2next50 | 2 | pass | pass | fail_to_run | pass | pass |
| rexnet_100 | 2 | pass | pass | fail_to_run | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | fail_to_run | pass | pass |
| selecsls42b | 2 | pass | pass | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | fail_to_run | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | fail_to_run | pass | pass |
| tf_mixnet_l | 2 | pass | pass | fail_to_run | pass | pass |
| tinynet_a | 2 | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| visformer_small | 2 | pass | pass | fail_to_run | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass |
| gmixer_24_224 | 2 | pass | pass | pass | fail_accuracy | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | fail_accuracy | pass |
| mixer_b16_224 | 2 | pass | pass | pass | fail_accuracy | pass |
| poolformer_m36 | 2 | pass | pass | fail_to_run | fail_accuracy | pass |
| resnest101e | 2 | pass | pass | fail_to_run | fail_accuracy | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| jx_nest_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| pit_b_224 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| twins_pcpvt_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| ese_vovnet19b_dw | 2 | pass | pass | fail_to_run | pass | fail_accuracy |
| gluon_xception65 | 2 | pass | pass | fail_to_run | pass | fail_accuracy |
| hrnet_w18 | 2 | pass | pass | fail_to_run | pass | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
| hrnet_w18 | 2 | 99.5471 | 142.1533 | nan | 471.0836 | 1399.7866 |
| dpn107 | 32 | 13.8554 | 28.7483 | nan | 112.6905 | 1352.9515 |
| pnasnet5large | 16 | 60.4879 | 88.5364 | nan | 251.1834 | 1340.2779 |
| rexnet_100 | 128 | 6.8038 | 14.4292 | nan | 120.8599 | 1069.0229 |
| res2net50_14w_8s | 2 | 20.6816 | 38.8763 | nan | 121.3153 | 987.9243 |
| ghostnet_100 | 128 | 9.5437 | 19.3356 | nan | 96.7996 | 882.697 |
| mobilevit_s | 32 | 5.9457 | 13.7479 | nan | 61.5465 | 879.8939 |
| twins_pcpvt_base | 32 | 26.7453 | 43.9793 | nan | 95.4658 | 843.4319 |
| eca_botnext26ts_256 | 64 | 2.6274 | 7.452 | nan | 63.6443 | 839.8298 |
| mixnet_l | 64 | 13.5416 | 23.0219 | nan | 88.4897 | 835.172 |
| fbnetv3_b | 128 | 13.3888 | 24.1748 | nan | 109.7611 | 772.9081 |
| tinynet_a | 128 | 7.7469 | 15.6905 | nan | 83.963 | 743.7648 |
| resnest101e | 32 | 27.3378 | 47.7018 | nan | 125.8207 | 700.9297 |
| sebotnet33ts_256 | 64 | 3.961 | 10.0038 | nan | 69.1966 | 648.9216 |
| fbnetc_100 | 128 | 5.7278 | 12.479 | 85.5777 | 63.2472 | 638.7629 |
| coat_lite_mini | 128 | 3.266 | 9.1575 | nan | 34.2188 | 636.5705 |
| botnet26t_256 | 128 | 2.4678 | 6.7299 | nan | 51.027 | 588.0481 |
| tf_mixnet_l | 64 | 13.7087 | 23.521 | nan | 89.4409 | 565.3166 |
| dla102 | 64 | 10.8136 | 22.7803 | nan | 96.3515 | 540.0435 |
| eca_halonext26ts | 64 | 2.7442 | 7.8035 | nan | 67.5506 | 524.7978 |
| cspdarknet53 | 64 | 6.1577 | 13.819 | nan | 44.4818 | 516.8901 |
| res2next50 | 2 | 7.5752 | 17.4257 | nan | 64.6808 | 508.6052 |
| mnasnet_100 | 128 | 4.2555 | 9.7943 | 61.6838 | 53.8161 | 460.7911 |
| tf_efficientnet_b0 | 128 | 5.9847 | 12.8606 | nan | 81.5061 | 453.8244 |
| convnext_base | 32 | 11.9958 | 19.1913 | nan | 46.8047 | 447.0484 |
| res2net101_26w_4s | 64 | 26.2277 | 46.854 | nan | 142.2874 | 442.9257 |
| swin_base_patch4_window7_224 | 64 | 12.9152 | 26.0704 | nan | 83.2757 | 431.9341 |
| adv_inception_v3 | 128 | 8.7126 | 19.1453 | nan | 105.887 | 431.2125 |
| nfnet_l0 | 64 | 6.0435 | 13.1945 | nan | 38.7061 | 399.3298 |
| mobilenetv3_large_100 | 128 | 4.5585 | 10.0823 | nan | 83.9393 | 397.9304 |
| mobilenetv2_100 | 128 | 4.1958 | 9.4317 | nan | 43.0148 | 392.8332 |
| regnety_002 | 128 | 4.9475 | 10.8999 | nan | 60.0576 | 392.0666 |
| ese_vovnet19b_dw | 128 | 2.0528 | 5.1975 | nan | 39.6613 | 388.7989 |
| visformer_small | 128 | 2.5656 | 6.5413 | nan | 31.5689 | 381.6193 |
| xcit_large_24_p8_224 | 5 | 37.375 | nan | nan | nan | 352.6873 |
| gluon_xception65 | 32 | 15.5262 | 29.4819 | nan | 78.65 | 347.3151 |
| jx_nest_base | 32 | 9.7428 | 20.6265 | nan | 58.19 | 320.1488 |
| cait_m36_384 | 2 | 47.7696 | 73.0288 | nan | 109.6265 | 306.2536 |
| poolformer_m36 | 64 | 13.3601 | 21.2852 | nan | nan | 303.6845 |
| gernet_l | 128 | 4.9595 | 11.1233 | nan | 47.9657 | 292.4524 |
| crossvit_9_240 | 64 | 7.76 | 17.4312 | nan | 42.3485 | 281.7218 |
| selecsls42b | 128 | 2.5011 | 6.9082 | nan | 51.4771 | 280.0613 |
| gluon_inception_v3 | 128 | 8.5703 | 18.794 | nan | 105.7343 | 276.2485 |
| spnasnet_100 | 128 | 5.5887 | 12.1725 | 81.9197 | 60.788 | 274.2042 |
| lcnet_050 | 128 | 2.0128 | 5.131 | nan | 39.9489 | 244.1516 |
| inception_v3 | 128 | 8.5271 | 18.8796 | nan | 106.0345 | 223.4377 |
| swsl_resnext101_32x16d | 32 | 10.3836 | 22.0073 | nan | 61.957 | 221.0566 |
| volo_d1_224 | 64 | 6.7957 | 15.1805 | nan | 44.0256 | 211.2936 |
| convit_base | 32 | 4.3998 | 10.9888 | nan | nan | 182.1969 |
| pit_b_224 | 64 | 3.947 | 10.0387 | nan | 27.9232 | 182.0427 |
| tnt_s_patch16_224 | 64 | 12.8349 | 25.0047 | nan | 48.7462 | 166.9154 |
| gmlp_s16_224 | 64 | 9.5498 | 17.7147 | nan | 30.2121 | 151.1543 |
| gmixer_24_224 | 64 | 8.8075 | 17.6468 | nan | 35.1994 | 141.2526 |
| repvgg_a2 | 128 | 4.8536 | 10.7212 | 53.3933 | 65.0058 | 138.1295 |
| dm_nfnet_f0 | 128 | 6.6351 | 13.4685 | nan | 42.0166 | 133.3581 |
| resmlp_12_224 | 128 | 2.9325 | 6.0916 | 10.2012 | nan | 100.1161 |
| mixer_b16_224 | 64 | 2.9455 | 7.0706 | 17.11 | 18.0814 | 96.6917 |
| beit_base_patch16_224 | 64 | 4.9563 | 10.3961 | nan | 21.2115 | 91.1698 |
| convmixer_768_32 | 32 | 7.1212 | 14.8516 | nan | 23.9174 | 89.8166 |
| deit_base_distilled_patch16_224 | 64 | 3.2165 | 8.1922 | nan | 16.7035 | 84.6066 |
| vit_base_patch16_224 | 64 | 3.0903 | 7.8695 | nan | 16.0406 | 71.5226 |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| gmixer_24_224 | 64 | 1.0001 | 0.9563 | nan | 0.8998 | 1.2577 |
| gmlp_s16_224 | 64 | 1.0 | 0.9679 | nan | 0.92 | 1.2405 |
| tinynet_a | 128 | 1.0001 | 0.7955 | nan | 0.7958 | 1.1632 |
| pnasnet5large | 16 | 1.0583 | 0.9923 | nan | 1.1741 | 1.1265 |
| eca_halonext26ts | 64 | 0.999 | 0.7814 | nan | 0.786 | 1.0887 |
| dm_nfnet_f0 | 128 | 0.9758 | 0.9039 | nan | 0.95 | 1.0616 |
| tnt_s_patch16_224 | 64 | 1.0 | 0.9718 | nan | 0.9431 | 1.0587 |
| volo_d1_224 | 64 | 1.0015 | 0.9518 | nan | 0.8587 | 1.0378 |
| convit_base | 32 | 0.9991 | 0.86 | nan | nan | 1.0309 |
| beit_base_patch16_224 | 64 | 0.9999 | 0.9367 | nan | 0.9298 | 1.0097 |
| mobilevit_s | 32 | 1.0 | 0.7722 | nan | 0.787 | 1.0078 |
| rexnet_100 | 128 | 0.9988 | 0.7919 | nan | 0.8648 | 1.0009 |
| dla102 | 64 | 0.9998 | 0.9549 | nan | 0.9751 | 0.997 |
| pit_b_224 | 64 | 1.0021 | 0.8074 | nan | 0.8179 | 0.9856 |
| poolformer_m36 | 64 | 1.0015 | 0.9462 | nan | nan | 0.9797 |
| convnext_base | 32 | 1.0065 | 0.908 | nan | 0.7521 | 0.9564 |
| twins_pcpvt_base | 32 | 0.9963 | 0.9079 | nan | 0.8007 | 0.9553 |
| convmixer_768_32 | 32 | 0.9992 | 0.9807 | nan | 0.9715 | 0.9508 |
| visformer_small | 128 | 0.9899 | 0.9353 | nan | 0.8884 | 0.9342 |
| resnest101e | 32 | 1.0002 | 0.9762 | nan | 0.9535 | 0.9292 |
| tf_mixnet_l | 64 | 0.9995 | 0.8624 | nan | 0.8426 | 0.9291 |
| mixer_b16_224 | 64 | 0.9929 | 0.9425 | 0.2532 | 0.7726 | 0.9225 |
| tf_efficientnet_b0 | 128 | 1.0006 | 0.7769 | nan | 0.846 | 0.9189 |
| nfnet_l0 | 64 | 0.9993 | 0.824 | nan | 0.8257 | 0.913 |
| mobilenetv2_100 | 128 | 0.9992 | 0.7716 | nan | 0.9249 | 0.8963 |
| vit_base_patch16_224 | 64 | 0.9955 | 0.9384 | nan | 0.8801 | 0.8916 |
| deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9376 | nan | 0.8794 | 0.8911 |
| mobilenetv3_large_100 | 128 | 0.9987 | 0.8562 | nan | 0.8673 | 0.8886 |
| adv_inception_v3 | 128 | 1.0003 | 0.8759 | nan | 0.8538 | 0.8829 |
| gluon_inception_v3 | 128 | 1.0003 | 0.8759 | nan | 0.8538 | 0.8829 |
| inception_v3 | 128 | 1.0003 | 0.8759 | nan | 0.8538 | 0.8829 |
| gluon_xception65 | 32 | 1.0 | 0.8895 | nan | 0.8854 | 0.8713 |
| dpn107 | 32 | 0.9981 | 0.9115 | nan | 0.8834 | 0.8701 |
| selecsls42b | 128 | 0.9789 | 0.8913 | nan | 0.8811 | 0.8659 |
| fbnetv3_b | 128 | 1.0003 | 0.7918 | nan | 0.7903 | 0.8645 |
| mixnet_l | 64 | 0.9989 | 0.8507 | nan | 0.7796 | 0.8601 |
| spnasnet_100 | 128 | 0.9988 | 0.8961 | 0.1651 | 0.8371 | 0.8599 |
| eca_botnext26ts_256 | 64 | 0.9998 | 0.7776 | nan | 0.7813 | 0.8532 |
| swsl_resnext101_32x16d | 32 | 1.0009 | 0.8805 | nan | 0.8487 | 0.8523 |
| xcit_large_24_p8_224 | 5 | 0.9987 | nan | nan | nan | 0.8489 |
| resmlp_12_224 | 128 | 0.9827 | 0.9667 | 0.2637 | nan | 0.845 |
| ghostnet_100 | 128 | 1.0013 | 0.8903 | nan | 0.9244 | 0.833 |
| coat_lite_mini | 128 | 1.0338 | 0.929 | nan | 0.6593 | 0.8328 |
| ese_vovnet19b_dw | 128 | 1.0 | 0.867 | nan | 0.9146 | 0.8269 |
| cspdarknet53 | 64 | 1.0 | 0.8469 | nan | 0.7906 | 0.813 |
| cait_m36_384 | 2 | 0.9998 | 0.8806 | nan | 0.9023 | 0.8081 |
| jx_nest_base | 32 | 1.0 | 0.8945 | nan | 0.86 | 0.8 |
| crossvit_9_240 | 64 | 1.0008 | 0.8801 | nan | 0.8854 | 0.7933 |
| res2net101_26w_4s | 64 | 0.9999 | 0.9202 | nan | 0.8569 | 0.7834 |
| mnasnet_100 | 128 | 0.9993 | 0.8882 | 0.1669 | 0.8253 | 0.773 |
| swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9234 | nan | 0.8451 | 0.7676 |
| sebotnet33ts_256 | 64 | 0.9999 | 0.7108 | nan | 0.7354 | 0.7449 |
| gernet_l | 128 | 0.9998 | 0.8655 | nan | 0.8299 | 0.7238 |
| fbnetc_100 | 128 | 0.9984 | 0.8631 | 0.1626 | 0.7352 | 0.7104 |
| lcnet_050 | 128 | 0.9992 | 0.7927 | nan | 0.7885 | 0.705 |
| regnety_002 | 128 | 0.9994 | 0.8284 | nan | 0.7819 | 0.6975 |
| botnet26t_256 | 128 | 1.0 | 0.8755 | nan | 0.78 | 0.6616 |
| res2next50 | 2 | 1.0 | 0.8301 | nan | 0.8198 | 0.6012 |
| res2net50_14w_8s | 2 | 1.0 | 0.8275 | nan | 0.8169 | 0.5927 |
| hrnet_w18 | 2 | 1.0 | 0.8383 | nan | 0.8363 | 0.5746 |
| repvgg_a2 | 128 | 1.0003 | 0.7971 | 0.1444 | 0.6902 | 0.5572 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Performance graphs

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/ASGiM8I.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/xWHHy9P.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/E4sJ4qZ.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
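The filtering and averaging described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the function name `geomean_speedup` and the dictionary layout are assumptions made for this example, and the sample values are copied from the huggingface amp tables in this thread.

```python
import math

def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups, skipping models that failed accuracy.

    speedups: model name -> speedup over native PyTorch (hypothetical layout)
    accuracy: model name -> status string such as "pass" or "fail_to_run"
    """
    kept = [
        s for name, s in speedups.items()
        if accuracy.get(name) == "pass" and s > 0.0
    ]
    if not kept:
        return 0.0  # nothing passed, so there is no speedup to report
    # Average in log space, then map back: exp(mean(log(s))).
    return math.exp(sum(math.log(s) for s in kept) / len(kept))

speedups = {
    "ElectraForCausalLM": 6.3238,
    "BigBird": 1.1505,
    "AllenaiLongformerBase": 0.0,
}
accuracy = {
    "ElectraForCausalLM": "pass",
    "BigBird": "pass",
    "AllenaiLongformerBase": "fail_to_run",
}
print(f"{geomean_speedup(speedups, accuracy):.2f}x")
```

A geometric mean suits ratio-style metrics like speedup because a single very large value does not dominate the summary the way it would in an arithmetic mean.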

Passrate

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager | 98%, 42/43  |
| inductor  | 84%, 36/43  |
+-----------+-------------+
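
Each passrate cell pairs a rounded percentage with the raw counts. A small sketch of that formatting follows, assuming a model counts as passing only when its accuracy status is exactly `pass` (the thread does not say how other statuses such as `pass_due_to_skip` are counted). `format_passrate` is a hypothetical helper name.

```python
def format_passrate(statuses):
    """Format a list of accuracy statuses as 'NN%, passed/total'."""
    total = len(statuses)
    passed = sum(1 for s in statuses if s == "pass")
    if total == 0:
        return "0%, 0/0"
    return f"{passed / total:.0%}, {passed}/{total}"

# 42 passing models out of 43, as in the aot_eager row above.
print(format_passrate(["pass"] * 42 + ["fail_to_run"]))
```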

Geometric mean speedup

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    1.00x    |
| inductor  |    2.25x    |
+-----------+-------------+

Mean compilation time (seconds)

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    25.89    |
| inductor  |    87.60    |
+-----------+-------------+

Peak memory footprint compression ratio (higher is better)

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    0.86x    |
| inductor  |    0.83x    |
+-----------+-------------+
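
One way to read this ratio, assuming it is defined as the native PyTorch peak memory divided by the compiled peak memory (the definition is not spelled out in this thread, so treat it as an assumption): values below 1.0x mean the compiled run used more memory than native PyTorch. The helper names below are hypothetical.

```python
def compression_ratio(baseline_peak_bytes, compiled_peak_bytes):
    """Baseline peak memory divided by compiled peak memory (assumed definition)."""
    if compiled_peak_bytes <= 0:
        raise ValueError("compiled_peak_bytes must be positive")
    return baseline_peak_bytes / compiled_peak_bytes

def relative_memory_use(ratio):
    """Compiled peak memory as a multiple of the baseline, under the same assumption."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / ratio

# Under this reading, inductor's 0.83x means roughly 1.2 times the baseline peak memory.
print(f"{relative_memory_use(0.83):.2f}")
```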

Metrics over time

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/guTTUnl.png)

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/sSBt7XO.png)

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+----+-----------+----------+
| name | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+----------+
| ElectraForCausalLM | 1 | 0.8457 | 6.3238 |
| MobileBertForMaskedLM | 16 | 0.8464 | 6.2119 |
| MT5ForConditionalGeneration | 2 | 0.8564 | 5.4336 |
| MobileBertForQuestionAnswering | 32 | 0.8237 | 5.0662 |
| YituTechConvBert | 1 | 0.8386 | 4.728 |
| MegatronBertForCausalLM | 2 | 0.8526 | 4.216 |
| OPTForCausalLM | 4 | 0.8238 | 3.9855 |
| RobertaForCausalLM | 4 | 0.8428 | 3.9209 |
| PegasusForConditionalGeneration | 4 | 0.8341 | 3.8129 |
| M2M100ForConditionalGeneration | 2 | 0.815 | 3.7545 |
| XGLMForCausalLM | 1 | 0.8097 | 3.6427 |
| CamemBert | 1 | 0.8502 | 3.4717 |
| PLBartForConditionalGeneration | 8 | 0.8242 | 3.2557 |
| MegatronBertForQuestionAnswering | 8 | 0.8588 | 3.1214 |
| DistillGPT2 | 1 | 0.8651 | 2.624 |
| MBartForConditionalGeneration | 8 | 0.8457 | 2.4031 |
| GPT2ForSequenceClassification | 4 | 0.9749 | 2.1444 |
| Speech2Text2ForCausalLM | 64 | 0.8355 | 2.0987 |
| ElectraForQuestionAnswering | 64 | 0.9657 | 1.9725 |
| TrOCRForCausalLM | 8 | 0.8338 | 1.9162 |
| T5Small | 1 | 0.8839 | 1.8796 |
| DistilBertForMaskedLM | 16 | 0.8498 | 1.8745 |
| PegasusForCausalLM | 8 | 0.8044 | 1.8467 |
| BlenderbotSmallForConditionalGeneration | 32 | 0.8895 | 1.7928 |
| BartForConditionalGeneration | 1 | 0.8342 | 1.7886 |
| DistilBertForQuestionAnswering | 32 | 0.8476 | 1.7631 |
| LayoutLMForSequenceClassification | 16 | 0.9786 | 1.7266 |
| T5ForConditionalGeneration | 4 | 0.9354 | 1.6896 |
| AlbertForQuestionAnswering | 2 | 0.8084 | 1.6586 |
| AlbertForMaskedLM | 2 | 0.8084 | 1.6458 |
| XLNetLMHeadModel | 4 | 0.9599 | 1.5933 |
| PLBartForCausalLM | 16 | 0.9311 | 1.5454 |
| DebertaForQuestionAnswering | 4 | 0.7242 | 1.4752 |
| BartForCausalLM | 2 | 0.963 | 1.4663 |
| RobertaForQuestionAnswering | 64 | 0.9577 | 1.4464 |
| BertForQuestionAnswering | 64 | 0.9665 | 1.4415 |
| MBartForCausalLM | 16 | 0.8988 | 1.4018 |
| DebertaForMaskedLM | 4 | 0.729 | 1.3922 |
| BertForMaskedLM | 64 | 0.9562 | 1.3337 |
| BlenderbotSmallForCausalLM | 64 | 0.9249 | 1.303 |
| BigBird | 1 | 0.9124 | 1.1505 |
| LayoutLMForMaskedLM | 16 | 0.9693 | 0.0 |
| AllenaiLongformerBase | 1 | 0.7271 | 0.0 |
+-----------------------------------------+----+-----------+----------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-----------+-------------+
| name | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+-------------+
| AlbertForMaskedLM | 1 | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass |
| BartForCausalLM | 1 | pass | pass |
| BertForMaskedLM | 1 | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass |
| BigBird | 1 | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass |
| CamemBert | 1 | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass |
| DistillGPT2 | 1 | pass | pass |
| ElectraForCausalLM | 1 | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass |
| MBartForCausalLM | 1 | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass |
| OPTForCausalLM | 1 | pass | pass |
| PLBartForCausalLM | 1 | pass | pass |
| PegasusForCausalLM | 1 | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass |
| RobertaForCausalLM | 1 | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass |
| T5Small | 1 | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass |
| YituTechConvBert | 1 | pass | pass |
| AllenaiLongformerBase | 1 | pass | fail_to_run |
| BartForConditionalGeneration | 1 | pass | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | fail_to_run |
| M2M100ForConditionalGeneration | 1 | pass | 0.0000 |
| XGLMForCausalLM | 0 | 0.0000 | 0.0000 |
+-----------------------------------------+----+-----------+-------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+----+-----------+----------+
| name | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+----------+
| MobileBertForMaskedLM | 16 | 177.6258 | 256.3088 |
| MobileBertForQuestionAnswering | 32 | 173.307 | 244.0373 |
| M2M100ForConditionalGeneration | 2 | 46.0615 | 178.9848 |
| T5ForConditionalGeneration | 4 | 12.8008 | 154.8311 |
| MBartForConditionalGeneration | 8 | 47.4506 | 151.3125 |
| XGLMForCausalLM | 1 | 30.2354 | 144.8421 |
| BartForConditionalGeneration | 1 | 46.2761 | 140.5452 |
| MT5ForConditionalGeneration | 2 | 19.9685 | 140.3639 |
| XLNetLMHeadModel | 4 | 41.7665 | 137.2853 |
| PegasusForConditionalGeneration | 4 | 46.8928 | 136.7294 |
| DebertaForQuestionAnswering | 4 | 14.3774 | 130.8288 |
| DebertaForMaskedLM | 4 | 14.5131 | 120.3766 |
| MegatronBertForCausalLM | 2 | 31.7314 | 118.0195 |
| MegatronBertForQuestionAnswering | 8 | 31.7457 | 117.8187 |
| YituTechConvBert | 1 | 20.5728 | 101.3281 |
| BlenderbotSmallForConditionalGeneration | 32 | 25.0994 | 96.1492 |
| PLBartForConditionalGeneration | 8 | 17.4257 | 94.093 |
| T5Small | 1 | 12.6393 | 91.9498 |
| OPTForCausalLM | 4 | 12.2074 | 71.8761 |
| MBartForCausalLM | 16 | 17.439 | 68.2201 |
| ElectraForQuestionAnswering | 64 | 12.8934 | 67.2964 |
| TrOCRForCausalLM | 8 | 17.0497 | 65.3169 |
| LayoutLMForSequenceClassification | 16 | 12.9564 | 65.1504 |
| ElectraForCausalLM | 1 | 12.7619 | 64.8722 |
| BartForCausalLM | 2 | 16.9692 | 64.1101 |
| RobertaForQuestionAnswering | 64 | 13.3347 | 63.5298 |
| PegasusForCausalLM | 8 | 17.0779 | 62.4789 |
| BertForMaskedLM | 64 | 12.6206 | 61.9757 |
| BigBird | 1 | 20.3027 | 61.6895 |
| BertForQuestionAnswering | 64 | 12.604 | 61.571 |
| GPT2ForSequenceClassification | 4 | 10.1868 | 60.4961 |
| CamemBert | 1 | 12.6605 | 59.846 |
| RobertaForCausalLM | 4 | 12.8908 | 58.2581 |
| BlenderbotSmallForCausalLM | 64 | 9.5692 | 52.4237 |
| PLBartForCausalLM | 16 | 6.8495 | 48.1822 |
| AlbertForMaskedLM | 2 | 9.2523 | 45.7043 |
| AlbertForQuestionAnswering | 2 | 8.8283 | 45.5348 |
| DistillGPT2 | 1 | 4.7016 | 41.7968 |
| DistilBertForMaskedLM | 16 | 5.5164 | 39.5652 |
| DistilBertForQuestionAnswering | 32 | 5.4656 | 39.5651 |
| Speech2Text2ForCausalLM | 64 | 6.9795 | 38.258 |
| AllenaiLongformerBase | 1 | 22.9042 | nan |
| LayoutLMForMaskedLM | 16 | 13.22 | nan |
+-----------------------------------------+----+-----------+----------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+----+-----------+----------+
| name | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+----------+
| GPT2ForSequenceClassification | 4 | 0.9163 | 1.07 |
| ElectraForQuestionAnswering | 64 | 0.9539 | 1.0237 |
| XLNetLMHeadModel | 4 | 0.8791 | 1.0109 |
| T5Small | 1 | 0.9124 | 0.9876 |
| BertForMaskedLM | 64 | 0.899 | 0.9811 |
| LayoutLMForSequenceClassification | 16 | 0.9325 | 0.9712 |
| BlenderbotSmallForConditionalGeneration | 32 | 0.8996 | 0.9557 |
| BartForCausalLM | 2 | 0.8769 | 0.9545 |
| T5ForConditionalGeneration | 4 | 0.9594 | 0.9525 |
| Speech2Text2ForCausalLM | 64 | 0.8489 | 0.9452 |
| DistilBertForMaskedLM | 16 | 0.8698 | 0.9448 |
| ElectraForCausalLM | 1 | 0.8955 | 0.941 |
| PLBartForCausalLM | 16 | 0.8667 | 0.9395 |
| BlenderbotSmallForCausalLM | 64 | 0.8172 | 0.9269 |
| BertForQuestionAnswering | 64 | 0.9315 | 0.9256 |
| RobertaForQuestionAnswering | 64 | 0.9315 | 0.9254 |
| BartForConditionalGeneration | 1 | 0.8619 | 0.881 |
| AlbertForQuestionAnswering | 2 | 0.6451 | 0.8636 |
| MBartForCausalLM | 16 | 0.8398 | 0.8565 |
| AlbertForMaskedLM | 2 | 0.6364 | 0.8515 |
| BigBird | 1 | 0.9513 | 0.8349 |
| DistilBertForQuestionAnswering | 32 | 0.8967 | 0.8334 |
| PLBartForConditionalGeneration | 8 | 0.8307 | 0.8251 |
| DistillGPT2 | 1 | 0.7548 | 0.812 |
| MobileBertForMaskedLM | 16 | 0.8983 | 0.7803 |
| MBartForConditionalGeneration | 8 | 0.8187 | 0.7699 |
| TrOCRForCausalLM | 8 | 0.7955 | 0.7566 |
| CamemBert | 1 | 0.7872 | 0.7482 |
| OPTForCausalLM | 4 | 0.7501 | 0.7473 |
| YituTechConvBert | 1 | 0.7819 | 0.7407 |
| PegasusForCausalLM | 8 | 0.9444 | 0.7324 |
| RobertaForCausalLM | 4 | 0.7741 | 0.7309 |
| XGLMForCausalLM | 1 | 0.9992 | 0.7214 |
| MegatronBertForQuestionAnswering | 8 | 0.8218 | 0.7107 |
| PegasusForConditionalGeneration | 4 | 0.9196 | 0.6769 |
| MegatronBertForCausalLM | 2 | 0.7726 | 0.6697 |
| M2M100ForConditionalGeneration | 2 | 0.9497 | 0.6568 |
| MobileBertForQuestionAnswering | 32 | 0.9796 | 0.6265 |
| MT5ForConditionalGeneration | 2 | 0.6019 | 0.6019 |
| DebertaForMaskedLM | 4 | 0.9826 | 0.4498 |
| DebertaForQuestionAnswering | 4 | 1.0568 | 0.3761 |
| AllenaiLongformerBase | 1 | 0.9477 | nan |
| LayoutLMForMaskedLM | 16 | 0.9238 | nan |
+-----------------------------------------+----+-----------+----------+
~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/bjLDroe.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      | 91%, 50/55 | 98%, 43/44  | 100%, 61/61 |
|   aot_eager    | 89%, 49/55 | 98%, 43/44  | 90%, 55/61  |
| aot_cudagraphs | 25%, 14/55 |  0%, 0/44   |  2%, 1/61   |
|  aot_nvfuser   | 58%, 32/55 |  2%, 1/44   | 82%, 50/61  |
|    inductor    | 84%, 46/55 | 93%, 41/44  | 95%, 58/61  |
+----------------+------------+-------------+-------------+

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.02x    |    0.0x     |    1.00x    |
|  aot_nvfuser   |   1.13x    |    1.12x    |    1.12x    |
|    inductor    |   1.39x    |    1.60x    |    1.21x    |
+----------------+------------+-------------+-------------+

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    5.42    |    14.22    |    11.34    |
|   aot_eager    |    9.77    |    21.16    |    16.79    |
| aot_cudagraphs |    4.86    |     0.0     |    7.42     |
|  aot_nvfuser   |   22.48    |    10.56    |    57.73    |
|    inductor    |   238.15   |   109.27    |   366.65    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.95x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.86x    |    0.89x    |    0.88x    |
| aot_cudagraphs |   0.41x    |    0.0x     |    0.25x    |
|  aot_nvfuser   |   0.83x    |    1.08x    |    0.85x    |
|    inductor    |   0.78x    |    0.74x    |    0.90x    |
+----------------+------------+-------------+-------------+
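
The summary tables above use a plain-text grid layout. For readers who want to post-process them, a minimal parsing sketch follows. `parse_ascii_table` is a hypothetical helper written for this illustration, not part of the benchmark scripts, and it assumes every header and data line starts and ends with `|`.

```python
def parse_ascii_table(text):
    """Parse a grid-style text table into a list of {column: cell} dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ rule lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first pipe-delimited line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows

sample = """
+----------+------------+-------------+
| Compiler | torchbench | huggingface |
+----------+------------+-------------+
| inductor | 84%, 46/55 | 93%, 41/44  |
+----------+------------+-------------+
"""
for row in parse_ascii_table(sample):
    print(row["Compiler"], row["torchbench"], row["huggingface"])
```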

torchbench suite with float32 precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| densenet121 | 4 | 0.9991 | 1.0087 | 0.0 | 1.4479 | 4.9052 |
| timm_efficientdet | 1 | 0.9824 | 0.8787 | 0.0 | 0.0 | 3.9475 |
| functorch_dp_cifar10 | 64 | 0.9963 | 0.9772 | 0.0 | 1.197 | 3.6233 |
| timm_vision_transformer | 8 | 1.0025 | 0.9173 | 0.0 | 1.3464 | 2.5509 |
| drq | 1 | 1.0043 | 0.8529 | 0.0 | 1.0585 | 2.4584 |
| BERT_pytorch | 16 | 1.0078 | 0.8843 | 0.0 | 0.0 | 1.8656 |
| resnet18 | 16 | 1.0033 | 1.104 | 0.0 | 1.3915 | 1.8125 |
| dcgan | 32 | 0.9844 | 1.0223 | 1.0738 | 1.1668 | 1.7591 |
| lennard_jones | 1000 | 0.9793 | 0.8541 | 1.062 | 1.027 | 1.7573 |
| pytorch_struct | 200 | 0.9964 | 0.7439 | 0.8929 | 0.8905 | 1.7547 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0004 | 0.9318 | 1.1166 | 1.2026 | 1.7117 |
| hf_Albert | 8 | 1.0013 | 0.9975 | 0.0 | 0.0 | 1.6656 |
| squeezenet1_1 | 32 | 1.0075 | 1.0042 | 0.9826 | 1.1641 | 1.6018 |
| resnext50_32x4d | 8 | 1.0038 | 1.0848 | 0.0 | 1.36 | 1.513 |
| mobilenet_v3_large | 32 | 1.0035 | 1.1165 | 0.0 | 1.397 | 1.4827 |
| timm_nfnet | 128 | 0.9995 | 0.9997 | 0.0 | 1.211 | 1.4715 |
| hf_GPT2 | 4 | 1.0071 | 0.9753 | 0.0 | 0.0 | 1.4298 |
| hf_T5_large | 2 | 1.0245 | 0.8903 | 0.0 | 0.0 | 1.4073 |
| soft_actor_critic | 256 | 1.0007 | 0.7819 | 1.0121 | 1.045 | 1.377 |
| fastNLP_Bert | 6 | 0.9988 | 0.9748 | 0.0 | 0.0 | 1.3639 |
| hf_Bart | 4 | 1.0125 | 0.9707 | 0.0 | 0.0 | 1.2501 |
| LearningToPaint | 96 | 1.004 | 1.0612 | 0.0 | 1.2169 | 1.2114 |
| pytorch_unet | 1 | 1.0 | 0.9966 | 0.0 | 1.0756 | 1.2054 |
| Super_SloMo | 6 | 1.0 | 0.9973 | 0.0 | 0.0 | 1.1763 |
| vgg16 | 64 | 0.9999 | 0.9982 | 0.7928 | 0.9965 | 1.1707 |
| alexnet | 128 | 0.9999 | 0.9985 | 0.7786 | 1.001 | 1.1615 |
| hf_DistilBert | 8 | 0.9997 | 0.9551 | 0.0 | 0.0 | 1.1572 |
| hf_Bert | 4 | 1.0291 | 0.9963 | 0.0 | 0.0 | 1.1565 |
| mnasnet1_0 | 32 | 0.9998 | 1.1013 | 0.7453 | 1.3026 | 1.1524 |
| pytorch_stargan | 16 | 0.9992 | 0.9827 | 0.7293 | 1.0246 | 1.1189 |
| Background_Matting | 4 | 0.9997 | 1.0225 | 0.0 | 1.0825 | 1.1138 |
| hf_Reformer | 4 | 0.9965 | 0.0 | 0.8945 | 0.0 | 1.1108 |
| hf_BigBird | 2 | 0.9941 | 0.9398 | 0.0 | 0.0 | 1.0989 |
| timm_efficientnet | 32 | 0.961 | 0.8183 | 0.0 | 1.0739 | 1.0815 |
| shufflenet_v2_x1_0 | 128 | 1.0008 | 1.0519 | 0.0 | 1.1884 | 1.0746 |
| timm_vision_transformer_large | 8 | 0.9999 | 0.9935 | 0.0 | 0.982 | 1.0531 |
| attention_is_all_you_need_pytorch | 256 | 0.9973 | 0.9715 | 0.0 | 0.0 | 1.0492 |
| timm_resnest | 32 | 1.0 | 1.0019 | 0.0 | 1.1832 | 1.0416 |
| tts_angular | 64 | 0.9854 | 0.9625 | 0.9851 | 1.0031 | 1.0091 |
| demucs | 4 | 1.0004 | 1.0005 | 1.0 | 1.0003 | 1.0002 |
| dlrm | 2048 | 0.904 | 0.8836 | 0.0 | 0.0 | 0.9304 |
| timm_vovnet | 32 | 0.9122 | 0.9055 | 0.0 | 0.9776 | 0.9152 |
| nvidia_deeprecommender | 256 | 0.9991 | 0.9632 | 0.5842 | 0.9441 | 0.9044 |
| mobilenet_v2 | 96 | 0.9996 | 0.9986 | 0.0 | 1.0422 | 0.8514 |
| timm_regnet | 32 | 0.9654 | 0.964 | 0.0 | 1.0934 | 0.7595 |
| resnet50 | 32 | 0.9986 | 0.9934 | 0.0 | 1.1608 | 0.7378 |
| yolov3 | 16 | 0.9995 | 0.9944 | 0.0 | 1.1831 | 0.0 |
| hf_T5 | 8 | 1.0018 | 0.9897 | 0.0 | 0.0 | 0.0 |
| hf_GPT2_large | 4 | 0.9999 | 0.9805 | 0.0 | 0.0 | 0.0 |
| speech_transformer | 32 | 1.0016 | 0.9211 | 0.0 | 0.0 | 0.0 |
| hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mobilenet_v2_quantized_qat | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| resnet50_quantized_qat | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
~~~
Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| alexnet | 2 | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | fail_to_run | pass | pass |
| LearningToPaint | 2 | pass | pass | fail_to_run | pass | pass |
| densenet121 | 2 | pass | pass | fail_to_run | pass | pass |
| drq | 1 | pass | pass | fail_to_run | pass | pass |
| functorch_dp_cifar10 | 2 | pass | pass | fail_to_run | pass | pass |
| mobilenet_v2 | 2 | pass | pass | fail_to_run | pass | pass |
| mobilenet_v3_large | 2 | pass | pass | fail_to_run | pass | pass |
| pytorch_unet | 2 | pass | pass | fail_to_run | pass | pass |
| resnet18 | 2 | pass | pass | fail_to_run | pass | pass |
| resnet50 | 2 | pass | pass | fail_to_run | pass | pass |
| resnext50_32x4d | 2 | pass | pass | fail_to_run | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | fail_to_run | pass | pass |
| timm_efficientnet | 2 | pass | pass | fail_to_run | pass | pass |
| timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass |
| timm_regnet | 2 | pass | pass | fail_to_run | pass | pass |
| timm_resnest | 2 | pass | pass | fail_to_run | pass | pass |
| timm_vision_transformer | 2 | pass | pass | fail_to_run | pass | pass |
| timm_vovnet | 2 | pass | pass | fail_to_run | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass |
| BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| Super_SloMo | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| fastNLP_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Albert | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_BigBird | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_DistilBert | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_GPT2 | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| mobilenet_v2_quantized_qat | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| resnet50_quantized_qat | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| tacotron2 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| yolov3 | 2 | pass | pass | fail_to_run | fail_to_run | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
| timm_efficientdet | 1 | 51.5105 | 69.1348 | nan | nan | 1570.859 |
| densenet121 | 4 | 13.3562 | 24.7318 | nan | 99.5096 | 1347.1298 |
| hf_T5_large | 2 | 35.8088 | 65.6235 | nan | nan | 809.381 |
| mnasnet1_0 | 32 | 3.1228 | 6.7735 | 25.4378 | 33.2738 | 734.4542 |
| mobilenet_v3_large | 32 | 3.5899 | 7.2011 | nan | 56.0625 | 684.1648 |
| mobilenet_v2 | 96 | 3.0708 | 6.5693 | nan | 38.7425 | 544.2103 |
| resnext50_32x4d | 8 | 3.279 | 7.2643 | nan | 31.0494 | 528.878 |
| timm_efficientnet | 32 | 5.789 | 10.1753 | nan | 56.0369 | 448.183 |
| shufflenet_v2_x1_0 | 128 | 3.5612 | 7.8244 | nan | 29.147 | 348.9814 |
| squeezenet1_1 | 32 | 0.6208 | 1.2845 | 3.0094 | 4.8813 | 338.1927 |
| timm_resnest | 32 | 1.3141 | 3.3041 | nan | 35.9312 | 308.5375 |
| timm_regnet | 32 | 8.0883 | 13.8651 | nan | 53.6162 | 267.5171 |
| resnet50 | 32 | 3.2783 | 7.13 | nan | 34.6079 | 248.2819 |
| attention_is_all_you_need_pytorch | 256 | 4.1896 | 9.9733 | nan | nan | 230.1729 |
| timm_vovnet | 32 | 2.8535 | 5.8937 | nan | 24.9685 | 195.5114 |
| timm_vision_transformer | 8 | 2.9706 | 6.1813 | nan | 11.1085 | 183.2258 |
| functorch_dp_cifar10 | 64 | 0.7946 | 2.0622 | nan | 5.4202 | 182.6834 |
| resnet18 | 16 | 0.9067 | 2.3716 | nan | 17.9107 | 168.1567 |
| timm_vision_transformer_large | 8 | 22.7519 | 33.4327 | nan | 44.0679 | 167.013 |
| BERT_pytorch | 16 | 4.7978 | 10.3795 | nan | nan | 158.981 |
| LearningToPaint | 96 | 0.954 | 2.385 | nan | 24.3582 | 138.1686 |
| hf_Bart | 4 | 7.1398 | 13.1698 | nan | nan | 127.2181 |
| pytorch_stargan | 16 | 0.7827 | 2.6368 | 9.467 | 4.3549 | 127.126 |
| fastNLP_Bert | 6 | 5.0298 | 9.6183 | nan | nan | 125.7024 |
| hf_GPT2 | 4 | 3.4078 | 7.8094 | nan | nan | 123.6637 |
| Background_Matting | 4 | 3.704 | 7.2381 | nan | 32.3312 | 113.9399 |
| timm_nfnet | 128 | 6.556 | 11.458 | nan | 34.1762 | 101.3946 |
| pytorch_struct | 200 | 0.3956 | 0.9066 | 1.4291 | 4.2106 | 96.611 |
| hf_Albert | 8 | 1.0817 | 5.5058 | nan | nan | 70.3122 |
| hf_Bert | 4 | 5.0448 | 9.3782 | nan | nan | 66.6859 |
| hf_Reformer | 4 | 2.9659 | nan | 13.115 | nan | 65.8278 |
| Super_SloMo | 6 | 2.1407 | 5.7243 | nan | nan | 64.7196 |
| pytorch_unet | 1 | 1.0384 | 2.6731 | nan | 20.1145 | 55.6862 |
| hf_BigBird | 2 | 10.7708 | 16.2465 | nan | nan | 50.6759 |
| hf_DistilBert | 8 | 1.5586 | 3.8873 | nan | nan | 42.2735 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.7449 | 2.4894 | 7.836 | 4.0526 | 25.7952 |
| vgg16 | 64 | 0.2968 | 0.7589 | 2.2805 | 2.5941 | 17.3894 |
| drq | 1 | 0.2544 | 0.5167 | nan | 3.5535 | 14.848 |
| dcgan | 32 | 0.2533 | 0.4967 | 1.1971 | 3.7861 | 13.547 |
| dlrm | 2048 | 0.6134 | 0.9344 | nan | nan | 13.3158 |
| alexnet | 128 | 0.2488 | 0.4892 | 1.1756 | 2.4427 | 11.1009 |
| nvidia_deeprecommender | 256 | 0.2717 | 0.4553 | 0.7334 | 2.4597 | 9.1377 |
| soft_actor_critic | 256 | 0.2537 | 0.3757 | 0.5877 | 1.579 | 7.5755 |
| lennard_jones | 1000 | 0.2162 | 0.3509 | 0.4982 | 1.1228 | 3.7301 |
| tts_angular | 64 | 0.3084 | 0.3561 | 0.481 | 1.0792 | 3.0467 |
| demucs | 4 | 0.808 | 0.8143 | 0.7953 | 0.7936 | 0.6963 |
| yolov3 | 16 | 7.2561 | 12.5914 | nan | 47.2978 | nan |
| hf_GPT2_large | 4 | 20.5619 | 34.5505 | nan | nan | nan |
| speech_transformer | 32 | 7.0946 | 13.3612 | nan | nan | nan |
| hf_T5 | 8 | 3.7283 | 10.4884 | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan |
| mobilenet_v2_quantized_qat | 0 | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan |
| resnet50_quantized_qat | 0 | nan | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| Super_SloMo | 6 | 1.0024 | 0.956 | nan | nan | 1.1855 |
| timm_efficientnet | 32 | 0.9998 | 0.7704 | nan | 0.7845 | 1.0652 |
| timm_nfnet | 128 | 0.9393 | 0.897 | nan | 0.9515 | 1.022 |
| timm_efficientdet | 1 | 1.0142 | 0.8251 | nan | nan | 1.0218 |
| mobilenet_v2 | 96 | 0.9993 | 0.7661 | nan | 0.7676 | 0.9975 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.984 | 0.9884 | 0.9842 |
| hf_GPT2 | 4 | 0.9548 | 0.887 | nan | nan | 0.9505 |
| Background_Matting | 4 | 1.0026 | 0.952 | nan | 0.9773 | 0.9139 |
| pytorch_stargan | 16 | 0.9975 | 1.019 | 0.2027 | 1.0085 | 0.9023 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9173 | 0.2326 | 0.9114 | 0.8941 |
| hf_Albert | 8 | 0.9333 | 0.9333 | nan | nan | 0.8804 |
| pytorch_unet | 1 | 0.9985 | 0.8536 | nan | 0.851 | 0.859 |
| hf_Bart | 4 | 0.9618 | 0.8786 | nan | nan | 0.8533 |
| hf_Bert | 4 | 0.9683 | 0.8952 | nan | nan | 0.8517 |
| timm_regnet | 32 | 1.0013 | 0.8634 | nan | 0.8806 | 0.8481 |
| shufflenet_v2_x1_0 | 128 | 1.0 | 0.9163 | nan | 0.8868 | 0.8447 |
| fastNLP_Bert | 6 | 1.0012 | 0.9152 | nan | nan | 0.8343 |
| attention_is_all_you_need_pytorch | 256 | 0.9481 | 0.9241 | nan | nan | 0.8264 |
| timm_vovnet | 32 | 0.9933 | 0.7644 | nan | 0.7778 | 0.8252 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 |
| hf_T5_large | 2 | 0.922 | 0.8722 | nan | nan | 0.8237 |
| hf_BigBird | 2 | 0.9609 | 0.9609 | nan | nan | 0.8205 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.2781 | 0.9742 | 0.8159 |
| hf_DistilBert | 8 | 0.9212 | 0.9053 | nan | nan | 0.7841 |
| dcgan | 32 | 1.0 | 0.7784 | 0.3321 | 0.7784 | 0.767 |
| alexnet | 128 | 0.9998 | 0.7731 | 0.3805 | 0.7736 | 0.743 |
| mnasnet1_0 | 32 | 0.9988 | 0.9087 | 0.1627 | 0.8348 | 0.7268 |
| timm_vision_transformer_large | 8 | 1.0022 | 0.8433 | nan | 0.8015 | 0.7222 |
| timm_vision_transformer | 8 | 1.0 | 0.8883 | nan | 0.8108 | 0.712 |
| mobilenet_v3_large | 32 | 0.9958 | 0.8655 | nan | 0.8773 | 0.7041 |
| dlrm | 2048 | 0.7282 | 0.7283 | nan | nan | 0.6974 |
| timm_resnest | 32 | 0.9935 | 0.8862 | nan | 0.8075 | 0.6861 |
| resnet50 | 32 | 1.0002 | 0.8763 | nan | 0.8011 | 0.6779 |
| densenet121 | 4 | 1.0 | 0.8812 | nan | 0.8571 | 0.6618 |
| resnext50_32x4d | 8 | 0.9994 | 0.8687 | nan | 0.8223 | 0.6615 |
| vgg16 | 64 | 1.0 | 0.6663 | 0.2532 | 0.6664 | 0.6471 |
| LearningToPaint | 96 | 0.9442 | 0.6896 | nan | 0.6279 | 0.6444 |
| soft_actor_critic | 256 | 0.964 | 0.964 | 0.4356 | 0.9555 | 0.6428 |
| drq | 1 | 0.8541 | 0.8541 | nan | 0.8541 | 0.6427 |
| resnet18 | 16 | 0.9846 | 0.7907 | nan | 0.7038 | 0.6163 |
| lennard_jones | 1000 | 1.0 | 1.0 | 0.3712 | 1.0947 | 0.5646 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4734 | 0.5598 | 0.5598 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 |
| functorch_dp_cifar10 | 64 | 0.9626 | 0.8251 | nan | 0.8254 | 0.4037 |
| hf_Reformer | 4 | 0.3011 | nan | 0.1803 | nan | 0.299 |
| yolov3 | 16 | 1.0072 | 0.8533 | nan | 0.8915 | nan |
| hf_T5 | 8 | 0.9527 | 0.9446 | nan | nan | nan |
| speech_transformer | 32 | 0.9988 | 0.9152 | nan | nan | nan |
| hf_GPT2_large | 4 | 0.936 | 0.8771 | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan |
| mobilenet_v2_quantized_qat | 0 | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan |
| resnet50_quantized_qat | 0 | nan | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
~~~

huggingface suite with float32 precision

Performance speedup ~~~ +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | ElectraForCausalLM | 1 | 1.0455 | 0.9436 | 0.0 | 0.0 | 4.4554 | | MT5ForConditionalGeneration | 2 | 1.0242 | 0.9142 | 0.0 | 0.0 | 4.3332 | | YituTechConvBert | 1 | 1.0303 | 0.9285 | 0.0 | 0.0 | 3.1989 | | MegatronBertForCausalLM | 2 | 1.0434 | 0.9421 | 0.0 | 0.0 | 2.8533 | | MobileBertForMaskedLM | 16 | 1.0189 | 0.8925 | 0.0 | 0.0 | 2.7208 | | RobertaForCausalLM | 4 | 1.0444 | 0.9303 | 0.0 | 0.0 | 2.6843 | | M2M100ForConditionalGeneration | 2 | 1.0415 | 0.8586 | 0.0 | 0.0 | 2.5929 | | OPTForCausalLM | 4 | 1.0128 | 0.899 | 0.0 | 0.0 | 2.5804 | | XGLMForCausalLM | 1 | 1.0142 | 0.8666 | 0.0 | 0.0 | 2.4853 | | MobileBertForQuestionAnswering | 32 | 1.0192 | 0.9104 | 0.0 | 0.0 | 2.4298 | | CamemBert | 1 | 1.0496 | 0.947 | 0.0 | 0.0 | 2.2877 | | DistillGPT2 | 1 | 1.0314 | 0.9338 | 0.0 | 0.0 | 2.1603 | | GoogleFnet | 1 | 1.0024 | 0.8114 | 0.0 | 1.1191 | 2.1218 | | PegasusForConditionalGeneration | 4 | 1.0119 | 0.8932 | 0.0 | 0.0 | 2.036 | | PLBartForConditionalGeneration | 8 | 1.02 | 0.8992 | 0.0 | 0.0 | 1.7185 | | GPT2ForSequenceClassification | 4 | 0.9989 | 0.9774 | 0.0 | 0.0 | 1.6752 | | MegatronBertForQuestionAnswering | 8 | 1.044 | 0.9294 | 0.0 | 0.0 | 1.5913 | | MBartForConditionalGeneration | 8 | 1.0148 | 0.907 | 0.0 | 0.0 | 1.4775 | | XLNetLMHeadModel | 4 | 1.001 | 0.9642 | 0.0 | 0.0 | 1.4305 | | ElectraForQuestionAnswering | 64 | 0.9991 | 0.9848 | 0.0 | 0.0 | 1.3727 | | T5ForConditionalGeneration | 4 | 0.9985 | 0.9576 | 0.0 | 0.0 | 1.3585 | | AlbertForQuestionAnswering | 2 | 1.0007 | 1.0019 | 0.0 | 0.0 | 1.306 | | AlbertForMaskedLM | 2 | 1.0006 | 1.0012 | 0.0 | 0.0 | 1.3027 | | LayoutLMForSequenceClassification | 16 | 0.9991 | 0.9881 | 0.0 | 
0.0 | 1.2595 | | DebertaForQuestionAnswering | 4 | 0.9311 | 0.7532 | 0.8024 | 0.0 | 1.2423 | | TrOCRForCausalLM | 8 | 1.0121 | 0.9476 | 0.0 | 0.0 | 1.2385 | | BlenderbotSmallForConditionalGeneration | 32 | 1.0111 | 0.9334 | 0.0 | 0.0 | 1.23 | | T5Small | 1 | 1.0232 | 0.9484 | 0.0 | 0.0 | 1.2282 | | BartForConditionalGeneration | 1 | 1.0119 | 0.9907 | 0.0 | 0.0 | 1.2215 | | Speech2Text2ForCausalLM | 64 | 0.9987 | 0.9377 | 0.0 | 0.0 | 1.2182 | | DistilBertForQuestionAnswering | 32 | 1.0257 | 0.9798 | 0.0 | 0.0 | 1.1967 | | PegasusForCausalLM | 8 | 1.0098 | 0.9249 | 0.0 | 0.0 | 1.1947 | | DistilBertForMaskedLM | 16 | 1.0274 | 0.9737 | 0.0 | 0.0 | 1.1673 | | LayoutLMForMaskedLM | 16 | 0.9992 | 0.9693 | 0.0 | 0.0 | 1.1651 | | BartForCausalLM | 2 | 0.9993 | 0.9663 | 0.0 | 0.0 | 1.1046 | | PLBartForCausalLM | 16 | 1.0048 | 0.9436 | 0.0 | 0.0 | 1.1015 | | RobertaForQuestionAnswering | 64 | 0.999 | 0.9832 | 0.0 | 0.0 | 1.0977 | | BertForQuestionAnswering | 64 | 0.9989 | 0.9827 | 0.0 | 0.0 | 1.0964 | | BigBird | 1 | 0.9913 | 0.9393 | 0.0 | 0.0 | 1.0871 | | DebertaForMaskedLM | 4 | 0.9348 | 0.8095 | 0.7234 | 0.0 | 1.0837 | | MBartForCausalLM | 16 | 1.0067 | 0.9635 | 0.0 | 0.0 | 1.053 | | BertForMaskedLM | 64 | 0.9992 | 0.9608 | 0.0 | 0.0 | 1.042 | | BlenderbotSmallForCausalLM | 64 | 1.0011 | 0.909 | 0.0 | 0.0 | 1.0115 | | AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+ | GoogleFnet | 1 | pass | pass | fail_to_run | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | 
AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BigBird | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | CamemBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DistilBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DistillGPT2 | 1 | pass | pass | fail_to_run | fail_to_run | pass | | ElectraForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | ElectraForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | | LayoutLMForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MegatronBertForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run 
| fail_to_run | pass | | OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | PLBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | PegasusForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | PegasusForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | RobertaForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | RobertaForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | | TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | | YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+----+----------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+----------+-----------+----------------+-------------+----------+ | XLNetLMHeadModel | 4 | 17.5197 | 35.0716 | nan | nan | 305.9464 | | MobileBertForMaskedLM | 16 | 135.2468 | 157.6587 | nan | nan | 274.7763 
| | MobileBertForQuestionAnswering | 32 | 129.6905 | 153.7345 | nan | nan | 255.9839 | | M2M100ForConditionalGeneration | 2 | 25.8249 | 38.1287 | nan | nan | 214.2779 | | MT5ForConditionalGeneration | 2 | 6.3337 | 16.2124 | nan | nan | 187.0089 | | YituTechConvBert | 1 | 8.8852 | 16.1528 | nan | nan | 178.8949 | | T5ForConditionalGeneration | 4 | 3.6747 | 10.5429 | nan | nan | 166.5406 | | XGLMForCausalLM | 1 | 15.0955 | 24.4969 | nan | nan | 161.4696 | | MBartForConditionalGeneration | 8 | 26.2231 | 38.8571 | nan | nan | 159.4234 | | BartForConditionalGeneration | 1 | 25.4253 | 38.3882 | nan | nan | 158.0863 | | PegasusForConditionalGeneration | 4 | 25.7842 | 37.4911 | nan | nan | 155.1584 | | MegatronBertForCausalLM | 2 | 15.9751 | 25.4865 | nan | nan | 138.1979 | | DebertaForMaskedLM | 4 | 6.9867 | 13.1786 | 49.9544 | nan | 137.1263 | | MegatronBertForQuestionAnswering | 8 | 16.159 | 25.568 | nan | nan | 134.4903 | | T5Small | 1 | 3.6926 | 10.3463 | nan | nan | 133.6301 | | PLBartForConditionalGeneration | 8 | 7.2968 | 13.2318 | nan | nan | 131.7485 | | BlenderbotSmallForConditionalGeneration | 32 | 11.8403 | 19.7241 | nan | nan | 118.8435 | | DebertaForQuestionAnswering | 4 | 6.9042 | 13.4129 | 50.1618 | nan | 107.7468 | | RobertaForCausalLM | 4 | 4.9535 | 9.7845 | nan | nan | 100.682 | | LayoutLMForSequenceClassification | 16 | 5.1283 | 9.9782 | nan | nan | 92.4368 | | PegasusForCausalLM | 8 | 9.7938 | 14.2359 | nan | nan | 86.443 | | GPT2ForSequenceClassification | 4 | 3.5369 | 7.9757 | nan | nan | 79.3826 | | OPTForCausalLM | 4 | 4.6514 | 9.1108 | nan | nan | 77.8106 | | MBartForCausalLM | 16 | 9.8837 | 14.3594 | nan | nan | 77.6976 | | ElectraForQuestionAnswering | 64 | 4.8714 | 9.4723 | nan | nan | 77.1258 | | BertForMaskedLM | 64 | 4.9013 | 9.4034 | nan | nan | 75.8649 | | BartForCausalLM | 2 | 9.7793 | 13.9953 | nan | nan | 74.4351 | | LayoutLMForMaskedLM | 16 | 5.2202 | 9.9984 | nan | nan | 74.0493 | | TrOCRForCausalLM | 8 | 9.6551 | 14.0138 | nan | nan 
| 67.6627 | | RobertaForQuestionAnswering | 64 | 4.8562 | 9.3848 | nan | nan | 62.7951 | | ElectraForCausalLM | 1 | 5.0751 | 9.6382 | nan | nan | 62.4936 | | DistillGPT2 | 1 | 1.4093 | 3.6037 | nan | nan | 62.393 | | PLBartForCausalLM | 16 | 3.0927 | 5.4299 | nan | nan | 59.3231 | | DistilBertForQuestionAnswering | 32 | 1.6682 | 3.9686 | nan | nan | 59.0738 | | CamemBert | 1 | 4.9241 | 9.44 | nan | nan | 58.7884 | | BertForQuestionAnswering | 64 | 4.9153 | 9.4155 | nan | nan | 58.1473 | | BlenderbotSmallForCausalLM | 64 | 4.6979 | 7.7801 | nan | nan | 57.8001 | | Speech2Text2ForCausalLM | 64 | 3.0594 | 5.3309 | nan | nan | 57.62 | | AlbertForMaskedLM | 2 | 1.1999 | 5.7374 | nan | nan | 56.499 | | BigBird | 1 | 10.7378 | 16.4461 | nan | nan | 53.5552 | | DistilBertForMaskedLM | 16 | 1.7303 | 3.942 | nan | nan | 45.7724 | | AlbertForQuestionAnswering | 2 | 1.1873 | 5.6661 | nan | nan | 38.8054 | | GoogleFnet | 1 | 2.0873 | 4.2615 | nan | 10.5603 | 35.0482 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | +-----------------------------------------+----+----------+-----------+----------------+-------------+----------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | GPT2ForSequenceClassification | 4 | 0.9342 | 0.9091 | nan | nan | 1.0318 | | XLNetLMHeadModel | 4 | 1.0001 | 0.8976 | nan | nan | 0.9717 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | 0.9361 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | nan | nan | 0.9339 | | BertForQuestionAnswering | 64 | 1.0 | 0.9467 | nan | nan | 0.9145 | | RobertaForQuestionAnswering | 64 | 1.0 | 0.9467 | nan | nan | 0.9145 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | 0.888 | | T5Small | 1 | 
1.0 | 0.9325 | nan | nan | 0.8445 | | DistilBertForQuestionAnswering | 32 | 1.0 | 0.9046 | nan | nan | 0.8394 | | BertForMaskedLM | 64 | 1.0 | 0.9219 | nan | nan | 0.8321 | | BartForCausalLM | 2 | 1.0 | 0.8847 | nan | nan | 0.8303 | | BigBird | 1 | 1.0001 | 0.9549 | nan | nan | 0.8224 | | DistilBertForMaskedLM | 16 | 0.9998 | 0.9138 | nan | nan | 0.8055 | | PLBartForCausalLM | 16 | 0.9997 | 0.8802 | nan | nan | 0.8028 | | MBartForCausalLM | 16 | 1.0 | 0.8629 | nan | nan | 0.8005 | | DistillGPT2 | 1 | 1.0003 | 0.7721 | nan | nan | 0.7997 | | Speech2Text2ForCausalLM | 64 | 1.0 | 0.88 | nan | nan | 0.7768 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.7754 | | XGLMForCausalLM | 1 | 0.9999 | 0.9999 | nan | nan | 0.7728 | | BartForConditionalGeneration | 1 | 1.0 | 0.8465 | nan | nan | 0.7708 | | BlenderbotSmallForConditionalGeneration | 32 | 1.0 | 0.9036 | nan | nan | 0.7612 | | PLBartForConditionalGeneration | 8 | 0.9997 | 0.8222 | nan | nan | 0.7547 | | CamemBert | 1 | 0.998 | 0.7977 | nan | nan | 0.7369 | | YituTechConvBert | 1 | 0.9858 | 0.7923 | nan | nan | 0.7298 | | TrOCRForCausalLM | 8 | 1.0 | 0.8048 | nan | nan | 0.7284 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | nan | nan | 0.7277 | | MBartForConditionalGeneration | 8 | 1.0 | 0.8137 | nan | nan | 0.727 | | OPTForCausalLM | 4 | 0.9979 | 0.75 | nan | nan | 0.714 | | RobertaForCausalLM | 4 | 0.9058 | 0.7778 | nan | nan | 0.7099 | | PegasusForCausalLM | 8 | 1.0 | 0.9323 | nan | nan | 0.7012 | | MegatronBertForQuestionAnswering | 8 | 0.923 | 0.8265 | nan | nan | 0.6997 | | GoogleFnet | 1 | 1.0003 | 0.9447 | nan | 1.0813 | 0.6953 | | M2M100ForConditionalGeneration | 2 | 0.9797 | 0.9795 | nan | nan | 0.669 | | MegatronBertForCausalLM | 2 | 0.7066 | 0.7066 | nan | nan | 0.6453 | | PegasusForConditionalGeneration | 4 | 0.9721 | 0.9004 | nan | nan | 0.642 | | MT5ForConditionalGeneration | 2 | 0.6173 | 0.6173 | nan | nan | 0.6173 | | AlbertForQuestionAnswering | 2 | 1.0 | 0.9369 | nan | nan | 
0.6126 | | ElectraForCausalLM | 1 | 1.0 | 0.9107 | nan | nan | 0.6123 | | AlbertForMaskedLM | 2 | 0.9999 | 0.9172 | nan | nan | 0.6027 | | MobileBertForMaskedLM | 16 | 0.9997 | 0.9179 | nan | nan | 0.5861 | | MobileBertForQuestionAnswering | 32 | 1.0 | 0.9716 | nan | nan | 0.4668 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.352 | nan | 0.4265 | | DebertaForQuestionAnswering | 4 | 0.9845 | 1.0525 | 0.3277 | nan | 0.3569 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ ~~~

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | res2net50_14w_8s | 2 | 0.9996 | 1.0234 | 0.0 | 1.4428 | 4.931 | | hrnet_w18 | 2 | 1.0055 | 1.0607 | 0.0 | 1.445 | 4.5409 | | res2next50 | 2 | 1.0047 | 1.0349 | 0.0 | 1.371 | 4.3401 | | coat_lite_mini | 128 | 0.9998 | 0.9998 | 0.0 | 1.075 | 1.7154 | | ghostnet_100 | 128 | 0.9985 | 0.9942 | 0.0 | 1.2476 | 1.6158 | | tnt_s_patch16_224 | 64 | 0.9998 | 0.9984 | 0.0 | 1.5639 | 1.4959 | | dm_nfnet_f0 | 128 | 0.9997 | 1.0002 | 0.0 | 1.2118 | 1.4722 | | xcit_large_24_p8_224 | 5 | 1.0 | 0.9889 | 0.0 | 0.0 | 1.4539 | | twins_pcpvt_base | 32 | 1.0032 | 0.9692 | 0.0 | 1.347 | 1.439 | | volo_d1_224 | 64 | 0.9993 | 0.9943 | 0.0 | 1.1388 | 1.3985 | | crossvit_9_240 | 64 | 1.0059 | 0.9954 | 0.0 | 1.1391 | 1.3976 | | nfnet_l0 | 64 | 0.9978 | 0.7971 | 0.0 | 1.0534 | 1.3847 | | gmixer_24_224 | 64 | 0.9993 | 0.8424 | 0.0 | 0.9925 | 1.3623 | | jx_nest_base | 32 | 0.999 | 0.9938 | 0.0 | 1.227 | 1.297 | | lcnet_050 | 128 | 0.9564 | 0.9487 | 0.0 | 1.4997 | 1.2591 | | convit_base | 32 | 0.9991 | 0.9953 | 0.0 | 1.1951 | 1.2525 | | pit_b_224 | 64 | 0.9998 | 0.9985 | 0.0 | 1.0608 | 1.2184 | | cait_m36_384 | 2 | 0.998 | 0.8939 | 0.0 | 1.1053 | 1.1998 | | convnext_base | 32 | 0.9992 | 0.9973 | 0.0 | 1.0444 | 1.1774 | | swin_base_patch4_window7_224 | 64 | 0.9995 | 0.9727 | 0.0 | 1.0031 | 1.1644 | | gmlp_s16_224 | 64 | 0.999 | 0.9963 | 0.0 | 0.9996 | 1.1519 | | inception_v3 | 128 | 0.9997 | 0.9979 | 0.0 | 1.1252 | 1.1406 | | beit_base_patch16_224 | 64 | 0.9997 | 0.9814 | 0.0 | 0.9546 | 1.1257 | | adv_inception_v3 | 128 | 1.0001 | 0.9962 | 0.0 | 1.1254 | 1.1106 | | deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9992 | 0.0 | 1.0189 | 1.1089 | | gluon_inception_v3 | 128 | 0.9999 | 0.9975 | 
0.0 | 1.1257 | 1.1079 | | vit_base_patch16_224 | 64 | 0.9998 | 0.9991 | 0.0 | 0.9792 | 1.1004 | | poolformer_m36 | 64 | 0.9994 | 0.9991 | 0.0 | 1.007 | 1.083 | | regnety_002 | 128 | 0.9801 | 0.9915 | 0.0 | 1.3543 | 1.0721 | | mixer_b16_224 | 64 | 0.9997 | 0.9974 | 0.0 | 0.9859 | 1.0501 | | mixnet_l | 64 | 0.9707 | 0.8722 | 0.0 | 1.0056 | 1.0437 | | resmlp_12_224 | 128 | 0.9996 | 0.9993 | 0.6954 | 0.0 | 1.0218 | | pnasnet5large | 16 | 0.9991 | 0.997 | 0.0 | 1.0824 | 1.0157 | | dla102 | 64 | 0.9992 | 0.9957 | 0.0 | 1.2867 | 1.0124 | | tf_mixnet_l | 64 | 0.9718 | 0.8744 | 0.0 | 1.0055 | 1.0077 | | repvgg_a2 | 128 | 0.9637 | 0.9627 | 0.0 | 1.1205 | 1.0054 | | resnest101e | 32 | 1.0021 | 1.0145 | 0.0 | 1.2074 | 0.9828 | | dpn107 | 32 | 0.9586 | 0.9507 | 0.0 | 1.0282 | 0.9726 | | convmixer_768_32 | 32 | 0.9998 | 0.9998 | 0.0 | 1.0604 | 0.9278 | | sebotnet33ts_256 | 64 | 0.9758 | 0.8075 | 0.0 | 1.0532 | 0.9189 | | visformer_small | 128 | 1.0 | 1.0021 | 0.0 | 1.0216 | 0.901 | | gernet_l | 128 | 0.9734 | 0.9723 | 0.0 | 1.0963 | 0.8829 | | cspdarknet53 | 64 | 0.9585 | 0.951 | 0.0 | 1.184 | 0.8741 | | fbnetv3_b | 128 | 0.9649 | 0.9608 | 0.0 | 1.1338 | 0.8738 | | selecsls42b | 128 | 0.9995 | 0.9983 | 0.0 | 1.2054 | 0.8712 | | mobilenetv2_100 | 128 | 0.9664 | 0.963 | 0.0 | 1.0139 | 0.8526 | | rexnet_100 | 128 | 0.9731 | 0.8155 | 0.0 | 0.9836 | 0.8392 | | mnasnet_100 | 128 | 0.9666 | 0.962 | 0.0 | 1.1567 | 0.8375 | | tinynet_a | 128 | 0.966 | 0.7754 | 0.0 | 0.9707 | 0.8263 | | mobilevit_s | 32 | 0.9762 | 0.7637 | 0.0 | 0.9654 | 0.8201 | | mobilenetv3_large_100 | 128 | 0.9652 | 0.9625 | 0.0 | 1.1648 | 0.8065 | | res2net101_26w_4s | 64 | 0.9989 | 0.9976 | 0.0 | 1.1769 | 0.7997 | | spnasnet_100 | 128 | 0.9612 | 0.9562 | 0.0 | 1.138 | 0.7958 | | fbnetc_100 | 128 | 0.9661 | 0.9627 | 0.0 | 1.1865 | 0.7788 | | swsl_resnext101_32x16d | 32 | 0.9994 | 0.9997 | 0.0 | 1.1076 | 0.7656 | | ese_vovnet19b_dw | 128 | 0.9789 | 0.9774 | 0.0 | 1.1452 | 0.7575 | | eca_halonext26ts | 64 | 0.9743 | 
0.7768 | 0.0 | 1.0161 | 0.7388 | | tf_efficientnet_b0 | 128 | 0.9767 | 0.7832 | 0.0 | 0.985 | 0.7312 | | gluon_xception65 | 32 | 0.9994 | 0.9964 | 0.0 | 1.0393 | 0.6983 | | eca_botnext26ts_256 | 64 | 0.9738 | 0.77 | 0.0 | 1.018 | 0.689 | | botnet26t_256 | 128 | 0.9853 | 0.9848 | 0.0 | 1.2257 | 0.6747 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+---------------+---------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+ | convnext_base | 2 | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | | adv_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | botnet26t_256 | 2 | pass | pass | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | fail_to_run | pass | pass | | crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass | | cspdarknet53 | 2 | pass | pass | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | dla102 | 2 | pass | pass | fail_to_run | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | fail_to_run | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | fail_to_run | pass | pass | | eca_halonext26ts | 2 | pass | pass | fail_to_run | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | fail_to_run | pass | pass | | gernet_l | 2 | 
pass | pass | fail_to_run | pass | pass | | ghostnet_100 | 2 | pass | pass | fail_to_run | pass | pass | | gluon_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass | | hrnet_w18 | 2 | pass | pass | fail_to_run | pass | pass | | inception_v3 | 2 | pass | pass | fail_to_run | pass | pass | | lcnet_050 | 2 | pass | pass | fail_to_run | pass | pass | | mixnet_l | 2 | pass | pass | fail_to_run | pass | pass | | mobilenetv2_100 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | fail_to_run | pass | pass | | nfnet_l0 | 2 | pass | pass | fail_to_run | pass | pass | | pnasnet5large | 2 | pass | pass | fail_to_run | pass | pass | | regnety_002 | 2 | pass | pass | fail_to_run | pass | pass | | res2net101_26w_4s | 2 | pass | pass | fail_to_run | pass | pass | | res2net50_14w_8s | 2 | pass | pass | fail_to_run | pass | pass | | res2next50 | 2 | pass | pass | fail_to_run | pass | pass | | rexnet_100 | 2 | pass | pass | fail_to_run | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | fail_to_run | pass | pass | | selecsls42b | 2 | pass | pass | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | fail_to_run | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | fail_to_run | pass | pass | | tf_mixnet_l | 2 | pass | pass | fail_to_run | pass | pass | | tinynet_a | 2 | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | visformer_small | 2 | pass | pass | fail_to_run | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | | xcit_large_24_p8_224 | 2 | pass | fail_accuracy | fail_to_run | 
fail_to_run | pass | | gluon_xception65 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | | poolformer_m36 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | jx_nest_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | pit_b_224 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | twins_pcpvt_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | fail_accuracy | | fbnetv3_b | 2 | pass | pass | fail_to_run | pass | fail_accuracy | | resnest101e | 2 | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+---------+-----------+----------------+-------------+-----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+-----+---------+-----------+----------------+-------------+-----------+ | hrnet_w18 | 2 | 98.6712 | 129.1724 | nan | 304.0537 | 1285.4743 | | dpn107 | 32 | 13.3754 | 24.29 | nan | 87.4317 | 1191.6382 | | pnasnet5large | 16 | 58.9642 | 81.7977 | nan | 186.1947 | 1076.8827 | | rexnet_100 | 128 | 6.2383 | 11.638 | nan | 106.3987 | 888.1642 | | res2net50_14w_8s | 2 | 19.6937 | 32.6428 | nan | 87.1574 | 876.3242 | | eca_botnext26ts_256 | 64 | 2.4388 | 6.0508 | nan | 50.3269 | 774.3202 | | mobilevit_s | 32 | 5.8551 | 10.7127 | nan | 45.7644 | 725.689 | | mixnet_l | 64 | 13.541 | 19.9113 | nan | 71.4098 | 702.8912 | | ghostnet_100 | 128 | 8.7886 | 15.5764 | nan | 65.8406 | 629.9192 | | tinynet_a | 128 | 7.4538 | 12.9979 | nan | 67.1415 | 600.5069 | | twins_pcpvt_base | 32 | 25.4752 | 36.4329 | nan | 68.4858 | 573.9034 | | fbnetc_100 | 
128 | 5.4489 | 10.0964 | nan | 49.0758 | 573.6376 | | fbnetv3_b | 128 | 13.1373 | 19.9338 | nan | 85.3399 | 548.3581 | | resnest101e | 32 | 26.1724 | 39.6525 | nan | 99.9511 | 546.489 | | coat_lite_mini | 128 | 2.9903 | 6.7801 | nan | 16.3836 | 537.8709 | | swin_base_patch4_window7_224 | 64 | 12.1773 | 22.2394 | nan | 67.7967 | 477.8056 | | dla102 | 64 | 10.5032 | 18.7199 | nan | 71.5497 | 474.3166 | | res2next50 | 2 | 7.2348 | 14.1651 | nan | 47.9668 | 459.3931 | | tf_mixnet_l | 64 | 13.5475 | 20.765 | nan | 70.0162 | 428.7479 | | sebotnet33ts_256 | 64 | 3.7255 | 8.1592 | nan | 53.8094 | 428.4917 | | cspdarknet53 | 64 | 6.0545 | 10.9249 | nan | 52.1622 | 427.7586 | | botnet26t_256 | 128 | 2.2674 | 5.3433 | nan | 42.1672 | 410.3641 | | eca_halonext26ts | 64 | 2.5674 | 6.2787 | nan | 52.114 | 395.3512 | | mnasnet_100 | 128 | 3.9657 | 7.7173 | nan | 40.1768 | 388.1006 | | res2net101_26w_4s | 64 | 24.9464 | 39.8152 | nan | 104.8715 | 378.42 | | mobilenetv2_100 | 128 | 4.0834 | 7.496 | nan | 40.6293 | 374.3737 | | tf_efficientnet_b0 | 128 | 5.6664 | 10.385 | nan | 65.9432 | 358.0571 | | adv_inception_v3 | 128 | 8.1852 | 15.9016 | nan | 74.9539 | 356.0919 | | ese_vovnet19b_dw | 128 | 1.8982 | 3.9436 | nan | 32.5785 | 328.4316 | | convnext_base | 32 | 11.4923 | 16.096 | nan | 31.6831 | 316.8313 | | xcit_large_24_p8_224 | 5 | 37.1187 | 51.4882 | nan | nan | 312.4874 | | regnety_002 | 128 | 4.8749 | 8.8489 | nan | 49.9733 | 305.9002 | | mobilenetv3_large_100 | 128 | 4.355 | 7.9916 | nan | 67.5523 | 277.8999 | | gluon_xception65 | 32 | 15.6406 | 24.5945 | nan | 55.6431 | 276.5347 | | cait_m36_384 | 2 | 46.5219 | 63.937 | nan | 91.4589 | 274.0069 | | visformer_small | 128 | 2.2571 | 5.5074 | nan | 25.7864 | 272.1672 | | jx_nest_base | 32 | 9.6565 | 16.9306 | nan | 66.5693 | 265.8097 | | crossvit_9_240 | 64 | 7.3287 | 13.4188 | nan | 32.8212 | 244.1634 | | gernet_l | 128 | 4.6462 | 9.0919 | nan | 39.2841 | 219.5804 | | selecsls42b | 128 | 2.3756 | 5.4282 | nan | 41.0006 | 
215.9729 | | poolformer_m36 | 64 | 13.0446 | 19.5598 | nan | 34.3082 | 213.2786 | | lcnet_050 | 128 | 1.9507 | 3.999 | nan | 31.8533 | 204.466 | | spnasnet_100 | 128 | 5.4029 | 10.0674 | nan | 47.1409 | 203.189 | | swsl_resnext101_32x16d | 32 | 10.026 | 18.4376 | nan | 48.6326 | 183.7595 | | gluon_inception_v3 | 128 | 8.2025 | 15.4166 | nan | 75.2205 | 175.9461 | | convit_base | 32 | 3.9184 | 8.4475 | nan | 20.8037 | 167.9331 | | inception_v3 | 128 | 8.39 | 15.5865 | nan | 74.8391 | 166.4901 | | volo_d1_224 | 64 | 6.7498 | 12.7145 | nan | 32.6013 | 159.846 | | gmlp_s16_224 | 64 | 9.095 | 13.8868 | nan | 20.8052 | 139.0742 | | pit_b_224 | 64 | 3.6357 | 7.2231 | nan | 15.0088 | 135.8108 | | tnt_s_patch16_224 | 64 | 11.9392 | 20.1264 | nan | 34.0659 | 128.6679 | | gmixer_24_224 | 64 | 8.3774 | 13.6336 | nan | 23.8974 | 121.5108 | | repvgg_a2 | 128 | 4.5668 | 8.8221 | nan | 47.0965 | 109.2737 | | resmlp_12_224 | 128 | 2.7829 | 4.8097 | 7.4167 | nan | 95.8777 | | nfnet_l0 | 64 | 5.7948 | 11.1862 | nan | 31.0786 | 88.0471 | | dm_nfnet_f0 | 128 | 6.4796 | 11.5027 | nan | 34.4834 | 85.9462 | | mixer_b16_224 | 64 | 2.6586 | 5.0283 | nan | 12.7966 | 84.8839 | | convmixer_768_32 | 32 | 6.9275 | 11.7534 | nan | 19.4368 | 78.6739 | | beit_base_patch16_224 | 64 | 4.4367 | 9.0898 | nan | 17.357 | 67.4849 | | deit_base_distilled_patch16_224 | 64 | 3.0194 | 6.2321 | nan | 12.536 | 67.4842 | | vit_base_patch16_224 | 64 | 2.8662 | 6.0204 | nan | 11.4566 | 57.186 | +---------------------------------+-----+---------+-----------+----------------+-------------+-----------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | gmixer_24_224 | 64 | 0.9992 | 0.9684 | nan | 0.9825 | 1.3808 | | nfnet_l0 | 64 | 1.0008 | 0.8298 
| nan | 0.813 | 1.2555 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.2062 | 1.1783 | | tinynet_a | 128 | 1.0 | 0.7831 | nan | 0.7845 | 1.1735 | | rexnet_100 | 128 | 0.9992 | 0.7879 | nan | 0.871 | 1.1072 | | convit_base | 32 | 1.0001 | 0.8879 | nan | 0.9506 | 1.068 | | dm_nfnet_f0 | 128 | 0.9393 | 0.897 | nan | 0.9515 | 1.022 | | mobilenetv2_100 | 128 | 0.9998 | 0.7664 | nan | 0.7679 | 1.0051 | | dla102 | 64 | 0.9881 | 0.9181 | nan | 0.9541 | 1.0011 | | mobilevit_s | 32 | 0.9999 | 0.7692 | nan | 0.7431 | 1.0011 | | poolformer_m36 | 64 | 1.0003 | 0.9533 | nan | 0.9368 | 0.9734 | | eca_halonext26ts | 64 | 0.9938 | 0.7717 | nan | 0.7731 | 0.9711 | | eca_botnext26ts_256 | 64 | 1.0 | 0.7705 | nan | 0.7679 | 0.9703 | | tf_mixnet_l | 64 | 1.0001 | 0.861 | nan | 0.8605 | 0.9698 | | convmixer_768_32 | 32 | 1.0 | 0.9868 | nan | 0.9807 | 0.9656 | | cait_m36_384 | 2 | 1.0001 | 0.9024 | nan | 0.9202 | 0.9451 | | tf_efficientnet_b0 | 128 | 0.9998 | 0.7727 | nan | 0.8426 | 0.9413 | | mixer_b16_224 | 64 | 0.9956 | 0.9615 | nan | 0.8644 | 0.9357 | | beit_base_patch16_224 | 64 | 1.0 | 0.9575 | nan | 0.8606 | 0.9272 | | gmlp_s16_224 | 64 | 1.0 | 0.9766 | nan | 0.966 | 0.9267 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9469 | nan | 0.8229 | 0.915 | | tnt_s_patch16_224 | 64 | 1.0001 | 0.9752 | nan | 0.8518 | 0.9131 | | volo_d1_224 | 64 | 0.9999 | 0.9247 | nan | 0.7472 | 0.9124 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9476 | nan | 0.8242 | 0.9095 | | spnasnet_100 | 128 | 1.0005 | 0.9207 | nan | 0.8496 | 0.9024 | | selecsls42b | 128 | 0.9883 | 0.8982 | nan | 0.9039 | 0.8999 | | mixnet_l | 64 | 0.9995 | 0.8486 | nan | 0.7938 | 0.8993 | | mobilenetv3_large_100 | 128 | 1.0002 | 0.8686 | nan | 0.8819 | 0.8982 | | adv_inception_v3 | 128 | 1.0002 | 0.8694 | nan | 0.88 | 0.8977 | | gluon_inception_v3 | 128 | 1.0002 | 0.8694 | nan | 0.88 | 0.8977 | | inception_v3 | 128 | 1.0002 | 0.8694 | nan | 0.88 | 0.8977 | | xcit_large_24_p8_224 | 5 | 0.9999 | 0.9206 | nan | nan | 0.8952 | | 
resnest101e | 32 | 1.0 | 0.9458 | nan | 0.9449 | 0.8922 | | ghostnet_100 | 128 | 0.9998 | 0.8872 | nan | 0.947 | 0.8889 | | visformer_small | 128 | 0.9943 | 0.9442 | nan | 0.9475 | 0.8883 | | convnext_base | 32 | 1.0001 | 0.9077 | nan | 0.7678 | 0.8853 | | fbnetv3_b | 128 | 0.9995 | 0.7866 | nan | 0.7861 | 0.8837 | | gluon_xception65 | 32 | 0.9999 | 0.9384 | nan | 0.9001 | 0.8834 | | dpn107 | 32 | 0.9997 | 0.9285 | nan | 0.8949 | 0.8762 | | twins_pcpvt_base | 32 | 1.0002 | 0.9127 | nan | 0.8351 | 0.8723 | | cspdarknet53 | 64 | 1.0 | 0.8562 | nan | 0.8797 | 0.8624 | | swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9309 | nan | 0.83 | 0.8586 | | jx_nest_base | 32 | 1.0017 | 0.898 | nan | 0.7112 | 0.8574 | | ese_vovnet19b_dw | 128 | 0.9999 | 0.8938 | nan | 0.9369 | 0.8467 | | sebotnet33ts_256 | 64 | 1.0 | 0.7109 | nan | 0.6852 | 0.841 | | swsl_resnext101_32x16d | 32 | 1.0003 | 0.8983 | nan | 0.8684 | 0.8402 | | resmlp_12_224 | 128 | 0.9893 | 0.9525 | 0.2479 | nan | 0.8169 | | res2net101_26w_4s | 64 | 1.0001 | 0.9307 | nan | 0.8959 | 0.8167 | | crossvit_9_240 | 64 | 1.0001 | 0.8721 | nan | 0.729 | 0.8108 | | mnasnet_100 | 128 | 1.0003 | 0.9126 | nan | 0.8368 | 0.7984 | | pit_b_224 | 64 | 0.9992 | 0.7962 | nan | 0.6417 | 0.7921 | | coat_lite_mini | 128 | 1.0049 | 0.8826 | nan | 0.7873 | 0.79 | | lcnet_050 | 128 | 1.0005 | 0.7721 | nan | 0.7722 | 0.7579 | | regnety_002 | 128 | 0.9981 | 0.829 | nan | 0.7759 | 0.7465 | | gernet_l | 128 | 1.0 | 0.7965 | nan | 0.8012 | 0.727 | | botnet26t_256 | 128 | 1.0 | 0.8494 | nan | 0.7497 | 0.7254 | | fbnetc_100 | 128 | 0.9998 | 0.8597 | nan | 0.7507 | 0.7246 | | hrnet_w18 | 2 | 0.9986 | 0.8792 | nan | 0.8869 | 0.6089 | | res2next50 | 2 | 1.0 | 0.8353 | nan | 0.8404 | 0.6063 | | res2net50_14w_8s | 2 | 1.0 | 0.8387 | nan | 0.8474 | 0.5877 | | repvgg_a2 | 128 | 1.0003 | 0.8145 | nan | 0.6633 | 0.536 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ ~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/ZS8ageP.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/dtxLb5l.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/9GXAJkl.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.
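The accuracy check described in the summary can be sketched roughly as follows. This is a minimal illustration, not the benchmark harness itself: the function names, the flat-list representation of outputs and gradients, and the tolerance values are all assumptions made for the example. A real harness would compare tensors rather than Python lists.

```python
import math


def values_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-3):
    """Return True when two flat lists of floats agree within tolerance.

    The tolerances are hypothetical; they are not taken from the dashboard.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def accuracy_status(ref_outputs, ref_grads, outputs, grads):
    """Compare a backend's forward outputs and gradients with the reference run."""
    ok = values_match(ref_outputs, outputs) and values_match(ref_grads, grads)
    return "pass" if ok else "fail_accuracy"
```

Note that a NaN in either list never compares as close, so a backend producing NaNs would be reported as `fail_accuracy` under this sketch.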

When computing speedup, compilation latency, and memory footprint reduction, we exclude the models that fail accuracy checks.

Passrate

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      | 92%, 49/53 | 98%, 42/43  | 100%, 61/61 |
|   aot_eager    | 94%, 50/53 | 98%, 42/43  | 90%, 55/61  |
| aot_cudagraphs | 26%, 14/53 |  0%, 0/43   |  11%, 7/61  |
|  aot_nvfuser   | 60%, 32/53 |  0%, 0/43   | 75%, 46/61  |
|    inductor    | 81%, 43/53 | 93%, 40/43  | 93%, 57/61  |
+----------------+------------+-------------+-------------+
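As an illustration of how a cell such as `92%, 49/53` could be produced, here is a small sketch. It assumes each model has a single status string like those in the accuracy tables, and that `pass_due_to_skip` counts as passing; both the helper name and that counting rule are assumptions, not something the dashboard states.

```python
def pass_rate(statuses):
    """Format a pass rate as 'NN%, passed/total' from a {model: status} mapping."""
    passing = {"pass", "pass_due_to_skip"}  # assumed set of passing statuses
    total = len(statuses)
    passed = sum(1 for status in statuses.values() if status in passing)
    pct = round(100 * passed / total) if total else 0
    return f"{pct}%, {passed}/{total}"
```

For example, 49 passing models out of 53 would be formatted as `92%, 49/53`.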

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.09x    |    0.0x     |    1.00x    |
|  aot_nvfuser   |   1.16x    |    0.0x     |    1.19x    |
|    inductor    |   1.71x    |    2.29x    |    1.31x    |
+----------------+------------+-------------+-------------+
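A geometric mean over per-model speedups, with failing models removed first, might look like the sketch below. The function name and the rule for dropping entries (status other than `pass`, or a non-positive speedup) are assumptions for illustration. Returning `0.0` when no model remains is also an assumption, chosen to mirror the `0.0x` cells in the table.

```python
import math


def geomean_speedup(speedups, statuses):
    """Geometric mean of speedups over models that passed the accuracy check.

    speedups: {model: speedup relative to native PyTorch}
    statuses: {model: accuracy status string}
    """
    kept = [
        value
        for name, value in speedups.items()
        if statuses.get(name) == "pass" and value > 0
    ]
    if not kept:
        return 0.0  # assumed placeholder when every model failed
    return math.exp(sum(math.log(value) for value in kept) / len(kept))
```

The geometric mean is a natural fit here because speedups are ratios: a 2x gain on one model and a 0.5x regression on another average to 1.0x rather than 1.25x.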

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    5.68    |    14.88    |    11.61    |
|   aot_eager    |   11.54    |    25.28    |    19.45    |
| aot_cudagraphs |    7.10    |     0.0     |    52.59    |
|  aot_nvfuser   |   29.15    |     0.0     |    78.59    |
|    inductor    |   215.79   |   112.71    |   397.63    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.86x    |    0.87x    |    0.88x    |
| aot_cudagraphs |   0.44x    |    0.0x     |    0.20x    |
|  aot_nvfuser   |   0.83x    |    0.0x     |    0.85x    |
|    inductor    |   0.77x    |    0.82x    |    0.89x    |
+----------------+------------+-------------+-------------+
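The comment does not spell out how the compression ratio is defined. One reading consistent with "higher is better" is the native PyTorch peak divided by the backend's peak, so that values below 1.0 mean the backend uses more memory. The sketch below assumes that definition; the function name and the byte inputs are hypothetical (in PyTorch, such peaks could be read with `torch.cuda.max_memory_allocated()`, for example).

```python
def compression_ratio(native_peak_bytes, backend_peak_bytes):
    """Peak-memory compression ratio under the assumed definition native / backend.

    Returns NaN when the backend measurement is missing or non-positive,
    mirroring the 'nan' cells used for failed runs in the per-model tables.
    """
    if backend_peak_bytes is None or backend_peak_bytes <= 0:
        return float("nan")
    return native_peak_bytes / backend_peak_bytes
```

Under this reading, a backend that peaks at 10 GB where native PyTorch peaks at 8 GB has a ratio of 0.8.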

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | densenet121 | 4 | 0.9967 | 0.9096 | 0.0 | 1.3892 | 4.9549 | | functorch_dp_cifar10 | 64 | 1.0015 | 0.917 | 0.0 | 1.1984 | 4.7933 | | timm_efficientdet | 1 | 0.9846 | 0.8101 | 0.0 | 0.0 | 4.1584 | | timm_vision_transformer | 8 | 1.0027 | 0.8733 | 0.0 | 1.3673 | 3.1169 | | BERT_pytorch | 16 | 1.0109 | 0.8357 | 0.0 | 0.0 | 3.0625 | | drq | 1 | 0.9999 | 0.7995 | 0.0 | 1.1068 | 3.0176 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9968 | 0.91 | 1.3034 | 1.2183 | 2.6191 | | resnet18 | 16 | 1.0025 | 0.9953 | 0.0 | 1.3 | 2.6171 | | dcgan | 32 | 0.9936 | 0.9007 | 1.1355 | 0.7427 | 2.5545 | | hf_Albert | 8 | 1.0004 | 0.9555 | 0.0 | 0.0 | 2.3929 | | pytorch_struct | 200 | 0.9898 | 0.7374 | 1.0146 | 1.0055 | 2.2997 | | hf_T5_large | 2 | 1.0167 | 0.8597 | 0.0 | 0.0 | 2.2877 | | squeezenet1_1 | 32 | 0.9992 | 0.9618 | 1.3556 | 1.1931 | 2.2867 | | resnext50_32x4d | 8 | 0.9987 | 0.9506 | 0.0 | 1.3205 | 2.1799 | | hf_T5 | 8 | 0.998 | 0.9443 | 0.0 | 0.0 | 2.1468 | | hf_Bart | 4 | 1.0159 | 0.8354 | 0.0 | 0.0 | 2.0957 | | lennard_jones | 1000 | 0.9747 | 0.7475 | 1.2902 | 1.0387 | 2.0271 | | hf_GPT2 | 4 | 1.0204 | 0.8863 | 0.0 | 0.0 | 2.02 | | mobilenet_v3_large | 32 | 1.0041 | 1.0119 | 0.0 | 1.4094 | 2.0149 | | mnasnet1_0 | 32 | 0.9976 | 1.0169 | 0.8904 | 1.406 | 1.9422 | | hf_Bert | 4 | 1.03 | 0.8543 | 0.0 | 0.0 | 1.8929 | | LearningToPaint | 96 | 1.0076 | 0.9997 | 0.0 | 1.3645 | 1.8479 | | timm_efficientnet | 32 | 0.9601 | 0.8092 | 0.0 | 1.1852 | 1.7838 | | attention_is_all_you_need_pytorch | 256 | 1.0074 | 0.9148 | 0.0 | 0.0 | 1.4975 | | hf_DistilBert | 8 | 1.0018 | 0.9687 | 0.0 | 0.0 | 1.4794 | | fastNLP_Bert | 6 | 0.9995 | 0.8925 | 0.0 | 0.0 | 1.4599 | | soft_actor_critic | 
256 | 1.0107 | 0.7358 | 1.2675 | 1.0585 | 1.4593 | | pytorch_unet | 1 | 0.9995 | 0.9929 | 0.0 | 1.156 | 1.3537 | | timm_nfnet | 128 | 0.9999 | 0.9989 | 0.0 | 1.1729 | 1.3393 | | pytorch_stargan | 16 | 0.9983 | 1.0276 | 0.8223 | 1.0905 | 1.3171 | | Super_SloMo | 6 | 0.9997 | 0.996 | 0.0 | 0.0 | 1.2922 | | shufflenet_v2_x1_0 | 128 | 1.0003 | 1.0161 | 0.0 | 1.3367 | 1.2891 | | vgg16 | 64 | 0.9998 | 0.9977 | 0.7983 | 0.995 | 1.2721 | | Background_Matting | 4 | 0.9992 | 1.0179 | 0.0 | 1.1151 | 1.2162 | | alexnet | 128 | 0.9996 | 0.9969 | 0.7886 | 1.0039 | 1.2097 | | timm_vision_transformer_large | 8 | 0.9991 | 0.9898 | 0.0 | 0.9925 | 1.1592 | | hf_Reformer | 4 | 0.996 | 0.9993 | 0.9194 | 0.0 | 1.1587 | | hf_BigBird | 2 | 0.9915 | 0.9188 | 0.0 | 0.0 | 1.1515 | | timm_resnest | 32 | 1.0016 | 1.021 | 0.0 | 1.3137 | 1.1411 | | timm_vovnet | 32 | 0.9188 | 0.8863 | 0.0 | 1.1259 | 1.0895 | | tts_angular | 64 | 0.9927 | 0.9561 | 1.013 | 1.0041 | 1.013 | | demucs | 4 | 0.9976 | 1.0028 | 1.0002 | 0.9991 | 0.9992 | | nvidia_deeprecommender | 256 | 0.9993 | 0.9958 | 0.6964 | 0.9789 | 0.9894 | | mobilenet_v2 | 96 | 0.9989 | 0.9868 | 0.0 | 0.9239 | 0.9572 | | resnet50 | 32 | 0.9998 | 1.0112 | 0.0 | 1.3797 | 0.8611 | | timm_regnet | 32 | 0.9796 | 0.9386 | 0.0 | 1.183 | 0.7142 | | yolov3 | 16 | 0.9991 | 0.9882 | 0.0 | 0.9225 | 0.0 | | dlrm | 2048 | 0.0 | 1.2114 | 0.0 | 0.0 | 0.0 | | hf_GPT2_large | 4 | 0.9994 | 0.9897 | 0.0 | 0.0 | 0.0 | | speech_transformer | 32 | 1.0051 | 0.841 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | 
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | alexnet | 2 | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | fail_to_run | pass | pass | | LearningToPaint | 2 | pass | pass | fail_to_run | pass | pass | | densenet121 | 2 | pass | pass | fail_to_run | pass | pass | | drq | 1 | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenet_v2 | 2 | pass | pass | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | fail_to_run | pass | pass | | resnet18 | 2 | pass | pass | fail_to_run | pass | pass | | resnet50 | 2 | pass | pass | fail_to_run | pass | pass | | resnext50_32x4d | 2 | pass | pass | fail_to_run | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_nfnet | 2 | pass | pass | 
fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | fail_to_run | pass | pass | | timm_vovnet | 2 | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | Super_SloMo | 2 | pass | pass | fail_to_run | fail_to_run | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | | fastNLP_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Albert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_BigBird | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_DistilBert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_GPT2 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | mobilenet_v3_large | 2 | pass | pass | fail_to_run | pass | fail_accuracy | | tts_angular | 2 | pass | pass | pass | pass | 0.0000 | | yolov3 | 2 | pass | pass | fail_to_run | fail_to_run | 0.0000 | 
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+-----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+---------+-----------+----------------+-------------+-----------+ | timm_efficientdet | 1 | 51.6778 | 76.315 | nan | nan | 1561.4016 | | densenet121 | 4 | 13.5643 | 28.592 | nan | 137.0015 | 1284.8246 | | hf_T5_large | 2 | 35.6597 | 74.143 | nan | nan | 1105.8679 | | mnasnet1_0 | 32 | 3.2898 | 8.4607 | 42.168 | 45.5088 | 759.1611 | | mobilenet_v3_large | 32 | 3.7676 | 9.0181 | nan | 74.5093 | 688.5086 | | mobilenet_v2 | 96 | 3.2506 | 8.2172 | nan | 42.8435 | 560.8749 | | resnext50_32x4d | 8 | 3.4647 | 8.893 | nan | 39.1118 | 547.4423 | | timm_efficientnet | 32 | 5.8589 | 12.2274 | nan | 72.3203 | 415.9198 | | shufflenet_v2_x1_0 | 128 | 3.7576 | 9.3836 | nan | 40.0699 | 357.9817 | | squeezenet1_1 | 32 | 0.6616 | 1.757 | 6.8976 | 6.7878 | 340.2634 | | timm_resnest | 32 | 1.3961 | 4.1828 | nan | 42.9349 | 303.1367 | | timm_nfnet | 128 | 6.7717 | 13.0572 | nan | 41.5275 | 293.2979 | | resnet50 | 32 | 3.4893 | 8.8065 | nan | 43.3898 | 266.9493 | | timm_regnet | 32 | 8.3816 | 16.5065 | nan | 65.8429 | 250.7005 | | attention_is_all_you_need_pytorch | 256 | 4.3319 | 12.2357 | nan | nan | 227.2056 | | timm_vovnet | 32 | 2.9802 | 7.155 | nan | 31.7436 | 193.0371 | | timm_vision_transformer_large | 8 | 22.5208 | 39.1907 | nan | 57.9043 | 179.7974 | | functorch_dp_cifar10 | 64 | 0.8221 | 2.5554 | nan | 6.2734 | 173.3004 | | timm_vision_transformer | 8 | 3.1139 | 7.985 | nan | 15.9315 | 168.9022 | | resnet18 | 16 | 1.0142 | 3.0134 | nan | 23.5973 | 160.1865 | | BERT_pytorch | 16 | 5.0367 | 13.1083 | nan | nan | 155.0612 | | LearningToPaint | 96 | 1.0619 | 3.0807 | nan | 30.7452 | 
147.1359 | | hf_T5 | 8 | 3.9756 | 12.0963 | nan | nan | 136.5616 | | pytorch_stargan | 16 | 0.8302 | 3.1387 | 11.232 | 7.4876 | 129.5533 | | Background_Matting | 4 | 4.0333 | 9.316 | nan | 46.2135 | 126.0886 | | fastNLP_Bert | 6 | 5.3273 | 12.3503 | nan | nan | 123.1903 | | hf_Bart | 4 | 7.4307 | 16.7695 | nan | nan | 118.9238 | | hf_GPT2 | 4 | 3.6314 | 9.7759 | nan | nan | 116.4341 | | pytorch_struct | 200 | 0.4314 | 1.2309 | 1.8299 | 5.3945 | 104.9248 | | Super_SloMo | 6 | 2.2312 | 6.7703 | nan | nan | 71.7895 | | hf_Albert | 8 | 1.4443 | 8.1906 | nan | nan | 67.9052 | | hf_Bert | 4 | 5.1953 | 12.1324 | nan | nan | 63.7948 | | hf_Reformer | 4 | 3.0625 | 5.7244 | 13.5967 | nan | 59.1986 | | hf_BigBird | 2 | 11.5114 | 19.9501 | nan | nan | 58.435 | | pytorch_unet | 1 | 1.1142 | 3.3664 | nan | 26.4582 | 47.7954 | | hf_DistilBert | 8 | 1.7462 | 5.1916 | nan | nan | 42.315 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.7671 | 3.1052 | 11.8215 | 5.0147 | 26.9302 | | vgg16 | 64 | 0.3619 | 1.0556 | 4.1097 | 3.6622 | 23.0057 | | drq | 1 | 0.2833 | 0.7285 | nan | 4.4646 | 17.2654 | | alexnet | 128 | 0.2731 | 0.6766 | 1.9416 | 3.2376 | 17.0553 | | dcgan | 32 | 0.2645 | 0.6215 | 1.8704 | 4.2904 | 12.3628 | | nvidia_deeprecommender | 256 | 0.284 | 0.6579 | 0.9898 | 2.9939 | 10.0325 | | soft_actor_critic | 256 | 0.2644 | 0.4764 | 0.7905 | 2.0864 | 8.4274 | | lennard_jones | 1000 | 0.2383 | 0.5024 | 0.6872 | 1.5169 | 4.9677 | | tts_angular | 64 | 0.3313 | 0.3901 | 0.5131 | 1.1311 | 2.5717 | | demucs | 4 | 0.8855 | 0.8786 | 0.8858 | 0.8788 | 0.7821 | | yolov3 | 16 | 7.5276 | 15.1126 | nan | 44.979 | nan | | hf_GPT2_large | 4 | 21.466 | 41.0854 | nan | nan | nan | | speech_transformer | 32 | 7.4505 | 16.6613 | nan | nan | nan | | dlrm | 2048 | nan | 1.1404 | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | | tacotron2 | 0 | nan | nan | nan | nan | nan | 
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | hf_Albert | 8 | 0.9814 | 0.936 | nan | nan | 1.1576 | | Super_SloMo | 6 | 1.0024 | 0.9697 | nan | nan | 1.1385 | | timm_nfnet | 128 | 0.9761 | 0.9043 | nan | 0.9504 | 1.0242 | | tts_angular | 64 | 1.0015 | 1.0015 | 0.9866 | 1.0015 | 0.9908 | | attention_is_all_you_need_pytorch | 256 | 0.9976 | 0.9403 | nan | nan | 0.9875 | | demucs | 4 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | | timm_efficientdet | 1 | 1.0316 | 0.8425 | nan | nan | 0.9858 | | BERT_pytorch | 16 | 0.9991 | 0.8819 | nan | nan | 0.9728 | | timm_efficientnet | 32 | 0.9982 | 0.7762 | nan | 0.7936 | 0.9689 | | hf_GPT2 | 4 | 0.971 | 0.8627 | nan | nan | 0.9645 | | Background_Matting | 4 | 1.0196 | 0.9679 | nan | 0.987 | 0.9244 | | mobilenet_v2 | 96 | 1.0001 | 0.7725 | nan | 0.9235 | 0.8856 | | pytorch_unet | 1 | 0.9968 | 0.8677 | nan | 0.8518 | 0.8681 | | fastNLP_Bert | 6 | 1.0013 | 0.8966 | nan | nan | 0.8661 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8751 | 0.2634 | 0.8432 | 0.8602 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8535 | | hf_DistilBert | 8 | 0.9505 | 0.8806 | nan | nan | 0.8387 | | hf_Bert | 4 | 0.9844 | 0.8677 | nan | nan | 0.8383 | | timm_regnet | 32 | 0.9999 | 0.8483 | nan | 0.85 | 0.8362 | | hf_Bart | 4 | 0.9099 | 0.8321 | nan | nan | 0.8151 | | hf_BigBird | 2 | 0.9852 | 0.9787 | nan | nan | 0.81 | | timm_vovnet | 32 | 0.9903 | 0.7754 | nan | 0.7817 | 0.7861 | | shufflenet_v2_x1_0 | 128 | 1.0002 | 0.874 | nan | 0.8652 | 0.7812 | | pytorch_stargan | 16 | 0.9929 | 0.9799 | 0.2149 | 0.8882 | 0.7783 | | dcgan | 32 | 1.0 | 0.7949 | 0.343 | 0.7073 
| 0.7527 | | vgg16 | 64 | 0.9998 | 0.7378 | 0.2978 | 0.7172 | 0.7491 | | timm_vision_transformer_large | 8 | 0.9987 | 0.8366 | nan | 0.8491 | 0.7487 | | alexnet | 128 | 1.0003 | 0.8082 | 0.4354 | 0.805 | 0.7352 | | hf_T5 | 8 | 0.9678 | 0.9371 | nan | nan | 0.7266 | | timm_resnest | 32 | 0.9868 | 0.8809 | nan | 0.8726 | 0.722 | | timm_vision_transformer | 8 | 1.0001 | 0.8868 | nan | 0.8871 | 0.7151 | | resnet50 | 32 | 1.0004 | 0.8678 | nan | 0.8041 | 0.6751 | | mnasnet1_0 | 32 | 0.9994 | 0.8793 | 0.173 | 0.8217 | 0.6596 | | squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.295 | 0.7589 | 0.6595 | | mobilenet_v3_large | 32 | 0.999 | 0.8661 | nan | 0.874 | 0.6573 | | resnext50_32x4d | 8 | 1.0 | 0.8591 | nan | 0.823 | 0.6514 | | drq | 1 | 0.9125 | 0.8399 | nan | 0.8395 | 0.6406 | | soft_actor_critic | 256 | 0.964 | 0.9151 | 0.4737 | 0.9151 | 0.6279 | | LearningToPaint | 96 | 0.9252 | 0.7196 | nan | 0.71 | 0.605 | | densenet121 | 4 | 1.0 | 0.8696 | nan | 0.8376 | 0.574 | | resnet18 | 16 | 0.9782 | 0.7852 | nan | 0.7268 | 0.5644 | | lennard_jones | 1000 | 1.0 | 1.0002 | 0.3735 | 1.0967 | 0.564 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5262 | 0.5596 | 0.5596 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8131 | nan | 0.846 | 0.4465 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5082 | 0.4235 | | hf_Reformer | 4 | 0.3764 | 0.9993 | 0.2539 | nan | 0.3629 | | yolov3 | 16 | 1.0054 | 0.8488 | nan | 0.8244 | nan | | speech_transformer | 32 | 1.0015 | 0.9177 | nan | nan | nan | | hf_GPT2_large | 4 | 0.9586 | 0.8649 | nan | nan | nan | | dlrm | 2048 | nan | 0.7282 | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | | tacotron2 | 0 | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ ~~~

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | ElectraForCausalLM | 1 | 1.0305 | 0.841 | 0.0 | 0.0 | 6.4479 | | MT5ForConditionalGeneration | 2 | 1.0219 | 0.8629 | 0.0 | 0.0 | 5.9505 | | MobileBertForMaskedLM | 16 | 1.0133 | 0.825 | 0.0 | 0.0 | 5.7095 | | MobileBertForQuestionAnswering | 32 | 1.0128 | 0.8271 | 0.0 | 0.0 | 5.2263 | | MegatronBertForCausalLM | 2 | 1.0371 | 0.8533 | 0.0 | 0.0 | 4.7686 | | YituTechConvBert | 1 | 1.0204 | 0.843 | 0.0 | 0.0 | 4.4867 | | OPTForCausalLM | 4 | 1.014 | 0.8312 | 0.0 | 0.0 | 4.4473 | | CamemBert | 1 | 1.0396 | 0.8479 | 0.0 | 0.0 | 4.0705 | | RobertaForCausalLM | 4 | 1.0385 | 0.8416 | 0.0 | 0.0 | 3.957 | | M2M100ForConditionalGeneration | 2 | 1.0398 | 0.8205 | 0.0 | 0.0 | 3.8162 | | PegasusForConditionalGeneration | 4 | 1.0104 | 0.828 | 0.0 | 0.0 | 3.2005 | | MegatronBertForQuestionAnswering | 8 | 1.0348 | 0.8577 | 0.0 | 0.0 | 3.0658 | | XGLMForCausalLM | 1 | 1.013 | 0.8145 | 0.0 | 0.0 | 3.0613 | | DistillGPT2 | 1 | 1.0299 | 0.8924 | 0.0 | 0.0 | 2.7432 | | MBartForConditionalGeneration | 8 | 1.0143 | 0.8354 | 0.0 | 0.0 | 2.734 | | PLBartForConditionalGeneration | 8 | 1.0174 | 0.8363 | 0.0 | 0.0 | 2.7186 | | DistilBertForMaskedLM | 16 | 1.0288 | 0.8563 | 0.0 | 0.0 | 2.2064 | | GPT2ForSequenceClassification | 4 | 0.9981 | 0.9775 | 0.0 | 0.0 | 2.159 | | Speech2Text2ForCausalLM | 64 | 1.0037 | 0.8487 | 0.0 | 0.0 | 2.1102 | | DistilBertForQuestionAnswering | 32 | 1.0327 | 0.8488 | 0.0 | 0.0 | 2.0944 | | BartForConditionalGeneration | 1 | 1.0227 | 0.8343 | 0.0 | 0.0 | 2.0432 | | ElectraForQuestionAnswering | 64 | 0.9986 | 0.9773 | 0.0 | 0.0 | 1.9714 | | BlenderbotSmallForConditionalGeneration | 32 | 1.0134 | 0.8917 | 0.0 | 0.0 | 1.9272 | | TrOCRForCausalLM | 8 
| 1.0107 | 0.8345 | 0.0 | 0.0 | 1.855 | | PegasusForCausalLM | 8 | 1.0087 | 0.8191 | 0.0 | 0.0 | 1.8131 | | LayoutLMForSequenceClassification | 16 | 0.9978 | 0.9792 | 0.0 | 0.0 | 1.7478 | | T5ForConditionalGeneration | 4 | 0.9988 | 0.9365 | 0.0 | 0.0 | 1.695 | | AlbertForQuestionAnswering | 2 | 1.0007 | 0.8085 | 0.0 | 0.0 | 1.6688 | | AlbertForMaskedLM | 2 | 1.0007 | 0.8083 | 0.0 | 0.0 | 1.6612 | | XLNetLMHeadModel | 4 | 1.0 | 0.963 | 0.0 | 0.0 | 1.5949 | | LayoutLMForMaskedLM | 16 | 0.9986 | 0.97 | 0.0 | 0.0 | 1.5936 | | PLBartForCausalLM | 16 | 1.0129 | 0.9321 | 0.0 | 0.0 | 1.5853 | | T5Small | 1 | 1.0296 | 0.8925 | 0.0 | 0.0 | 1.5687 | | MBartForCausalLM | 16 | 1.0117 | 0.9215 | 0.0 | 0.0 | 1.5056 | | DebertaForQuestionAnswering | 4 | 0.9363 | 0.7268 | 0.9281 | 0.0 | 1.494 | | BartForCausalLM | 2 | 1.0021 | 0.9642 | 0.0 | 0.0 | 1.4697 | | BertForQuestionAnswering | 64 | 0.9972 | 0.9687 | 0.0 | 0.0 | 1.4503 | | RobertaForQuestionAnswering | 64 | 0.9979 | 0.9532 | 0.0 | 0.0 | 1.4467 | | BertForMaskedLM | 64 | 0.9975 | 0.9547 | 0.0 | 0.0 | 1.3312 | | BlenderbotSmallForCausalLM | 64 | 1.0008 | 0.9228 | 0.0 | 0.0 | 1.3039 | | DebertaForMaskedLM | 4 | 0.9343 | 0.7295 | 0.7896 | 0.0 | 1.2209 | | BigBird | 1 | 0.9916 | 0.9128 | 0.0 | 0.0 | 1.1516 | | AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+ | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | 
pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BigBird | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | CamemBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DistilBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DistillGPT2 | 1 | pass | pass | fail_to_run | fail_to_run | pass | | ElectraForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | ElectraForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | | LayoutLMForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MegatronBertForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | PLBartForCausalLM | 1 | pass | pass | fail_to_run | 
fail_to_run | pass | | PegasusForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | PegasusForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | RobertaForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | RobertaForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | | TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | | YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+----+----------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+----------+-----------+----------------+-------------+----------+ | XLNetLMHeadModel | 4 | 17.6179 | 39.6668 | nan | nan | 307.6319 | | MobileBertForMaskedLM | 16 | 133.5692 | 170.8431 | nan | nan | 286.9824 | | MobileBertForQuestionAnswering | 32 | 132.5387 | 170.3262 | nan | nan | 278.2839 | | T5ForConditionalGeneration | 4 | 3.8127 | 12.0723 | nan | 
nan | 206.6256 | | M2M100ForConditionalGeneration | 2 | 26.0986 | 44.545 | nan | nan | 195.2289 | | MT5ForConditionalGeneration | 2 | 6.481 | 19.2831 | nan | nan | 190.4766 | | YituTechConvBert | 1 | 9.2564 | 20.0505 | nan | nan | 180.8604 | | MBartForConditionalGeneration | 8 | 26.6452 | 45.9743 | nan | nan | 161.8561 | | PegasusForConditionalGeneration | 4 | 26.2002 | 43.7253 | nan | nan | 155.866 | | DebertaForMaskedLM | 4 | 7.5448 | 14.5382 | 53.5019 | nan | 149.3218 | | BartForConditionalGeneration | 1 | 26.2887 | 45.4355 | nan | nan | 145.9362 | | XGLMForCausalLM | 1 | 15.2816 | 29.6837 | nan | nan | 145.1717 | | T5Small | 1 | 3.8167 | 12.0291 | nan | nan | 136.6475 | | MegatronBertForQuestionAnswering | 8 | 16.77 | 31.1625 | nan | nan | 135.7728 | | MegatronBertForCausalLM | 2 | 17.4451 | 31.1547 | nan | nan | 133.7836 | | BlenderbotSmallForConditionalGeneration | 32 | 12.2822 | 24.5287 | nan | nan | 119.2236 | | PLBartForConditionalGeneration | 8 | 7.6989 | 17.1246 | nan | nan | 118.2111 | | DebertaForQuestionAnswering | 4 | 7.3815 | 14.6553 | 53.7852 | nan | 116.1217 | | RobertaForCausalLM | 4 | 5.3546 | 12.5031 | nan | nan | 94.3553 | | LayoutLMForSequenceClassification | 16 | 5.5432 | 12.7514 | nan | nan | 87.7957 | | PegasusForCausalLM | 8 | 10.1602 | 16.7714 | nan | nan | 85.5372 | | ElectraForQuestionAnswering | 64 | 5.2363 | 12.2405 | nan | nan | 82.7418 | | LayoutLMForMaskedLM | 16 | 5.6199 | 12.9982 | nan | nan | 78.3614 | | OPTForCausalLM | 4 | 4.888 | 11.5661 | nan | nan | 78.2897 | | MBartForCausalLM | 16 | 10.3936 | 16.9171 | nan | nan | 78.0137 | | BartForCausalLM | 2 | 9.9759 | 16.5991 | nan | nan | 75.759 | | GPT2ForSequenceClassification | 4 | 3.8248 | 10.0255 | nan | nan | 74.4963 | | BertForMaskedLM | 64 | 5.1393 | 12.0977 | nan | nan | 72.9445 | | ElectraForCausalLM | 1 | 5.3613 | 12.2367 | nan | nan | 66.8685 | | TrOCRForCausalLM | 8 | 10.2076 | 16.6156 | nan | nan | 66.3959 | | DistilBertForQuestionAnswering | 32 | 1.8752 | 5.4203 | 
nan | nan | 64.5461 | | AlbertForMaskedLM | 2 | 1.5134 | 8.6449 | nan | nan | 63.0632 | | BigBird | 1 | 11.4363 | 19.7217 | nan | nan | 61.6302 | | CamemBert | 1 | 5.1796 | 12.282 | nan | nan | 61.2855 | | BlenderbotSmallForCausalLM | 64 | 4.9225 | 9.5268 | nan | nan | 59.7408 | | BertForQuestionAnswering | 64 | 5.0836 | 12.169 | nan | nan | 58.515 | | PLBartForCausalLM | 16 | 3.2631 | 6.537 | nan | nan | 57.4481 | | RobertaForQuestionAnswering | 64 | 5.133 | 12.1708 | nan | nan | 56.8695 | | DistillGPT2 | 1 | 1.58 | 4.6392 | nan | nan | 55.2511 | | Speech2Text2ForCausalLM | 64 | 3.1898 | 6.7214 | nan | nan | 54.2535 | | DistilBertForMaskedLM | 16 | 1.9262 | 5.4418 | nan | nan | 48.5942 | | AlbertForQuestionAnswering | 2 | 1.6013 | 8.4355 | nan | nan | 41.7357 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | +-----------------------------------------+----+----------+-----------+----------------+-------------+----------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | GPT2ForSequenceClassification | 4 | 0.9675 | 0.9163 | nan | nan | 1.07 | | XLNetLMHeadModel | 4 | 0.9912 | 0.8791 | nan | nan | 1.0109 | | ElectraForQuestionAnswering | 64 | 1.0016 | 0.9539 | nan | nan | 1.0002 | | T5Small | 1 | 1.0 | 0.9124 | nan | nan | 0.9876 | | LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | nan | nan | 0.9871 | | BertForMaskedLM | 64 | 0.9996 | 0.899 | nan | nan | 0.9811 | | LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | nan | nan | 0.9712 | | BlenderbotSmallForConditionalGeneration | 32 | 0.9998 | 0.8996 | nan | nan | 0.9557 | | BartForCausalLM | 2 | 1.0 | 0.8769 | nan | nan | 0.9545 | | T5ForConditionalGeneration | 4 | 0.9996 | 0.9594 | nan | nan | 0.9525 | | 
Speech2Text2ForCausalLM | 64 | 0.9954 | 0.8456 | nan | nan | 0.9452 | | PLBartForCausalLM | 16 | 1.0006 | 0.8667 | nan | nan | 0.9395 | | BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | nan | nan | 0.9269 | | BertForQuestionAnswering | 64 | 0.9995 | 0.9315 | nan | nan | 0.9256 | | RobertaForQuestionAnswering | 64 | 0.9995 | 0.9315 | nan | nan | 0.9254 | | DistilBertForMaskedLM | 16 | 0.9991 | 0.8698 | nan | nan | 0.9167 | | BartForConditionalGeneration | 1 | 1.0 | 0.8619 | nan | nan | 0.881 | | AlbertForQuestionAnswering | 2 | 1.0 | 0.6451 | nan | nan | 0.8636 | | MBartForCausalLM | 16 | 1.0 | 0.8398 | nan | nan | 0.8565 | | AlbertForMaskedLM | 2 | 1.0 | 0.6364 | nan | nan | 0.8515 | | BigBird | 1 | 1.0024 | 0.9555 | nan | nan | 0.8349 | | DistilBertForQuestionAnswering | 32 | 0.9987 | 0.8967 | nan | nan | 0.8334 | | PLBartForConditionalGeneration | 8 | 0.9999 | 0.8304 | nan | nan | 0.8252 | | DistillGPT2 | 1 | 1.0006 | 0.7548 | nan | nan | 0.812 | | MBartForConditionalGeneration | 8 | 0.9999 | 0.8187 | nan | nan | 0.7699 | | TrOCRForCausalLM | 8 | 1.0 | 0.7955 | nan | nan | 0.7566 | | CamemBert | 1 | 0.9989 | 0.7872 | nan | nan | 0.7482 | | OPTForCausalLM | 4 | 0.9975 | 0.7501 | nan | nan | 0.7473 | | YituTechConvBert | 1 | 0.9718 | 0.7819 | nan | nan | 0.7407 | | PegasusForCausalLM | 8 | 0.999 | 0.9444 | nan | nan | 0.7324 | | RobertaForCausalLM | 4 | 0.9237 | 0.7741 | nan | nan | 0.7309 | | XGLMForCausalLM | 1 | 0.9999 | 0.9992 | nan | nan | 0.7214 | | MegatronBertForQuestionAnswering | 8 | 0.9051 | 0.8218 | nan | nan | 0.7107 | | MobileBertForMaskedLM | 16 | 0.9985 | 0.8983 | nan | nan | 0.6948 | | PegasusForConditionalGeneration | 4 | 0.9996 | 0.9196 | nan | nan | 0.6769 | | ElectraForCausalLM | 1 | 0.9993 | 0.8955 | nan | nan | 0.6701 | | MegatronBertForCausalLM | 2 | 0.7726 | 0.7726 | nan | nan | 0.6697 | | M2M100ForConditionalGeneration | 2 | 1.0046 | 0.9497 | nan | nan | 0.6614 | | MobileBertForQuestionAnswering | 32 | 1.0142 | 0.9796 | nan | nan | 
0.6265 | | MT5ForConditionalGeneration | 2 | 0.6019 | 0.6019 | nan | nan | 0.6019 | | DebertaForMaskedLM | 4 | 0.9982 | 0.9826 | 0.3599 | nan | 0.4498 | | DebertaForQuestionAnswering | 4 | 0.979 | 1.0568 | 0.3578 | nan | 0.3761 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ ~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | res2net50_14w_8s | 2 | 0.9998 | 0.9139 | 0.0 | 1.3802 | 5.4677 | | hrnet_w18 | 2 | 1.0017 | 0.9585 | 0.0 | 1.369 | 4.8554 | | res2next50 | 2 | 0.9994 | 0.9278 | 0.0 | 1.364 | 4.5511 | | twins_pcpvt_base | 32 | 1.0033 | 0.8985 | 0.0 | 1.3637 | 2.5585 | | cait_m36_384 | 2 | 1.0002 | 0.8416 | 0.0 | 1.3573 | 2.3142 | | xcit_large_24_p8_224 | 5 | 1.0003 | 0.0 | 0.0 | 0.0 | 2.2223 | | tnt_s_patch16_224 | 64 | 0.9993 | 0.9911 | 0.0 | 1.8322 | 2.0033 | | ghostnet_100 | 128 | 1.0033 | 0.9993 | 0.0 | 1.5251 | 1.8714 | | gmixer_24_224 | 64 | 1.0003 | 0.8866 | 0.6404 | 1.0146 | 1.685 | | volo_d1_224 | 64 | 0.9994 | 0.9935 | 0.0 | 1.1513 | 1.6636 | | crossvit_9_240 | 64 | 1.0044 | 0.9607 | 0.0 | 1.1615 | 1.6279 | | nfnet_l0 | 64 | 1.0053 | 0.835 | 0.0 | 1.1326 | 1.6068 | | lcnet_050 | 128 | 0.9699 | 0.952 | 0.0 | 1.5557 | 1.5851 | | swin_base_patch4_window7_224 | 64 | 0.9993 | 0.9614 | 0.0 | 1.0645 | 1.5371 | | coat_lite_mini | 128 | 0.9995 | 0.9957 | 0.0 | 1.2656 | 1.53 | | regnety_002 | 128 | 0.9783 | 0.9332 | 0.0 | 1.3828 | 1.5147 | | jx_nest_base | 32 | 0.9989 | 0.9914 | 0.0 | 1.2388 | 1.4568 | | resmlp_12_224 | 128 | 1.0004 | 0.998 | 0.7817 | 0.0 | 1.4415 | | gmlp_s16_224 | 64 | 0.9991 | 0.9835 | 0.0 | 1.0521 | 1.4274 | | resnest101e | 32 | 1.0057 | 0.9834 | 0.0 | 1.4266 | 1.408 | | convit_base | 32 | 0.9992 | 0.9916 | 0.0 | 0.0 | 1.404 | | pit_b_224 | 64 | 0.9995 | 0.9942 | 0.0 | 1.0682 | 1.3627 | | mixer_b16_224 | 64 | 0.9994 | 0.9914 | 0.7158 | 0.9678 | 1.3053 | | deit_base_distilled_patch16_224 | 64 | 0.9995 | 0.9912 | 0.0 | 1.0704 | 1.2869 | | beit_base_patch16_224 | 64 | 0.9997 | 0.9779 | 0.0 | 1.0498 | 1.2855 | | dm_nfnet_f0 | 128 | 0.998 | 0.998 | 0.0 | 
1.1769 | 1.2806 | | adv_inception_v3 | 128 | 0.9999 | 0.9953 | 0.0 | 1.1943 | 1.2302 | | gluon_inception_v3 | 128 | 0.9998 | 0.9949 | 0.0 | 1.1948 | 1.2192 | | poolformer_m36 | 64 | 0.9994 | 0.9979 | 0.0 | 0.0 | 1.2154 | | vit_base_patch16_224 | 64 | 0.9996 | 0.9932 | 0.0 | 0.9997 | 1.1986 | | inception_v3 | 128 | 1.0 | 0.9909 | 0.0 | 1.1951 | 1.161 | | mixnet_l | 64 | 0.9791 | 0.8891 | 0.0 | 1.0717 | 1.0913 | | visformer_small | 128 | 1.0003 | 1.0008 | 0.0 | 1.0865 | 1.0813 | | mobilevit_s | 32 | 0.9725 | 0.7991 | 0.0 | 1.214 | 1.0725 | | tf_mixnet_l | 64 | 0.9815 | 0.8994 | 0.0 | 1.0653 | 1.0629 | | pnasnet5large | 16 | 1.0049 | 1.0324 | 0.0 | 1.1288 | 1.0268 | | dla102 | 64 | 0.9989 | 0.9909 | 0.0 | 1.3787 | 1.0037 | | mnasnet_100 | 128 | 0.9532 | 0.9444 | 0.6661 | 1.3675 | 0.9873 | | fbnetv3_b | 128 | 0.9531 | 0.9403 | 0.0 | 1.2576 | 0.984 | | mobilenetv3_large_100 | 128 | 0.9544 | 0.9645 | 0.0 | 1.3458 | 0.9124 | | cspdarknet53 | 64 | 0.9431 | 0.9337 | 0.0 | 0.9007 | 0.9021 | | convmixer_768_32 | 32 | 0.9997 | 0.9978 | 0.0 | 1.0526 | 0.8983 | | dpn107 | 32 | 0.9318 | 0.9269 | 0.0 | 0.9758 | 0.8969 | | spnasnet_100 | 128 | 0.9461 | 0.9371 | 0.6562 | 1.3164 | 0.8965 | | selecsls42b | 128 | 0.9998 | 0.9942 | 0.0 | 1.3571 | 0.8944 | | res2net101_26w_4s | 64 | 1.0012 | 0.9975 | 0.0 | 1.3978 | 0.8776 | | mobilenetv2_100 | 128 | 0.9506 | 0.9419 | 0.0 | 0.8663 | 0.8611 | | tinynet_a | 128 | 0.9574 | 0.7997 | 0.0 | 1.076 | 0.8526 | | repvgg_a2 | 128 | 0.9414 | 0.9341 | 0.6584 | 1.1312 | 0.8335 | | tf_efficientnet_b0 | 128 | 0.9644 | 0.8033 | 0.0 | 1.095 | 0.825 | | gernet_l | 128 | 0.9448 | 0.9369 | 0.0 | 1.1404 | 0.7845 | | fbnetc_100 | 128 | 0.952 | 0.9417 | 0.674 | 1.3769 | 0.7503 | | convnext_base | 32 | 1.0049 | 0.9333 | 0.0 | 1.2117 | 0.7498 | | eca_botnext26ts_256 | 64 | 0.9626 | 0.8006 | 0.0 | 1.1068 | 0.742 | | ese_vovnet19b_dw | 128 | 0.9688 | 0.9647 | 0.0 | 1.2439 | 0.7183 | | sebotnet33ts_256 | 64 | 0.9661 | 0.8366 | 0.0 | 1.1168 | 0.7063 | | 
eca_halonext26ts | 64 | 0.9633 | 0.8053 | 0.0 | 1.1018 | 0.7011 | | rexnet_100 | 128 | 0.9637 | 0.8494 | 0.0 | 1.0369 | 0.6808 | | botnet26t_256 | 128 | 0.9788 | 0.9746 | 0.0 | 1.3462 | 0.6385 | | swsl_resnext101_32x16d | 32 | 0.9986 | 0.98 | 0.0 | 1.0749 | 0.6356 | | gluon_xception65 | 32 | 0.9988 | 0.9869 | 0.0 | 1.0631 | 0.6086 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+---------------+---------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+ | fbnetc_100 | 2 | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | | adv_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | botnet26t_256 | 2 | pass | pass | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | pass | pass | | crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass | | cspdarknet53 | 2 | pass | pass | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | dla102 | 2 | pass | pass | fail_to_run | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | fail_to_run | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | fail_to_run | pass | pass | | eca_halonext26ts | 2 | pass | pass | fail_to_run | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | fail_to_run | pass | pass | | gernet_l | 2 | pass | pass | fail_to_run | pass | pass | | ghostnet_100 | 2 | pass | pass | fail_to_run | pass | pass | | gluon_inception_v3 | 2 | 
pass | pass | fail_to_run | pass | pass | | inception_v3 | 2 | pass | pass | fail_to_run | pass | pass | | lcnet_050 | 2 | pass | pass | fail_to_run | pass | pass | | mixnet_l | 2 | pass | pass | fail_to_run | pass | pass | | mobilenetv2_100 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | fail_to_run | pass | pass | | nfnet_l0 | 2 | pass | pass | fail_to_run | pass | pass | | pnasnet5large | 2 | pass | pass | fail_to_run | pass | pass | | regnety_002 | 2 | pass | pass | fail_to_run | pass | pass | | res2net101_26w_4s | 2 | pass | pass | fail_to_run | pass | pass | | res2net50_14w_8s | 2 | pass | pass | fail_to_run | pass | pass | | res2next50 | 2 | pass | pass | fail_to_run | pass | pass | | rexnet_100 | 2 | pass | pass | fail_to_run | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | fail_to_run | pass | pass | | selecsls42b | 2 | pass | pass | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | fail_to_run | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | fail_to_run | pass | pass | | tf_mixnet_l | 2 | pass | pass | fail_to_run | pass | pass | | tinynet_a | 2 | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | visformer_small | 2 | pass | pass | fail_to_run | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | | gmixer_24_224 | 2 | pass | pass | pass | fail_accuracy | pass | | gmlp_s16_224 | 2 | pass | pass | pass | fail_accuracy | pass | | mixer_b16_224 | 2 | pass | pass | pass | 
fail_accuracy | pass | | poolformer_m36 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | | resnest101e | 2 | pass | pass | fail_to_run | fail_accuracy | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | jx_nest_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | pit_b_224 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | twins_pcpvt_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | gluon_xception65 | 2 | pass | pass | fail_to_run | pass | fail_accuracy | | hrnet_w18 | 2 | pass | pass | fail_to_run | pass | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | | fbnetv3_b | 2 | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+---------+-----------+----------------+-------------+-----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+-----+---------+-----------+----------------+-------------+-----------+ | hrnet_w18 | 2 | 99.0979 | 139.3754 | nan | 463.9864 | 1335.9117 | | dpn107 | 32 | 13.787 | 27.9523 | nan | 110.6804 | 1217.6839 | | pnasnet5large | 16 | 60.3829 | 87.8325 | nan | 248.5446 | 1163.1213 | | rexnet_100 | 128 | 6.6118 | 13.7642 | nan | 121.3675 | 963.7513 | | res2net50_14w_8s | 2 | 20.2395 | 37.6106 | nan | 122.0419 | 928.2062 | | mobilevit_s | 32 | 5.9861 | 13.2909 | nan | 61.2292 | 866.3209 | | eca_botnext26ts_256 | 64 | 2.5743 | 6.9839 | nan | 63.7844 | 795.1044 | | twins_pcpvt_base | 32 | 26.1399 | 43.1348 | nan | 95.5654 | 757.3713 | | mixnet_l | 64 | 13.3341 | 22.5842 | nan | 87.9736 | 755.6059 | | ghostnet_100 | 128 | 9.3384 | 18.4048 | nan | 96.6532 | 
682.6303 | | tinynet_a | 128 | 7.7081 | 15.1645 | nan | 84.1684 | 670.1545 | | resnest101e | 32 | 26.1437 | 45.7787 | nan | 123.6997 | 629.5917 | | fbnetv3_b | 128 | 12.9847 | 23.2924 | nan | 109.3657 | 628.1691 | | coat_lite_mini | 128 | 3.237 | 8.7565 | nan | 34.2442 | 591.0094 | | fbnetc_100 | 128 | 5.694 | 12.217 | 84.0962 | 63.2725 | 561.7668 | | sebotnet33ts_256 | 64 | 3.9547 | 9.6054 | nan | 69.5973 | 553.5012 | | botnet26t_256 | 128 | 2.4441 | 6.5802 | nan | 50.3226 | 525.648 | | eca_halonext26ts | 64 | 2.7144 | 7.2299 | nan | 66.9848 | 482.7002 | | res2next50 | 2 | 7.4752 | 16.7251 | nan | 64.4957 | 479.389 | | dla102 | 64 | 10.7026 | 22.4382 | nan | 96.1266 | 471.0122 | | tf_mixnet_l | 64 | 13.961 | 22.9985 | nan | 88.72 | 464.9684 | | cspdarknet53 | 64 | 6.2238 | 13.3097 | nan | 44.6251 | 462.2021 | | res2net101_26w_4s | 64 | 25.7344 | 45.7234 | nan | 142.637 | 395.0089 | | mnasnet_100 | 128 | 4.1954 | 9.3198 | 61.348 | 52.9114 | 388.1856 | | adv_inception_v3 | 128 | 8.598 | 18.5391 | nan | 104.0052 | 385.8574 | | tf_efficientnet_b0 | 128 | 5.9727 | 12.614 | nan | 81.4109 | 382.4551 | | nfnet_l0 | 64 | 6.072 | 12.7626 | nan | 38.3849 | 361.8579 | | swin_base_patch4_window7_224 | 64 | 12.1335 | 25.372 | nan | 80.6123 | 361.6558 | | convnext_base | 32 | 11.6385 | 18.8718 | nan | 45.8694 | 359.4631 | | regnety_002 | 128 | 5.0311 | 10.6266 | nan | 59.42 | 353.906 | | mobilenetv2_100 | 128 | 4.2209 | 9.0704 | nan | 42.6452 | 347.8093 | | visformer_small | 128 | 2.3983 | 6.3456 | nan | 31.6313 | 331.8432 | | xcit_large_24_p8_224 | 5 | 37.759 | nan | nan | nan | 328.7567 | | ese_vovnet19b_dw | 128 | 2.1377 | 4.9109 | nan | 39.6491 | 324.5284 | | mobilenetv3_large_100 | 128 | 4.528 | 10.5561 | nan | 84.4514 | 305.6073 | | gluon_xception65 | 32 | 15.8294 | 28.8359 | nan | 77.5442 | 299.1182 | | jx_nest_base | 32 | 9.7064 | 19.4019 | nan | 58.4165 | 292.2664 | | cait_m36_384 | 2 | 47.2806 | 71.1456 | nan | 106.2065 | 290.8927 | | crossvit_9_240 | 64 | 7.6978 | 
16.8783 | nan | 42.5032 | 282.5118 | | poolformer_m36 | 64 | 13.2782 | 20.9495 | nan | nan | 266.8697 | | gernet_l | 128 | 4.9148 | 10.906 | nan | 47.155 | 236.2008 | | selecsls42b | 128 | 2.4352 | 7.0271 | nan | 51.9936 | 234.33 | | lcnet_050 | 128 | 2.0668 | 5.0306 | nan | 39.1151 | 218.1046 | | spnasnet_100 | 128 | 5.5499 | 11.8294 | 82.3631 | 61.1525 | 198.4862 | | swsl_resnext101_32x16d | 32 | 10.4299 | 21.7146 | nan | 62.1379 | 191.4235 | | volo_d1_224 | 64 | 6.6938 | 14.7397 | nan | 43.8981 | 188.4197 | | inception_v3 | 128 | 8.5993 | 18.9598 | nan | 105.0869 | 188.3471 | | gluon_inception_v3 | 128 | 8.6691 | 18.4348 | nan | 104.8996 | 181.4554 | | tnt_s_patch16_224 | 64 | 12.2527 | 24.6088 | nan | 48.277 | 168.2284 | | convit_base | 32 | 4.0121 | 10.3663 | nan | nan | 161.784 | | pit_b_224 | 64 | 3.8475 | 9.3151 | nan | 27.1769 | 161.3686 | | gmlp_s16_224 | 64 | 9.5917 | 17.1302 | nan | 30.0903 | 137.9361 | | gmixer_24_224 | 64 | 8.7309 | 17.4092 | 61.7159 | 34.1668 | 127.9428 | | repvgg_a2 | 128 | 4.8452 | 10.4251 | 52.4543 | 64.5302 | 117.378 | | dm_nfnet_f0 | 128 | 6.5598 | 13.2818 | nan | 41.8458 | 115.8439 | | resmlp_12_224 | 128 | 2.8258 | 5.9335 | 9.778 | nan | 86.2363 | | mixer_b16_224 | 64 | 3.0313 | 6.8812 | 16.4068 | 18.0477 | 85.4161 | | deit_base_distilled_patch16_224 | 64 | 3.1395 | 7.9059 | nan | 16.7699 | 75.4263 | | beit_base_patch16_224 | 64 | 4.7743 | 10.3559 | nan | 21.286 | 74.7182 | | convmixer_768_32 | 32 | 7.1496 | 14.2997 | nan | 23.4112 | 73.5333 | | vit_base_patch16_224 | 64 | 3.0296 | 7.7881 | nan | 16.2793 | 59.3368 | +---------------------------------+-----+---------+-----------+----------------+-------------+-----------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | 
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | gmixer_24_224 | 64 | 1.0008 | 0.9563 | 0.2215 | 0.8991 | 1.2587 | | gmlp_s16_224 | 64 | 1.0 | 0.9679 | nan | 0.92 | 1.2405 | | tinynet_a | 128 | 1.0001 | 0.7955 | nan | 0.7958 | 1.1632 | | pnasnet5large | 16 | 1.0583 | 0.9923 | nan | 1.1741 | 1.1265 | | eca_halonext26ts | 64 | 0.9885 | 0.7814 | nan | 0.786 | 1.0888 | | dm_nfnet_f0 | 128 | 0.9758 | 0.9039 | nan | 0.95 | 1.0614 | | tnt_s_patch16_224 | 64 | 1.0 | 0.9718 | nan | 0.9431 | 1.0587 | | volo_d1_224 | 64 | 1.0015 | 0.9518 | nan | 0.8587 | 1.0378 | | convit_base | 32 | 0.9991 | 0.86 | nan | nan | 1.0309 | | beit_base_patch16_224 | 64 | 0.9999 | 0.9367 | nan | 0.9298 | 1.0097 | | mobilevit_s | 32 | 1.0 | 0.7722 | nan | 0.787 | 1.0078 | | rexnet_100 | 128 | 0.9988 | 0.7919 | nan | 0.8648 | 1.001 | | dla102 | 64 | 0.9998 | 0.9549 | nan | 0.9751 | 0.9969 | | pit_b_224 | 64 | 1.0021 | 0.8074 | nan | 0.8179 | 0.9856 | | poolformer_m36 | 64 | 1.0015 | 0.9462 | nan | nan | 0.9797 | | convnext_base | 32 | 1.0065 | 0.908 | nan | 0.7521 | 0.9564 | | twins_pcpvt_base | 32 | 0.9963 | 0.9079 | nan | 0.8007 | 0.9553 | | convmixer_768_32 | 32 | 0.9992 | 0.9807 | nan | 0.9715 | 0.9513 | | visformer_small | 128 | 0.9899 | 0.9353 | nan | 0.8884 | 0.9342 | | resnest101e | 32 | 1.0002 | 0.9762 | nan | 0.9535 | 0.9292 | | tf_mixnet_l | 64 | 0.9995 | 0.8624 | nan | 0.8426 | 0.9291 | | mixer_b16_224 | 64 | 0.9929 | 0.9425 | 0.2532 | 0.7726 | 0.9225 | | tf_efficientnet_b0 | 128 | 1.0006 | 0.7769 | nan | 0.846 | 0.919 | | nfnet_l0 | 64 | 0.9993 | 0.824 | nan | 0.8257 | 0.913 | | mobilenetv2_100 | 128 | 0.9992 | 0.7716 | nan | 0.9249 | 0.8963 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9384 | nan | 0.8801 | 0.8916 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9376 | nan | 0.8794 | 0.8911 | | mobilenetv3_large_100 | 128 | 0.9987 | 0.8562 | nan | 0.8673 | 0.8886 | | adv_inception_v3 | 128 | 1.0003 | 0.8759 | nan | 0.8538 | 
0.8829 | | gluon_inception_v3 | 128 | 1.0003 | 0.8759 | nan | 0.8538 | 0.8829 | | inception_v3 | 128 | 1.0003 | 0.8759 | nan | 0.8538 | 0.8829 | | gluon_xception65 | 32 | 1.0 | 0.8895 | nan | 0.8854 | 0.8713 | | dpn107 | 32 | 0.9981 | 0.9115 | nan | 0.8834 | 0.87 | | selecsls42b | 128 | 0.9789 | 0.8913 | nan | 0.8811 | 0.8659 | | fbnetv3_b | 128 | 1.0003 | 0.7918 | nan | 0.7903 | 0.8647 | | mixnet_l | 64 | 0.9989 | 0.8507 | nan | 0.7796 | 0.8601 | | spnasnet_100 | 128 | 0.9988 | 0.8961 | 0.1651 | 0.8371 | 0.8599 | | eca_botnext26ts_256 | 64 | 0.9998 | 0.7776 | nan | 0.7811 | 0.8534 | | swsl_resnext101_32x16d | 32 | 1.0009 | 0.8805 | nan | 0.8487 | 0.8523 | | xcit_large_24_p8_224 | 5 | 0.9987 | nan | nan | nan | 0.8489 | | resmlp_12_224 | 128 | 0.9827 | 0.9667 | 0.2637 | nan | 0.845 | | ghostnet_100 | 128 | 1.0013 | 0.8903 | nan | 0.9244 | 0.8329 | | coat_lite_mini | 128 | 1.0338 | 0.929 | nan | 0.6593 | 0.8328 | | ese_vovnet19b_dw | 128 | 1.0 | 0.867 | nan | 0.9146 | 0.8269 | | cspdarknet53 | 64 | 1.0 | 0.8467 | nan | 0.7906 | 0.813 | | cait_m36_384 | 2 | 0.9998 | 0.8806 | nan | 0.9023 | 0.8081 | | jx_nest_base | 32 | 1.0 | 0.8945 | nan | 0.86 | 0.8 | | crossvit_9_240 | 64 | 1.0008 | 0.8801 | nan | 0.8854 | 0.7934 | | res2net101_26w_4s | 64 | 0.9999 | 0.9202 | nan | 0.8569 | 0.7834 | | mnasnet_100 | 128 | 0.9993 | 0.8882 | 0.1669 | 0.8253 | 0.773 | | swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9234 | nan | 0.8451 | 0.7676 | | sebotnet33ts_256 | 64 | 0.9999 | 0.7108 | nan | 0.7354 | 0.7449 | | gernet_l | 128 | 0.9998 | 0.8655 | nan | 0.8299 | 0.7238 | | fbnetc_100 | 128 | 0.9984 | 0.8631 | 0.1626 | 0.7352 | 0.7104 | | lcnet_050 | 128 | 0.9992 | 0.7927 | nan | 0.7885 | 0.705 | | regnety_002 | 128 | 0.9994 | 0.8284 | nan | 0.7819 | 0.6975 | | botnet26t_256 | 128 | 1.0 | 0.8755 | nan | 0.78 | 0.6615 | | res2next50 | 2 | 1.0 | 0.8301 | nan | 0.8198 | 0.6012 | | res2net50_14w_8s | 2 | 1.0 | 0.8275 | nan | 0.8169 | 0.5927 | | hrnet_w18 | 2 | 1.0 | 0.8383 | nan | 0.8363 
| 0.5746 | | repvgg_a2 | 128 | 1.0003 | 0.7971 | 0.1444 | 0.6902 | 0.5572 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ ~~~

Performance graphs

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/7JfkKH3.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/44zu6xw.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/cpJtTyO.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of a forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint compression ratio.

Caveats
1) Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring speedup, compilation latency and memory footprint reduction, we exclude the models that fail the accuracy checks.
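As a rough illustration of the normalization and filtering described above, the sketch below computes a per-model speedup as baseline latency divided by backend latency and skips any model whose accuracy status is not `pass`. The function name, the dictionary layout and the latency numbers are hypothetical; the actual benchmark harness may compute this differently.

```python
def speedups(baseline_ms, backend_ms, accuracy):
    """Return {model: speedup} for models that passed the accuracy check.

    Speedup is baseline latency / backend latency, so values above 1.0
    mean the backend is faster than the baseline.
    """
    result = {}
    for model, status in accuracy.items():
        if status != "pass":
            continue  # failing models are excluded from the measurement
        result[model] = baseline_ms[model] / backend_ms[model]
    return result


# Hypothetical latencies in milliseconds, for illustration only.
baseline_ms = {"model_a": 10.0, "model_b": 8.0, "model_c": 6.0}
backend_ms = {"model_a": 5.0, "model_b": 10.0, "model_c": 3.0}
accuracy = {"model_a": "pass", "model_b": "pass", "model_c": "fail_accuracy"}

print(speedups(baseline_ms, backend_ms, accuracy))
```

Here `model_c` is dropped even though its raw timing looks like a 2x win, which is the point of filtering on accuracy first.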

Passrate

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      | 89%, 49/55 | 98%, 43/44  | 100%, 61/61 |
|   aot_eager    | 87%, 48/55 | 98%, 43/44  | 90%, 55/61  |
| aot_cudagraphs | 25%, 14/55 |  0%, 0/44   |  2%, 1/61   |
|  aot_nvfuser   | 58%, 32/55 |  2%, 1/44   | 82%, 50/61  |
|    inductor    | 84%, 46/55 | 93%, 41/44  | 97%, 59/61  |
+----------------+------------+-------------+-------------+
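Each cell in the pass rate table has the form `percent, passed/total`. A minimal sketch of how such a cell could be derived from a list of per-model accuracy statuses is shown below. The helper name is hypothetical, and which statuses count as passing (for example, whether `pass_due_to_skip` is included) is an assumption exposed as a parameter rather than something the dashboard states.

```python
def pass_rate(statuses, passing=("pass",)):
    """Format a pass-rate cell such as '89%, 49/55'.

    `passing` lists the status strings treated as a pass; the default
    is an assumption, not the dashboard's documented rule.
    """
    passed = sum(1 for status in statuses if status in passing)
    total = len(statuses)
    return f"{100 * passed / total:.0f}%, {passed}/{total}"


# Synthetic status list sized to match one cell of the table (49 of 55);
# the breakdown of the six failures is made up for illustration.
statuses = ["pass"] * 49 + ["fail_to_run"] * 6
print(pass_rate(statuses))  # 89%, 49/55
```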

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.03x    |    0.0x     |    1.00x    |
|  aot_nvfuser   |   1.13x    |    1.11x    |    1.12x    |
|    inductor    |   1.49x    |    1.64x    |    1.35x    |
+----------------+------------+-------------+-------------+
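For reference, a geometric mean over per-model speedups can be computed as the exponential of the mean of the logs. The sketch below assumes that a speedup of `0.0` (or `nan`) marks a model that did not produce a valid measurement and should be skipped, and that an empty result is reported as `0.0`. Both are assumptions made for illustration; the helper name is hypothetical.

```python
import math


def geomean_speedup(values):
    """Geometric mean of positive speedups; 0.0 and nan entries are skipped."""
    kept = [v for v in values if not math.isnan(v) and v > 0.0]
    if not kept:
        return 0.0  # assumed placeholder when no model produced a result
    return math.exp(sum(math.log(v) for v in kept) / len(kept))


# Made-up speedups: the 0.0 entry stands for a model that failed to run.
print(f"{geomean_speedup([2.0, 0.5, 4.0, 0.0]):.2f}x")  # 1.59x
```

A geometric mean is the usual choice for averaging ratios because a 2x speedup and a 0.5x slowdown cancel out to 1.0x, which an arithmetic mean would not give.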

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    1.76    |    2.14     |    2.02     |
|   aot_eager    |    6.40    |    9.10     |    8.87     |
| aot_cudagraphs |    4.48    |     0.0     |    5.79     |
|  aot_nvfuser   |   20.37    |    9.44     |    49.20    |
|    inductor    |   131.38   |   102.01    |   213.14    |
+----------------+------------+-------------+-------------+
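The per-model compilation latency tables contain `nan` entries for models that did not run. A minimal sketch of an arithmetic mean that ignores those entries follows. The helper name and the `0.0` fallback for an empty list are assumptions, not the dashboard's documented behavior.

```python
import math


def mean_compile_time(latencies_sec):
    """Arithmetic mean of compilation latencies in seconds, ignoring nan."""
    kept = [t for t in latencies_sec if not math.isnan(t)]
    return sum(kept) / len(kept) if kept else 0.0


# Made-up latencies; the nan entry stands for a model that failed to run.
print(f"{mean_compile_time([2.0, float('nan'), 4.0]):.2f}")  # 3.00
```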

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    0.99x    |
|   aot_eager    |   0.86x    |    0.89x    |    0.87x    |
| aot_cudagraphs |   0.41x    |    0.0x     |    0.25x    |
|  aot_nvfuser   |   0.83x    |    1.08x    |    0.84x    |
|    inductor    |   0.84x    |    0.77x    |    0.95x    |
+----------------+------------+-------------+-------------+
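The comments here do not spell out the formula for the compression ratio. One reading that is consistent with "higher is better" is baseline peak memory divided by backend peak memory, so a value below 1.0 means the backend uses more memory than the baseline. The sketch below uses that definition as an explicit assumption; the function name and byte counts are hypothetical.

```python
def compression_ratio(baseline_peak_bytes, backend_peak_bytes):
    """Assumed definition: baseline peak memory / backend peak memory."""
    return baseline_peak_bytes / backend_peak_bytes


# Made-up peaks: baseline uses 8 GiB, the compiled backend uses 10 GiB.
ratio = compression_ratio(8 * 1024**3, 10 * 1024**3)
print(f"{ratio:.2f}x")  # 0.80x
```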

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | densenet121 | 4 | 1.0022 | 1.0056 | 0.0 | 1.4544 | 5.2009 | | timm_efficientdet | 1 | 0.9838 | 0.8876 | 0.0 | 0.0 | 4.2126 | | functorch_dp_cifar10 | 64 | 0.999 | 0.9933 | 0.0 | 1.1961 | 3.6586 | | timm_vision_transformer | 8 | 1.0053 | 0.9219 | 0.0 | 1.3393 | 2.7292 | | drq | 1 | 1.0073 | 0.8137 | 0.0 | 1.0725 | 2.4432 | | resnext50_32x4d | 8 | 1.0033 | 1.053 | 0.0 | 1.3317 | 2.1015 | | mobilenet_v3_large | 32 | 1.0055 | 1.1184 | 0.0 | 1.3849 | 2.079 | | BERT_pytorch | 16 | 1.0097 | 0.8826 | 0.0 | 0.0 | 1.8631 | | pytorch_struct | 200 | 1.008 | 0.736 | 0.8823 | 0.9846 | 1.8608 | | resnet18 | 16 | 1.0058 | 1.1215 | 0.0 | 1.4023 | 1.8249 | | lennard_jones | 1000 | 0.9697 | 0.8417 | 1.0649 | 1.0102 | 1.7782 | | squeezenet1_1 | 32 | 1.0032 | 1.0 | 0.9697 | 1.1615 | 1.753 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.999 | 0.9576 | 1.3201 | 1.2037 | 1.7521 | | hf_Albert | 8 | 1.0012 | 0.9956 | 0.0 | 0.0 | 1.6573 | | dcgan | 32 | 0.9445 | 1.0147 | 1.0422 | 1.1791 | 1.6502 | | shufflenet_v2_x1_0 | 128 | 1.0003 | 1.0852 | 0.0 | 1.1981 | 1.6387 | | hf_T5_large | 2 | 1.0258 | 0.8969 | 0.0 | 0.0 | 1.5504 | | timm_resnest | 32 | 0.9998 | 1.0033 | 0.0 | 1.1847 | 1.5201 | | timm_nfnet | 128 | 0.9994 | 0.9996 | 0.0 | 1.2125 | 1.4753 | | mnasnet1_0 | 32 | 1.0002 | 1.1048 | 0.7027 | 1.2986 | 1.4659 | | hf_GPT2 | 4 | 1.01 | 0.9778 | 0.0 | 0.0 | 1.4239 | | mobilenet_v2 | 96 | 0.9997 | 0.9998 | 0.0 | 1.0391 | 1.4232 | | fastNLP_Bert | 6 | 0.999 | 0.9712 | 0.0 | 0.0 | 1.3718 | | soft_actor_critic | 256 | 0.973 | 0.8119 | 1.0368 | 0.9557 | 1.3261 | | timm_efficientnet | 32 | 0.9563 | 0.8181 | 0.0 | 1.0657 | 1.3067 | | LearningToPaint | 96 | 1.0052 | 1.0615 | 0.0 | 1.2309 | 
1.3053 | | resnet50 | 32 | 0.9992 | 0.994 | 0.0 | 1.1627 | 1.205 | | pytorch_unet | 1 | 0.9998 | 0.9988 | 0.0 | 1.0768 | 1.2 | | Super_SloMo | 6 | 1.0 | 0.9969 | 0.0 | 0.0 | 1.1814 | | hf_Bart | 4 | 1.013 | 0.974 | 0.0 | 0.0 | 1.1754 | | vgg16 | 64 | 0.9999 | 0.999 | 0.792 | 0.997 | 1.1709 | | alexnet | 128 | 0.9993 | 0.9979 | 0.7777 | 1.0008 | 1.163 | | hf_Bert | 4 | 1.0321 | 0.9366 | 0.0 | 0.0 | 1.1622 | | hf_DistilBert | 8 | 1.0005 | 0.9564 | 0.0 | 0.0 | 1.1519 | | timm_regnet | 32 | 0.9662 | 0.9642 | 0.0 | 1.0959 | 1.1341 | | pytorch_stargan | 16 | 0.9984 | 0.9809 | 0.7288 | 0.987 | 1.1213 | | Background_Matting | 4 | 1.0005 | 1.0217 | 0.0 | 1.082 | 1.1141 | | hf_Reformer | 4 | 0.9961 | 0.0 | 0.8941 | 0.0 | 1.1097 | | hf_BigBird | 2 | 0.9912 | 0.9483 | 0.0 | 0.0 | 1.0763 | | yolov3 | 16 | 1.0 | 0.995 | 0.0 | 1.1849 | 1.0746 | | timm_vision_transformer_large | 8 | 1.0013 | 0.9943 | 0.0 | 0.9823 | 1.0532 | | attention_is_all_you_need_pytorch | 256 | 0.9999 | 0.9719 | 0.0 | 0.0 | 1.0425 | | tts_angular | 64 | 0.9871 | 0.9542 | 0.9837 | 1.007 | 1.0047 | | demucs | 4 | 1.0 | 1.0002 | 0.9991 | 1.0001 | 1.0003 | | timm_vovnet | 32 | 0.9126 | 0.9036 | 0.0 | 0.9786 | 0.9892 | | nvidia_deeprecommender | 256 | 0.9995 | 0.9633 | 0.5844 | 0.9422 | 0.9046 | | hf_T5 | 8 | 1.0009 | 0.9918 | 0.0 | 0.0 | 0.0 | | hf_GPT2_large | 4 | 1.0004 | 0.9804 | 0.0 | 0.0 | 0.0 | | speech_transformer | 32 | 1.0032 | 0.9015 | 0.0 | 0.0 | 0.0 | | dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilenet_v2_quantized_qat | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | resnet50_quantized_qat | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ 
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | alexnet | 2 | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | fail_to_run | pass | pass | | LearningToPaint | 2 | pass | pass | fail_to_run | pass | pass | | densenet121 | 2 | pass | pass | fail_to_run | pass | pass | | drq | 1 | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenet_v2 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenet_v3_large | 2 | pass | pass | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | fail_to_run | pass | pass | | resnet18 | 2 | 
pass | pass | fail_to_run | pass | pass | | resnet50 | 2 | pass | pass | fail_to_run | pass | pass | | resnext50_32x4d | 2 | pass | pass | fail_to_run | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | fail_to_run | pass | pass | | timm_vovnet | 2 | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | Super_SloMo | 2 | pass | pass | fail_to_run | fail_to_run | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | | fastNLP_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Albert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_BigBird | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_DistilBert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_GPT2 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | | yolov3 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | 
mobilenet_v2_quantized_qat | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+---------+-----------+----------------+-------------+----------+ | yolov3 | 16 | 2.852 | 8.8025 | nan | 43.2264 | 834.5667 | | timm_efficientdet | 1 | 19.6005 | 38.0517 | nan | nan | 780.0168 | | hf_T5_large | 2 | 12.4163 | 41.2037 | nan | nan | 519.4725 | | densenet121 | 4 | 2.1945 | 13.6586 | nan | 89.2202 | 307.282 | | attention_is_all_you_need_pytorch | 256 | 1.0811 | 7.3219 | nan | nan | 253.749 | | timm_resnest | 32 | 0.5654 | 2.7325 | nan | 35.0693 | 226.5594 | | timm_vision_transformer | 8 | 0.7318 | 4.1023 | nan | 9.0512 | 184.8576 | | mobilenet_v3_large | 32 | 0.8745 | 4.9777 | nan | 53.5229 | 164.2159 | | BERT_pytorch | 16 | 1.4586 | 7.4362 | nan | nan | 161.1107 | | timm_vision_transformer_large | 8 | 2.1743 | 13.6492 | nan | 24.1662 | 151.6599 | | timm_efficientnet | 32 | 1.7009 | 6.7188 | nan | 52.4848 | 136.5229 | | fastNLP_Bert | 6 | 1.3882 | 6.6513 | nan | nan | 136.4023 | | pytorch_stargan | 16 | 0.4024 | 2.4373 | 9.2403 | 4.0267 | 135.5749 | | mobilenet_v2 | 96 | 0.7828 | 4.6697 | nan | 37.2951 | 133.8033 | | hf_Bart | 4 | 1.3508 | 7.9191 | nan | nan | 131.1878 | | hf_GPT2 | 4 | 1.2194 | 5.9845 | nan | nan | 128.688 | | mnasnet1_0 | 32 | 0.8055 | 4.7542 | 21.0664 | 31.2578 | 125.661 | 
| pytorch_struct | 200 | 0.2357 | 0.7743 | 1.5365 | 4.1118 | 121.4775 |
| shufflenet_v2_x1_0 | 128 | 0.9448 | 5.95 | nan | 27.0319 | 104.2267 |
| timm_nfnet | 128 | 1.8191 | 7.415 | nan | 29.537 | 103.7278 |
| resnext50_32x4d | 8 | 0.8908 | 5.2354 | nan | 29.0877 | 98.4197 |
| timm_vovnet | 32 | 1.512 | 4.7245 | nan | 23.8082 | 90.9277 |
| timm_regnet | 32 | 2.2296 | 8.5346 | nan | 47.2423 | 82.1711 |
| Super_SloMo | 6 | 0.9402 | 4.9556 | nan | nan | 81.8682 |
| hf_Albert | 8 | 0.9341 | 5.5692 | nan | nan | 73.1288 |
| resnet50 | 32 | 0.8693 | 5.1374 | nan | 32.1725 | 71.2104 |
| hf_Bert | 4 | 1.3528 | 6.0969 | nan | nan | 67.6511 |
| hf_Reformer | 4 | 2.3118 | nan | 12.8556 | nan | 65.6298 |
| functorch_dp_cifar10 | 64 | 0.3519 | 1.9863 | nan | 5.5953 | 65.244 |
| Background_Matting | 4 | 0.7269 | 4.6671 | nan | 29.3082 | 64.9394 |
| resnet18 | 16 | 0.4316 | 1.9408 | nan | 17.7099 | 63.7446 |
| pytorch_unet | 1 | 0.4395 | 2.1424 | nan | 19.6082 | 63.716 |
| LearningToPaint | 96 | 0.432 | 2.0613 | nan | 23.8935 | 56.4957 |
| hf_BigBird | 2 | 7.1325 | 13.2837 | nan | nan | 55.0936 |
| hf_DistilBert | 8 | 0.4251 | 2.8964 | nan | nan | 47.9177 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3797 | 2.3411 | 8.012 | 3.9035 | 31.3399 |
| squeezenet1_1 | 32 | 0.2311 | 0.939 | 2.6246 | 4.5054 | 30.4021 |
| vgg16 | 64 | 0.1849 | 0.6506 | 2.1837 | 2.4324 | 18.7241 |
| alexnet | 128 | 0.1545 | 0.4068 | 1.3655 | 2.4079 | 14.8785 |
| drq | 1 | 0.1343 | 0.4544 | nan | 3.3517 | 14.8192 |
| dcgan | 32 | 0.1845 | 0.4461 | 1.3915 | 3.7557 | 14.3419 |
| nvidia_deeprecommender | 256 | 0.1829 | 0.402 | 0.6972 | 2.4366 | 10.8437 |
| soft_actor_critic | 256 | 0.1897 | 0.3277 | 0.632 | 1.5219 | 9.788 |
| lennard_jones | 1000 | 0.135 | 0.2818 | 0.4335 | 1.0451 | 4.9188 |
| tts_angular | 64 | 0.2049 | 0.2723 | 0.3929 | 0.9841 | 4.1684 |
| demucs | 4 | 0.3 | 0.3159 | 0.2921 | 0.307 | 0.2244 |
| hf_GPT2_large | 4 | 4.696 | 18.8761 | nan | nan | nan |
| hf_T5 | 8 | 2.0157 | 9.0035 | nan | nan | nan |
| speech_transformer | 32 | 1.5841 | 8.279 | nan | nan | nan |
| dlrm | 0 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan |
| mobilenet_v2_quantized_qat | 0 | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan |
| resnet50_quantized_qat | 0 | nan | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
| timm_efficientnet | 32 | 0.9937 | 0.7666 | nan | 0.7837 | 1.3106 |
| Super_SloMo | 6 | 1.0024 | 0.9527 | nan | nan | 1.1857 |
| timm_efficientdet | 1 | 1.0111 | 0.823 | nan | nan | 1.1165 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | nan | 0.7638 | 1.1005 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.2763 | 0.9742 | 1.0823 |
| timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.9478 | 1.0219 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | nan | 0.8662 | 0.9791 |
| hf_GPT2 | 4 | 0.9548 | 0.887 | nan | nan | 0.9505 |
| timm_regnet | 32 | 0.9985 | 0.8614 | nan | 0.8784 | 0.9284 |
| yolov3 | 16 | 0.9957 | 0.844 | nan | 0.8814 | 0.9231 |
| Background_Matting | 4 | 0.9998 | 0.9492 | nan | 0.9749 | 0.9139 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.2026 | 1.0085 | 0.9023 |
| timm_resnest | 32 | 0.9927 | 0.88 | nan | 0.8024 | 0.8974 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9149 | 0.2326 | 0.9141 | 0.8848 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | nan | 0.8681 | 0.8829 |
| hf_Albert | 8 | 0.9333 | 0.9333 | nan | nan | 0.8804 |
| hf_T5_large | 2 | 0.922 | 0.8722 | nan | nan | 0.8737 |
| pytorch_unet | 1 | 0.9985 | 0.8521 | nan | 0.8496 | 0.859 |
| resnet50 | 32 | 0.9942 | 0.8719 | nan | 0.797 | 0.8565 |
| hf_Bert | 4 | 0.9683 | 0.8952 | nan | nan | 0.8564 |
| densenet121 | 4 | 0.9904 | 0.8812 | nan | 0.8551 | 0.8562 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.1623 | 0.8263 | 0.8531 |
| hf_Bart | 4 | 0.9617 | 0.878 | nan | nan | 0.8531 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | nan | nan | 0.8343 |
| resnext50_32x4d | 8 | 0.9954 | 0.8671 | nan | 0.8203 | 0.8303 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | 0.801 | 0.8284 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | nan | 0.7741 | 0.8251 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 |
| hf_BigBird | 2 | 0.9604 | 0.9604 | nan | nan | 0.8205 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | nan | 0.816 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | nan | nan | 0.7841 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.3293 | 0.7634 | 0.767 |
| drq | 1 | 0.987 | 0.8777 | nan | 0.8772 | 0.7632 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 |
| alexnet | 128 | 0.9542 | 0.745 | 0.3738 | 0.7455 | 0.743 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | nan | 0.8104 | 0.712 |
| resnet18 | 16 | 0.9831 | 0.7792 | nan | 0.6971 | 0.6902 |
| LearningToPaint | 96 | 0.9442 | 0.6902 | nan | 0.6274 | 0.6899 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.2528 | 0.6639 | 0.6471 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 1.0947 | 0.5646 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4621 | 0.5598 | 0.5598 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | nan | 0.8227 | 0.4056 |
| hf_Reformer | 4 | 0.3011 | nan | 0.1798 | nan | 0.299 |
| hf_T5 | 8 | 0.9527 | 0.9445 | nan | nan | nan |
| speech_transformer | 32 | 0.9982 | 0.9159 | nan | nan | nan |
| hf_GPT2_large | 4 | 0.936 | 0.8768 | nan | nan | nan |
| dlrm | 0 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan |
| mobilenet_v2_quantized_qat | 0 | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan |
| resnet50_quantized_qat | 0 | nan | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| MT5ForConditionalGeneration | 2 | 1.0267 | 0.9006 | 0.0 | 0.0 | 5.0142 |
| ElectraForCausalLM | 1 | 1.0418 | 0.914 | 0.0 | 0.0 | 4.5852 |
| YituTechConvBert | 1 | 1.028 | 0.9323 | 0.0 | 0.0 | 3.1147 |
| MobileBertForMaskedLM | 16 | 1.0238 | 0.9124 | 0.0 | 0.0 | 2.8809 |
| MegatronBertForCausalLM | 2 | 1.0386 | 0.9368 | 0.0 | 0.0 | 2.8791 |
| RobertaForCausalLM | 4 | 1.0465 | 0.9138 | 0.0 | 0.0 | 2.826 |
| MobileBertForQuestionAnswering | 32 | 0.972 | 0.8903 | 0.0 | 0.0 | 2.8063 |
| OPTForCausalLM | 4 | 1.0178 | 0.9019 | 0.0 | 0.0 | 2.8057 |
| XGLMForCausalLM | 1 | 1.015 | 0.8672 | 0.0 | 0.0 | 2.6862 |
| CamemBert | 1 | 1.0505 | 0.9388 | 0.0 | 0.0 | 2.6346 |
| M2M100ForConditionalGeneration | 2 | 1.0643 | 0.8763 | 0.0 | 0.0 | 2.6081 |
| PegasusForConditionalGeneration | 4 | 1.0109 | 0.9006 | 0.0 | 0.0 | 2.3996 |
| DistillGPT2 | 1 | 1.0331 | 0.9401 | 0.0 | 0.0 | 2.2899 |
| GoogleFnet | 1 | 1.0042 | 0.8143 | 0.0 | 1.111 | 1.8532 |
| PLBartForConditionalGeneration | 8 | 1.0167 | 0.9092 | 0.0 | 0.0 | 1.6989 |
| GPT2ForSequenceClassification | 4 | 1.0005 | 0.9773 | 0.0 | 0.0 | 1.6717 |
| MegatronBertForQuestionAnswering | 8 | 1.041 | 0.9364 | 0.0 | 0.0 | 1.5637 |
| MBartForConditionalGeneration | 8 | 1.0156 | 0.9129 | 0.0 | 0.0 | 1.464 |
| XLNetLMHeadModel | 4 | 1.0008 | 0.9635 | 0.0 | 0.0 | 1.4343 |
| T5ForConditionalGeneration | 4 | 1.0015 | 0.9594 | 0.0 | 0.0 | 1.4234 |
| ElectraForQuestionAnswering | 64 | 1.0002 | 0.9751 | 0.0 | 0.0 | 1.3597 |
| AlbertForQuestionAnswering | 2 | 1.001 | 1.0031 | 0.0 | 0.0 | 1.303 |
| AlbertForMaskedLM | 2 | 1.0008 | 1.0008 | 0.0 | 0.0 | 1.2966 |
| T5Small | 1 | 1.0182 | 0.9491 | 0.0 | 0.0 | 1.2827 |
| DebertaForQuestionAnswering | 4 | 0.9315 | 0.7403 | 0.7877 | 0.0 | 1.2626 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9883 | 0.0 | 0.0 | 1.2564 |
| TrOCRForCausalLM | 8 | 1.0141 | 0.9401 | 0.0 | 0.0 | 1.2438 |
| PegasusForCausalLM | 8 | 1.0113 | 0.9188 | 0.0 | 0.0 | 1.226 |
| BartForConditionalGeneration | 1 | 1.0145 | 0.8909 | 0.0 | 0.0 | 1.2146 |
| Speech2Text2ForCausalLM | 64 | 1.0079 | 0.9289 | 0.0 | 0.0 | 1.2058 |
| PLBartForCausalLM | 16 | 1.0096 | 0.9481 | 0.0 | 0.0 | 1.189 |
| DistilBertForQuestionAnswering | 32 | 1.0294 | 0.9855 | 0.0 | 0.0 | 1.1847 |
| DistilBertForMaskedLM | 16 | 1.0311 | 0.9766 | 0.0 | 0.0 | 1.1776 |
| LayoutLMForMaskedLM | 16 | 1.0002 | 0.9701 | 0.0 | 0.0 | 1.1692 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0124 | 0.936 | 0.0 | 0.0 | 1.1621 |
| DebertaForMaskedLM | 4 | 0.9294 | 0.8226 | 0.7229 | 0.0 | 1.1154 |
| BartForCausalLM | 2 | 1.0003 | 0.9663 | 0.0 | 0.0 | 1.1048 |
| BigBird | 1 | 0.9947 | 0.9356 | 0.0 | 0.0 | 1.0958 |
| MBartForCausalLM | 16 | 1.0065 | 0.9641 | 0.0 | 0.0 | 1.0932 |
| BertForQuestionAnswering | 64 | 1.0006 | 0.9837 | 0.0 | 0.0 | 1.0923 |
| RobertaForQuestionAnswering | 64 | 1.0005 | 0.9842 | 0.0 | 0.0 | 1.0918 |
| BertForMaskedLM | 64 | 1.0003 | 0.9616 | 0.0 | 0.0 | 1.0382 |
| BlenderbotSmallForCausalLM | 64 | 1.0012 | 0.9092 | 0.0 | 0.0 | 1.0055 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+
| GoogleFnet | 1 | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BigBird | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| CamemBert | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistilBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DistillGPT2 | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| ElectraForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MegatronBertForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PLBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PegasusForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| RobertaForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_accuracy | fail_to_run | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| XLNetLMHeadModel | 4 | 3.5596 | 21.0783 | nan | nan | 312.0249 |
| M2M100ForConditionalGeneration | 2 | 2.5809 | 15.2486 | nan | nan | 210.4089 |
| MT5ForConditionalGeneration | 2 | 3.1548 | 13.1984 | nan | nan | 176.5319 |
| YituTechConvBert | 1 | 2.0595 | 9.5292 | nan | nan | 173.5309 |
| T5Small | 1 | 1.9751 | 8.9992 | nan | nan | 155.2823 |
| XGLMForCausalLM | 1 | 2.2003 | 11.9917 | nan | nan | 154.3692 |
| DebertaForMaskedLM | 4 | 4.5837 | 10.864 | 50.3046 | nan | 151.585 |
| T5ForConditionalGeneration | 4 | 1.9764 | 9.0097 | nan | nan | 145.7872 |
| MBartForConditionalGeneration | 8 | 2.8053 | 15.4307 | nan | nan | 144.1118 |
| PegasusForConditionalGeneration | 4 | 2.6133 | 14.3869 | nan | nan | 139.7853 |
| MobileBertForMaskedLM | 16 | 7.8085 | 27.5867 | nan | nan | 139.6162 |
| BartForConditionalGeneration | 1 | 2.7435 | 14.8617 | nan | nan | 131.0181 |
| PLBartForConditionalGeneration | 8 | 1.3766 | 7.7415 | nan | nan | 129.9398 |
| MegatronBertForCausalLM | 2 | 3.1214 | 12.6994 | nan | nan | 127.2605 |
| MegatronBertForQuestionAnswering | 8 | 3.0405 | 12.7303 | nan | nan | 125.8278 |
| MobileBertForQuestionAnswering | 32 | 7.7234 | 27.5121 | nan | nan | 123.5152 |
| DebertaForQuestionAnswering | 4 | 4.5478 | 11.3028 | 50.0742 | nan | 116.463 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.7337 | 9.7605 | nan | nan | 114.8053 |
| RobertaForCausalLM | 4 | 1.4089 | 6.2231 | nan | nan | 99.0261 |
| LayoutLMForSequenceClassification | 16 | 1.47 | 6.3872 | nan | nan | 88.4851 |
| GPT2ForSequenceClassification | 4 | 1.2504 | 5.9627 | nan | nan | 86.9855 |
| PegasusForCausalLM | 8 | 1.0397 | 5.5399 | nan | nan | 81.3617 |
| BertForMaskedLM | 64 | 1.3007 | 6.2134 | nan | nan | 80.6289 |
| ElectraForQuestionAnswering | 64 | 1.3165 | 6.1662 | nan | nan | 80.5771 |
| MBartForCausalLM | 16 | 0.9793 | 5.7255 | nan | nan | 78.1954 |
| OPTForCausalLM | 4 | 1.075 | 5.9214 | nan | nan | 78.0491 |
| LayoutLMForMaskedLM | 16 | 1.4685 | 6.7345 | nan | nan | 73.6977 |
| BartForCausalLM | 2 | 1.0041 | 5.4568 | nan | nan | 69.3039 |
| DistillGPT2 | 1 | 0.6369 | 2.9506 | nan | nan | 65.9677 |
| ElectraForCausalLM | 1 | 1.3983 | 6.1053 | nan | nan | 65.3051 |
| Speech2Text2ForCausalLM | 64 | 0.5271 | 3.1655 | nan | nan | 64.4654 |
| DistilBertForQuestionAnswering | 32 | 0.4682 | 3.0485 | nan | nan | 63.8914 |
| TrOCRForCausalLM | 8 | 0.9821 | 5.4934 | nan | nan | 62.5319 |
| PLBartForCausalLM | 16 | 0.4722 | 2.9123 | nan | nan | 60.5701 |
| AlbertForMaskedLM | 2 | 1.1281 | 5.8455 | nan | nan | 59.6098 |
| CamemBert | 1 | 1.3774 | 5.9818 | nan | nan | 59.1032 |
| BlenderbotSmallForCausalLM | 64 | 0.6236 | 3.6183 | nan | nan | 58.0221 |
| BertForQuestionAnswering | 64 | 1.352 | 6.2505 | nan | nan | 57.4011 |
| RobertaForQuestionAnswering | 64 | 1.3173 | 6.1126 | nan | nan | 57.1666 |
| BigBird | 1 | 7.2274 | 13.4848 | nan | nan | 56.1339 |
| DistilBertForMaskedLM | 16 | 0.4511 | 3.0616 | nan | nan | 50.8777 |
| AlbertForQuestionAnswering | 2 | 1.2602 | 5.7535 | nan | nan | 47.6169 |
| GoogleFnet | 1 | 0.8012 | 3.1785 | nan | 9.4359 | 39.6472 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | nan | 1.0318 |
| XLNetLMHeadModel | 4 | 1.0001 | 0.8976 | nan | nan | 0.9717 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | 0.9361 |
| BertForQuestionAnswering | 64 | 1.0 | 0.9467 | nan | nan | 0.9354 |
| RobertaForQuestionAnswering | 64 | 1.0 | 0.9467 | nan | nan | 0.9354 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | nan | nan | 0.9339 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | 0.888 |
| T5Small | 1 | 1.0 | 0.9325 | nan | nan | 0.8564 |
| XGLMForCausalLM | 1 | 0.9974 | 0.9999 | nan | nan | 0.8528 |
| DistilBertForQuestionAnswering | 32 | 1.0 | 0.9046 | nan | nan | 0.8394 |
| BartForCausalLM | 2 | 1.0 | 0.8847 | nan | nan | 0.8389 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | nan | nan | 0.8321 |
| BartForConditionalGeneration | 1 | 1.0 | 0.8465 | nan | nan | 0.8244 |
| BigBird | 1 | 0.999 | 0.9542 | nan | nan | 0.822 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.8215 |
| DistillGPT2 | 1 | 0.9984 | 0.7704 | nan | nan | 0.8182 |
| MBartForCausalLM | 16 | 1.0 | 0.8629 | nan | nan | 0.8181 |
| CamemBert | 1 | 0.998 | 0.7977 | nan | nan | 0.8088 |
| DistilBertForMaskedLM | 16 | 0.9998 | 0.9138 | nan | nan | 0.8055 |
| PLBartForCausalLM | 16 | 1.0 | 0.8805 | nan | nan | 0.8028 |
| YituTechConvBert | 1 | 0.9858 | 0.7923 | nan | nan | 0.8025 |
| PegasusForCausalLM | 8 | 0.9778 | 0.9323 | nan | nan | 0.802 |
| MegatronBertForQuestionAnswering | 8 | 0.923 | 0.8265 | nan | nan | 0.7975 |
| MBartForConditionalGeneration | 8 | 1.0 | 0.8136 | nan | nan | 0.7949 |
| RobertaForCausalLM | 4 | 0.9058 | 0.7778 | nan | nan | 0.7882 |
| TrOCRForCausalLM | 8 | 1.0 | 0.8048 | nan | nan | 0.7873 |
| Speech2Text2ForCausalLM | 64 | 0.9565 | 0.8462 | nan | nan | 0.7768 |
| GoogleFnet | 1 | 0.9983 | 0.9453 | nan | 1.0813 | 0.7687 |
| OPTForCausalLM | 4 | 0.9979 | 0.7508 | nan | nan | 0.763 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0 | 0.9036 | nan | nan | 0.7612 |
| PLBartForConditionalGeneration | 8 | 1.0 | 0.8221 | nan | nan | 0.7547 |
| PegasusForConditionalGeneration | 4 | 0.9993 | 0.9002 | nan | nan | 0.7318 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | nan | nan | 0.7277 |
| M2M100ForConditionalGeneration | 2 | 0.9943 | 0.9857 | nan | nan | 0.7268 |
| MegatronBertForCausalLM | 2 | 0.7066 | 0.7066 | nan | nan | 0.7066 |
| AlbertForQuestionAnswering | 2 | 1.0 | 0.9369 | nan | nan | 0.6763 |
| AlbertForMaskedLM | 2 | 0.9999 | 0.9172 | nan | nan | 0.6633 |
| MT5ForConditionalGeneration | 2 | 0.6173 | 0.6173 | nan | nan | 0.6173 |
| ElectraForCausalLM | 1 | 1.0 | 0.9107 | nan | nan | 0.6123 |
| MobileBertForMaskedLM | 16 | 0.9997 | 0.9179 | nan | nan | 0.5861 |
| MobileBertForQuestionAnswering | 32 | 1.0 | 0.9716 | nan | nan | 0.4668 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.352 | nan | 0.4265 |
| DebertaForQuestionAnswering | 4 | 0.9845 | 1.0525 | 0.3276 | nan | 0.3569 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
~~~
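
For anyone who wants to summarize these tables offline, here is a minimal sketch that parses rows in the `| name | bs | ... |` layout and computes a per-backend geometric mean. The helper names `parse_table` and `geomean_speedup` are hypothetical and not part of the benchmark scripts. Skipping `0.0` and `nan` entries is an assumption that they mark runs that did not complete, and the geometric mean is just one reasonable aggregate, not necessarily what the dashboard reports. The sample rows are copied from the huggingface speedup table.

```python
import math


def parse_table(text):
    """Parse '| a | b |' rows into a list of dicts keyed by the header row.

    Border lines such as '+----+----+' are ignored.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(records, backend):
    """Geometric mean of `backend` over rows with a positive, finite value.

    Assumption: 0.0 and nan mean the run failed, so those rows are skipped.
    """
    values = [float(r[backend]) for r in records]
    values = [v for v in values if math.isfinite(v) and v > 0.0]
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Three rows copied from the huggingface "Performance speedup" table.
SAMPLE = """
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
| GoogleFnet | 1 | 1.0042 | 0.8143 | 0.0 | 1.111 | 1.8532 |
| DebertaForMaskedLM | 4 | 0.9294 | 0.8226 | 0.7229 | 0.0 | 1.1154 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"""

records = parse_table(SAMPLE)
for backend in ("eager", "aot_eager", "aot_cudagraphs", "aot_nvfuser", "inductor"):
    print(backend, round(geomean_speedup(records, backend), 4))
```

Because failed runs are dropped rather than counted as zero, the aggregate only describes models that a backend actually ran, so it is best read next to the accuracy table.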

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| res2net50_14w_8s | 2 | 1.0023 | 1.021 | 0.0 | 1.4367 | 5.4417 |
| res2next50 | 2 | 1.0032 | 1.0418 | 0.0 | 1.3826 | 4.7325 |
| hrnet_w18 | 2 | 1.0074 | 1.0893 | 0.0 | 1.4747 | 4.6638 |
| ghostnet_100 | 128 | 0.9989 | 0.9943 | 0.0 | 1.2494 | 1.7883 |
| lcnet_050 | 128 | 0.9562 | 0.9498 | 0.0 | 1.5017 | 1.661 |
| coat_lite_mini | 128 | 1.0 | 0.9983 | 0.0 | 1.0562 | 1.6088 |
| tnt_s_patch16_224 | 64 | 0.9998 | 0.9986 | 0.0 | 1.6019 | 1.541 |
| twins_pcpvt_base | 32 | 1.0051 | 0.9666 | 0.0 | 1.3279 | 1.4973 |
| regnety_002 | 128 | 0.979 | 0.9918 | 0.0 | 1.3592 | 1.4812 |
| dm_nfnet_f0 | 128 | 0.999 | 1.0009 | 0.0 | 1.2127 | 1.4759 |
| xcit_large_24_p8_224 | 5 | 1.0027 | 0.9835 | 0.0 | 0.0 | 1.446 |
| resnest101e | 32 | 1.0034 | 1.0455 | 0.0 | 1.1884 | 1.4273 |
| crossvit_9_240 | 64 | 1.0069 | 0.9985 | 0.0 | 1.0433 | 1.4159 |
| nfnet_l0 | 64 | 0.9994 | 0.795 | 0.0 | 1.0534 | 1.395 |
| volo_d1_224 | 64 | 1.0 | 0.9962 | 0.0 | 1.1314 | 1.3854 |
| dla102 | 64 | 0.9999 | 0.996 | 0.0 | 1.288 | 1.351 |
| gmixer_24_224 | 64 | 1.0001 | 0.8412 | 0.0 | 0.9863 | 1.3503 |
| mobilenetv2_100 | 128 | 0.9665 | 0.9643 | 0.0 | 1.0157 | 1.3378 |
| gluon_inception_v3 | 128 | 1.0 | 0.9989 | 0.0 | 1.126 | 1.3278 |
| inception_v3 | 128 | 0.9999 | 0.999 | 0.0 | 1.1263 | 1.3268 |
| adv_inception_v3 | 128 | 1.0 | 0.9987 | 0.0 | 1.1242 | 1.3254 |
| mobilenetv3_large_100 | 128 | 0.9658 | 0.963 | 0.0 | 1.168 | 1.3179 |
| fbnetv3_b | 128 | 0.9653 | 0.9616 | 0.0 | 1.1356 | 1.2853 |
| sebotnet33ts_256 | 64 | 0.9763 | 0.8073 | 0.0 | 1.0541 | 1.2788 |
| jx_nest_base | 32 | 1.0 | 0.9946 | 0.0 | 1.2115 | 1.2763 |
| cait_m36_384 | 2 | 0.9439 | 0.9721 | 0.0 | 1.0493 | 1.2625 |
| botnet26t_256 | 128 | 0.9852 | 0.9851 | 0.0 | 1.2268 | 1.261 |
| tf_efficientnet_b0 | 128 | 0.9765 | 0.7824 | 0.0 | 0.9852 | 1.2608 |
| mnasnet_100 | 128 | 0.9669 | 0.9629 | 0.0 | 1.1577 | 1.252 |
| fbnetc_100 | 128 | 0.9652 | 0.9638 | 0.0 | 1.1891 | 1.251 |
| selecsls42b | 128 | 0.9999 | 0.9988 | 0.0 | 1.2104 | 1.2462 |
| convit_base | 32 | 0.9996 | 0.9938 | 0.0 | 1.1919 | 1.2416 |
| spnasnet_100 | 128 | 0.9609 | 0.9581 | 0.0 | 1.1374 | 1.239 |
| eca_halonext26ts | 64 | 0.9747 | 0.7754 | 0.0 | 1.0178 | 1.235 |
| eca_botnext26ts_256 | 64 | 0.973 | 0.7702 | 0.0 | 1.0149 | 1.2346 |
| cspdarknet53 | 64 | 0.9587 | 0.9548 | 0.0 | 1.1843 | 1.2296 |
| res2net101_26w_4s | 64 | 1.0 | 0.9975 | 0.0 | 1.1754 | 1.2288 |
| pit_b_224 | 64 | 1.0 | 0.9988 | 0.0 | 1.0547 | 1.2272 |
| ese_vovnet19b_dw | 128 | 0.9797 | 0.9775 | 0.0 | 1.1443 | 1.2259 |
| pnasnet5large | 16 | 0.9998 | 0.9983 | 0.0 | 1.0842 | 1.2113 |
| rexnet_100 | 128 | 0.9726 | 0.8166 | 0.0 | 0.9835 | 1.2025 |
| tinynet_a | 128 | 0.9659 | 0.7756 | 0.0 | 0.9721 | 1.1871 |
| mobilevit_s | 32 | 0.9619 | 0.7651 | 0.0 | 0.9897 | 1.1815 |
| dpn107 | 32 | 0.9575 | 0.9509 | 0.0 | 1.0291 | 1.1804 |
| convnext_base | 32 | 0.9998 | 0.9979 | 0.0 | 1.0451 | 1.175 |
| repvgg_a2 | 128 | 0.9639 | 0.9632 | 0.0 | 1.1224 | 1.172 |
| poolformer_m36 | 64 | 0.9999 | 0.9978 | 0.0 | 0.0 | 1.1689 |
| tf_mixnet_l | 64 | 0.9721 | 0.8768 | 0.0 | 1.0043 | 1.1481 |
| swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9786 | 0.0 | 0.9883 | 1.1446 |
| mixnet_l | 64 | 0.9711 | 0.8729 | 0.0 | 1.0043 | 1.132 |
| beit_base_patch16_224 | 64 | 1.0001 | 0.9823 | 0.0 | 0.9491 | 1.1196 |
| gmlp_s16_224 | 64 | 1.0 | 0.997 | 0.0 | 0.9949 | 1.1169 |
| swsl_resnext101_32x16d | 32 | 0.9998 | 0.9985 | 0.0 | 1.1071 | 1.1109 |
| deit_base_distilled_patch16_224 | 64 | 0.9994 | 0.9979 | 0.0 | 1.0141 | 1.101 |
| vit_base_patch16_224 | 64 | 0.9989 | 0.9986 | 0.0 | 0.9757 | 1.0942 |
| gluon_xception65 | 32 | 0.9999 | 0.9974 | 0.0 | 1.0405 | 1.0872 |
| convmixer_768_32 | 32 | 0.9999 | 1.0 | 0.0 | 1.0624 | 1.0785 |
| gernet_l | 128 | 0.974 | 0.9725 | 0.0 | 1.0996 | 1.0758 |
| mixer_b16_224 | 64 | 0.9998 | 0.9984 | 0.0 | 0.9793 | 1.0614 |
| visformer_small | 128 | 0.9995 | 1.0021 | 0.0 | 1.0216 | 1.0489 |
| resmlp_12_224 | 128 | 0.9998 | 1.0008 | 0.6938 | 0.0 | 1.0361 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
| convnext_base | 2 | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass |
| adv_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| botnet26t_256 | 2 | pass | pass | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | fail_to_run | pass | pass |
| crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass |
| cspdarknet53 | 2 | pass | pass | fail_to_run | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| dla102 | 2 | pass | pass | fail_to_run | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | fail_to_run | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | fail_to_run | pass | pass |
| eca_halonext26ts | 2 | pass | pass | fail_to_run | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | fail_to_run | pass | pass |
| gernet_l | 2 | pass | pass | fail_to_run | pass | pass |
| ghostnet_100 | 2 | pass | pass | fail_to_run | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass |
| hrnet_w18 | 2 | pass | pass | fail_to_run | pass | pass |
| inception_v3 | 2 | pass | pass | fail_to_run | pass | pass |
| lcnet_050 | 2 | pass | pass | fail_to_run | pass | pass |
| mixnet_l | 2 | pass | pass | fail_to_run | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | fail_to_run | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | fail_to_run | pass | pass |
| nfnet_l0 | 2 | pass | pass | fail_to_run | pass | pass |
| pnasnet5large | 2 | pass | pass | fail_to_run | pass | pass |
| regnety_002 | 2 | pass | pass | fail_to_run | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | fail_to_run | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | fail_to_run | pass | pass |
| res2next50 | 2 | pass | pass | fail_to_run | pass | pass |
| rexnet_100 | 2 | pass | pass | fail_to_run | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | fail_to_run | pass | pass |
| selecsls42b | 2 | pass | pass | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | fail_to_run | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | fail_to_run | pass | pass |
| tf_mixnet_l | 2 | pass | pass | fail_to_run | pass | pass |
| tinynet_a | 2 | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| visformer_small | 2 | pass | pass | fail_to_run | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass |
| gluon_xception65 | 2 | pass | pass | fail_to_run | fail_accuracy | pass |
| poolformer_m36 | 2 | pass | pass | fail_to_run | fail_accuracy | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| jx_nest_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| pit_b_224 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| twins_pcpvt_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass |
| fbnetv3_b | 2 | pass | pass | fail_to_run | pass | fail_accuracy |
| resnest101e | 2 | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| twins_pcpvt_base | 32 | 2.1545 | 13.4712 | nan | 44.7514 | 818.6787 |
| coat_lite_mini | 128 | 0.9362 | 5.3097 | nan | 14.297 | 763.735 |
| mobilevit_s | 32 | 1.7381 | 7.53 | nan | 41.8103 | 618.3934 |
| swin_base_patch4_window7_224 | 64 | 2.5412 | 13.4929 | nan | 59.0405 | 569.5724 |
| eca_botnext26ts_256 | 64 | 1.4052 | 5.1523 | nan | 49.1347 | 504.172 |
| eca_halonext26ts | 64 | 1.4655 | 5.5331 | nan | 51.4978 | 468.5572 |
| convnext_base | 32 | 1.4166 | 6.3582 | nan | 21.0058 | 400.2996 |
| jx_nest_base | 32 | 1.7338 | 9.3534 | nan | 58.7419 | 362.9152 |
| sebotnet33ts_256 | 64 | 1.7467 | 6.8392 | nan | 52.1838 | 323.8735 |
| xcit_large_24_p8_224 | 5 | 2.6636 | 17.3057 | nan | nan | 294.3484 |
| hrnet_w18 | 2 | 6.4389 | 33.866 | nan | 198.4745 | 273.3504 |
| crossvit_9_240 | 64 | 1.3803 | 7.9884 | nan | 26.1988 | 271.7885 |
| pnasnet5large | 16 | 4.4888 | 24.1218 | nan | 124.0966 | 264.7909 |
| botnet26t_256 | 128 | 1.4471 | 4.7322 | nan | 41.4332 | 263.6685 |
| rexnet_100 | 128 | 2.0851 | 7.7755 | nan | 103.2984 | 258.9271 |
| cait_m36_384 | 2 | 2.729 | 18.1574 | nan | 45.5959 | 251.3507 |
| resnest101e | 32 | 3.2183 | 17.3906 | nan | 75.5499 | 242.1176 |
| dpn107 | 32 | 4.0859 | 15.2068 | nan | 76.4032 | 235.3065 |
| ghostnet_100 | 128 | 2.9307 | 10.4156 | nan | 60.8427 | 231.0067 |
| volo_d1_224 | 64 | 1.2606 | 8.0126 | nan | 27.3627 | 227.4007 |
| tf_mixnet_l | 64 | 5.7282 | 13.0792 | nan | 62.1662 | 211.3435 |
| inception_v3 | 128 | 1.7092 | 9.6808 | nan | 67.1119 | 202.2545 |
| tinynet_a | 128 | 2.1547 | 8.1749 | nan | 61.5795 | 201.9643 |
| mixnet_l | 64 | 5.616 | 12.6784 | nan | 61.8845 | 194.7373 |
| convit_base | 32 | 0.9878 | 6.0601 | nan | 18.4927 | 193.2586 |
| res2net50_14w_8s | 2 | 3.0033 | 16.417 | nan | 69.1403 | 190.2239 |
| visformer_small | 128 | 0.9399 | 4.1987 | nan | 23.9829 | 187.0458 |
| res2net101_26w_4s | 64 | 3.0979 | 18.0078 | nan | 81.3689 | 184.6844 |
| fbnetv3_b | 128 | 3.1269 | 11.3755 | nan | 76.457 | 170.2994 |
| pit_b_224 | 64 | 0.9264 | 5.1432 | nan | 12.3061 | 168.1234 |
| adv_inception_v3 | 128 | 1.7876 | 9.3248 | nan | 67.9682 | 165.1927 |
| gluon_inception_v3 | 128 | 1.714 | 9.3951 | nan | 68.3431 | 163.2025 |
| tf_efficientnet_b0 | 128 | 2.0104 | 7.3634 | nan | 62.7791 | 159.1046 |
| gmlp_s16_224 | 64 | 1.0287 | 6.1202 | nan | 13.4232 | 156.9587 |
| mobilenetv3_large_100 | 128 | 1.6485 | 5.7583 | nan | 63.7479 | 143.6383 |
| dla102 | 64 | 1.8324 | 10.6917 | nan | 63.422 | 142.5177 |
| tnt_s_patch16_224 | 64 | 1.587 | 10.2362 | nan | 23.4649 | 136.1545 |
| poolformer_m36 | 64 | 1.9209 | 9.6393 | nan | nan | 133.6576 |
| gmixer_24_224 | 64 | 1.1452 | 7.171 | nan | 16.2136 | 132.0786 |
| spnasnet_100 | 128 | 2.1882 | 7.0447 | nan | 44.1093 | 125.4818 |
| cspdarknet53 | 64 | 2.3501 | 7.9329 | nan | 48.8457 | 124.4366 |
| fbnetc_100 | 128 | 2.3825 | 7.3795 | nan | 46.2125 | 124.3139 |
| res2next50 | 2 | 1.8125 | 9.0934 | nan | 42.9215 | 124.1669 |
| mnasnet_100 | 128 | 1.699 | 5.8737 | nan | 37.7042 | 111.3659 |
| mobilenetv2_100 | 128 | 1.7142 | 5.662 | nan | 37.8371 | 110.7728 |
| resmlp_12_224 | 128 | 0.5297 | 2.7314 | 5.7904 | nan | 108.3878 |
| mixer_b16_224 | 64 | 0.5236 | 3.0921 | nan | 10.8647 | 106.6953 |
| gluon_xception65 | 32 | 1.9241 | 11.9308 | nan | 42.924 | 103.912 |
| selecsls42b | 128 | 0.7316 | 4.2612 | nan | 39.9213 | 103.8341 |
| nfnet_l0 | 64 | 1.7172 | 7.7109 | nan | 26.9313 | 102.6439 |
| dm_nfnet_f0 | 128 | 1.9924 | 7.8528 | nan | 29.7799 | 101.0003 |
| regnety_002 | 128 | 1.6675 | 6.1593 | nan | 46.7102 | 98.5359 |
| ese_vovnet19b_dw | 128 | 0.9743 | 3.2654 | nan | 31.3467 | 94.7237 |
| deit_base_distilled_patch16_224 | 64 | 0.8017 | 4.1752 | nan | 10.2984 | 84.1742 |
| swsl_resnext101_32x16d | 32 | 1.8167 | 10.3372 | nan | 40.3351 | 82.2403 |
| gernet_l | 128 | 2.1434 | 6.8841 | nan | 36.1802 | 78.0454 |
| beit_base_patch16_224 | 64 | 1.1627 | 5.4449 | nan | 14.2737 | 75.764 |
| lcnet_050 | 128 | 1.0973 | 3.6336 | nan | 32.0027 | 75.5806 |
| repvgg_a2 | 128 | 2.0646 | 6.6912 | nan | 44.5267 | 70.8502 |
| vit_base_patch16_224 | 64 | 0.7102 | 4.2494 | nan | 9.537 | 65.9388 |
| convmixer_768_32 | 32 | 1.2988 | 6.8908 | nan | 13.9593 | 34.1175 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
| gmixer_24_224 | 64 | 0.9952 | 0.9645 | nan | 0.9825 | 1.3808 |
| tinynet_a | 128 | 0.9942 | 0.7796 | nan | 0.7823 | 1.351 |
| rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 0.8682 | 1.2619 |
| nfnet_l0 | 64 | 0.9948 | 0.8256 | nan | 0.813 | 1.2555 |
| tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.8401 | 1.1889 |
| pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.2062 | 1.1879 |
| mobilevit_s | 32 | 0.9959 | 0.7668 | nan | 0.741 | 1.141 |
| eca_botnext26ts_256 | 64 | 0.9938 | 0.7669 | nan | 0.7642 | 1.1318 |
| eca_halonext26ts | 64 | 0.9938 | 0.768 | nan | 0.7694 | 1.1317 |
| mobilenetv2_100 | 128 | 0.9925 | 0.7621 | nan | 0.7635 | 1.1003 |
| convit_base | 32 | 0.9977 | 0.8861 | nan | 0.9501 | 1.068 |
| poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 |
| dla102 | 64 | 0.9841 | 0.9148 | nan | 0.9504 | 1.0492 |
| ghostnet_100 | 128 | 0.9865 | 0.8768 | nan | 0.9345 | 1.0353 |
| dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.9479 | 1.0219 |
| cait_m36_384 | 2 | 0.9998 | 0.902 | nan | 0.9203 | 1.011 |
| resnest101e | 32 | 0.9972 | 0.9435 | nan | 0.9425 | 0.9914 |
| selecsls42b | 128 | 0.9883 | 0.8896 | nan | 0.8954 | 0.9913 |
| ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | nan | 0.9302 | 0.9886 |
| convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | 0.9793 | 0.9836 |
| fbnetv3_b | 128 | 0.9932 | 0.7828 | nan | 0.784 | 0.9696 |
| tf_mixnet_l | 64 | 0.9956 | 0.8577 | nan | 0.8572 | 0.9695 |
| mixer_b16_224 | 64 | 0.9956 | 0.9574 | nan | 0.8644 | 0.9357 |
| gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8982 | 0.9351 |
| beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.8606 | 0.9272 |
| res2net101_26w_4s | 64 | 0.9968 | 0.9278 | nan | 0.8932 | 0.9269 |
| gmlp_s16_224 | 64 | 0.9958 | 0.9727 | nan | 0.966 | 0.9267 |
| vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | nan | 0.8229 | 0.915 |
| tnt_s_patch16_224 | 64 | 0.9963 | 0.9715 | nan | 0.8518 | 0.9131 |
| volo_d1_224 | 64 | 0.996 | 0.9213 | nan | 0.7472 | 0.9124 |
| xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | nan | nan | 0.912 |
| deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | nan | 0.8242 | 0.9095 |
| dpn107 | 32 | 0.9985 | 0.9271 | nan | 0.8941 | 0.9056 |
| spnasnet_100 | 128 |
0.989 | 0.9109 | nan | 0.8412 | 0.9047 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | nan | 0.8745 | 0.9007 | | visformer_small | 128 | 0.9943 | 0.9381 | nan | 0.9475 | 0.9006 | | convnext_base | 32 | 0.998 | 0.9059 | nan | 0.7678 | 0.9006 | | mixnet_l | 64 | 0.995 | 0.8449 | nan | 0.7907 | 0.8995 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | nan | 0.8279 | 0.8961 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | | lcnet_050 | 128 | 0.9672 | 0.7521 | nan | 0.7524 | 0.8921 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | nan | 0.8762 | 0.8835 | | twins_pcpvt_base | 32 | 0.9971 | 0.9101 | nan | 0.8351 | 0.8722 | | regnety_002 | 128 | 0.9717 | 0.8104 | nan | 0.7599 | 0.8617 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | nan | 0.745 | 0.8605 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | 0.83 | 0.8585 | | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | 0.7112 | 0.8575 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | nan | 0.7446 | 0.8416 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | 0.6831 | 0.841 | | res2net50_14w_8s | 2 | 0.9976 | 0.837 | nan | 0.8458 | 0.8293 | | res2next50 | 2 | 0.9972 | 0.8331 | nan | 0.841 | 0.821 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | nan | 0.8169 | | crossvit_9_240 | 64 | 0.9886 | 0.8633 | nan | 0.729 | 0.8063 | | gernet_l | 128 | 0.9884 | 0.7892 | nan | 0.7938 | 0.7928 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 0.6417 | 0.792 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | nan | 0.7873 | 0.7899 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | nan | 0.6573 | 0.7684 | | hrnet_w18 | 2 | 0.9947 | 0.8779 | nan | 0.8833 | 0.6735 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/NFpK3Sn.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/PmYpmBl.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/5TnyQ5g.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint compression ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When computing speedup, compilation latency, and memory footprint reduction, we exclude the models that fail the accuracy check for the given backend.
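
The filtering and normalization described above could be sketched as follows. This is a minimal illustration, not the benchmark harness's actual code: the record layout, the field names, the `speedups` helper, and the choice to treat `pass_due_to_skip` as passing are all assumptions made for the example.

```python
# Hypothetical per-model records; every field name here is illustrative only.
# Assumption: a skipped accuracy check counts as passing.
PASSING = {"pass", "pass_due_to_skip"}


def speedups(records, backend):
    """Return {model: native_latency / backend_latency} for passing models."""
    result = {}
    for rec in records:
        # Drop models that did not pass the accuracy check for this backend.
        if rec["accuracy"].get(backend) not in PASSING:
            continue
        # Speedup is normalized against native PyTorch latency.
        result[rec["name"]] = rec["native_ms"] / rec["latency_ms"][backend]
    return result


records = [
    {"name": "model_a", "native_ms": 10.0,
     "latency_ms": {"inductor": 5.0}, "accuracy": {"inductor": "pass"}},
    {"name": "model_b", "native_ms": 8.0,
     "latency_ms": {"inductor": 4.0}, "accuracy": {"inductor": "fail_accuracy"}},
]

print(speedups(records, "inductor"))
```

Only `model_a` survives the filter here, so aggregate numbers for a backend are computed over the models that backend handles correctly.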

Passrate

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      | 94%, 50/53 | 98%, 42/43  | 100%, 61/61 |
|   aot_eager    | 94%, 50/53 | 98%, 42/43  | 90%, 55/61  |
| aot_cudagraphs | 26%, 14/53 |  0%, 0/43   |  11%, 7/61  |
|  aot_nvfuser   | 60%, 32/53 |  0%, 0/43   | 75%, 46/61  |
|    inductor    | 83%, 44/53 | 93%, 40/43  | 93%, 57/61  |
+----------------+------------+-------------+-------------+
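
Each cell above pairs a percentage with the raw counts. A small sketch of that formatting, assuming the percentage is rounded to the nearest integer (the helper name `format_passrate` is hypothetical):

```python
def format_passrate(passed, total):
    """Format a pass rate as 'NN%, passed/total'."""
    # Guard against an empty suite; assumption: report 0% in that case.
    percent = round(100 * passed / total) if total else 0
    return f"{percent}%, {passed}/{total}"


print(format_passrate(50, 53))
print(format_passrate(0, 43))
```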

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.00x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.09x    |    0.0x     |    1.00x    |
|  aot_nvfuser   |   1.16x    |    0.0x     |    1.20x    |
|    inductor    |   1.84x    |    2.29x    |    1.55x    |
+----------------+------------+-------------+-------------+
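
For reference, a geometric mean over per-model speedups can be computed as below. This is a sketch under stated assumptions: entries of `0.0` (models that failed for a backend) are skipped, and `0.0` is returned when no model remains, which would match the `0.0x` cells above. The function name is hypothetical and the dashboard's own aggregation code may differ.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of positive speedups; 0.0 if none are valid."""
    # Assumption: 0.0 and NaN mark failed models and are excluded.
    valid = [s for s in speedups if not math.isnan(s) and s > 0]
    if not valid:
        return 0.0
    # exp(mean(log(x))) is the geometric mean.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


print(geomean_speedup([2.0, 8.0, 0.0]))
```

The geometric mean is the usual choice for averaging ratios, because a 2x speedup and a 0.5x slowdown cancel out to 1x.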

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    1.94    |    2.55     |    2.30     |
|   aot_eager    |    8.04    |    12.73    |    11.51    |
| aot_cudagraphs |    6.98    |     0.0     |    52.51    |
|  aot_nvfuser   |   27.44    |     0.0     |    71.07    |
|    inductor    |   139.38   |   117.39    |   262.93    |
+----------------+------------+-------------+-------------+
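
The per-model tables use `nan` for models that did not compile on a backend. A minimal sketch of an arithmetic mean that skips those entries, assuming `0.0` is reported when nothing remains (as in the `0.0` cells above); the helper name is hypothetical:

```python
import math


def mean_compile_time(latencies):
    """Arithmetic mean of compile times in seconds, ignoring NaN entries."""
    valid = [t for t in latencies if not math.isnan(t)]
    # Assumption: a backend with no valid measurement reports 0.0.
    return sum(valid) / len(valid) if valid else 0.0


print(mean_compile_time([2.0, float("nan"), 4.0]))
```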

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    0.99x    |
|   aot_eager    |   0.85x    |    0.87x    |    0.87x    |
| aot_cudagraphs |   0.43x    |    0.0x     |    0.20x    |
|  aot_nvfuser   |   0.83x    |    0.0x     |    0.85x    |
|    inductor    |   0.83x    |    0.86x    |    0.94x    |
+----------------+------------+-------------+-------------+
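
The post does not spell out how the compression ratio is defined. One reading consistent with "higher is better" is native peak memory divided by the backend's peak memory; the sketch below assumes that definition, and the function name and byte counts are illustrative only.

```python
def compression_ratio(native_peak_bytes, backend_peak_bytes):
    """Assumed definition: native peak memory / backend peak memory."""
    return native_peak_bytes / backend_peak_bytes


# A backend that needs more memory than native PyTorch scores below 1.0.
print(compression_ratio(8, 10))
# A backend that needs less memory scores above 1.0.
print(compression_ratio(10, 8))
```

Under this reading, a value of 0.83x means the backend's peak footprint is roughly 1.2 times that of native PyTorch.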

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | densenet121 | 4 | 1.0016 | 0.9128 | 0.0 | 1.4046 | 6.0578 | | functorch_dp_cifar10 | 64 | 0.9984 | 0.9128 | 0.0 | 1.1897 | 4.8757 | | timm_efficientdet | 1 | 0.9866 | 0.8056 | 0.0 | 0.0 | 4.6029 | | resnext50_32x4d | 8 | 1.003 | 0.9596 | 0.0 | 1.3319 | 3.465 | | BERT_pytorch | 16 | 1.0131 | 0.8196 | 0.0 | 0.0 | 3.2081 | | resnet18 | 16 | 1.003 | 0.9918 | 0.0 | 1.3416 | 3.1562 | | timm_vision_transformer | 8 | 1.0071 | 0.8362 | 0.0 | 1.359 | 3.0948 | | mobilenet_v3_large | 32 | 1.0045 | 1.0038 | 0.0 | 1.4187 | 3.0025 | | drq | 1 | 1.0032 | 0.7825 | 0.0 | 1.0929 | 2.9519 | | mnasnet1_0 | 32 | 0.9992 | 1.0109 | 0.9029 | 1.3975 | 2.8887 | | dcgan | 32 | 0.9918 | 0.9024 | 1.1617 | 0.7299 | 2.683 | | hf_T5_large | 2 | 1.0217 | 0.8511 | 0.0 | 0.0 | 2.4185 | | squeezenet1_1 | 32 | 0.9979 | 0.9596 | 1.3413 | 1.1839 | 2.4129 | | hf_Albert | 8 | 1.0004 | 0.954 | 0.0 | 0.0 | 2.3793 | | hf_Bert | 4 | 1.0321 | 0.8543 | 0.0 | 0.0 | 2.2348 | | timm_efficientnet | 32 | 0.9653 | 0.8108 | 0.0 | 1.1785 | 2.1117 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9963 | 0.9087 | 1.3033 | 1.2143 | 2.0914 | | lennard_jones | 1000 | 0.9734 | 0.7478 | 1.2849 | 1.0403 | 2.0222 | | pytorch_struct | 200 | 1.0015 | 0.7365 | 1.0253 | 1.0181 | 1.9967 | | timm_resnest | 32 | 1.0037 | 1.0179 | 0.0 | 1.3234 | 1.9188 | | hf_GPT2 | 4 | 1.0211 | 0.984 | 0.0 | 0.0 | 1.8585 | | LearningToPaint | 96 | 1.0045 | 1.0015 | 0.0 | 1.365 | 1.8496 | | hf_T5 | 8 | 1.0 | 0.9452 | 0.0 | 0.0 | 1.8407 | | resnet50 | 32 | 1.0013 | 1.0069 | 0.0 | 1.3706 | 1.736 | | hf_Bart | 4 | 1.0161 | 0.8234 | 0.0 | 0.0 | 1.7327 | | shufflenet_v2_x1_0 | 128 | 0.9989 | 1.0207 | 0.0 | 1.3415 | 1.7108 | | 
attention_is_all_you_need_pytorch | 256 | 1.007 | 0.8955 | 0.0 | 0.0 | 1.654 | | mobilenet_v2 | 96 | 0.9999 | 1.0149 | 0.0 | 0.925 | 1.5593 | | hf_DistilBert | 8 | 1.0015 | 0.9712 | 0.0 | 0.0 | 1.5318 | | timm_nfnet | 128 | 0.9993 | 0.999 | 0.0 | 1.1732 | 1.4948 | | soft_actor_critic | 256 | 1.0007 | 0.7214 | 1.3052 | 1.071 | 1.4778 | | fastNLP_Bert | 6 | 0.9985 | 0.8814 | 0.0 | 0.0 | 1.4658 | | timm_regnet | 32 | 0.981 | 0.9283 | 0.0 | 1.1835 | 1.4191 | | pytorch_unet | 1 | 0.9993 | 0.9927 | 0.0 | 1.1557 | 1.3436 | | pytorch_stargan | 16 | 1.0001 | 1.0162 | 0.8287 | 1.102 | 1.3391 | | Super_SloMo | 6 | 1.0 | 0.9959 | 0.0 | 0.0 | 1.2901 | | timm_vovnet | 32 | 0.9207 | 0.8885 | 0.0 | 1.1291 | 1.2719 | | vgg16 | 64 | 0.9995 | 0.9972 | 0.797 | 0.9942 | 1.2699 | | Background_Matting | 4 | 0.9996 | 1.0184 | 0.0 | 1.1152 | 1.2249 | | alexnet | 128 | 0.9991 | 0.9974 | 0.7883 | 1.0033 | 1.2095 | | timm_vision_transformer_large | 8 | 1.0 | 0.9907 | 0.0 | 0.9931 | 1.161 | | hf_Reformer | 4 | 0.9948 | 0.9994 | 0.9196 | 0.0 | 1.1582 | | hf_BigBird | 2 | 0.9939 | 0.912 | 0.0 | 0.0 | 1.152 | | yolov3 | 16 | 0.9998 | 0.9903 | 0.0 | 0.9207 | 1.103 | | tts_angular | 64 | 0.9834 | 0.9287 | 0.9751 | 0.9939 | 1.0307 | | demucs | 4 | 0.999 | 1.0007 | 1.0007 | 1.0011 | 0.999 | | nvidia_deeprecommender | 256 | 0.999 | 0.9962 | 0.6963 | 0.9792 | 0.9887 | | dlrm | 2048 | 1.0007 | 1.114 | 0.0 | 0.0 | 0.0 | | hf_GPT2_large | 4 | 1.0006 | 0.99 | 0.0 | 0.0 | 0.0 | | speech_transformer | 32 | 1.0278 | 0.8292 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | 
inductor | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | alexnet | 2 | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | fail_to_run | pass | pass | | LearningToPaint | 2 | pass | pass | fail_to_run | pass | pass | | densenet121 | 2 | pass | pass | fail_to_run | pass | pass | | drq | 1 | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenet_v2 | 2 | pass | pass | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | fail_to_run | pass | pass | | resnet18 | 2 | pass | pass | fail_to_run | pass | pass | | resnet50 | 2 | pass | pass | fail_to_run | pass | pass | | resnext50_32x4d | 2 | pass | pass | fail_to_run | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_nfnet | 2 | pass | 
pass | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | fail_to_run | pass | pass | | timm_vovnet | 2 | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | Super_SloMo | 2 | pass | pass | fail_to_run | fail_to_run | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | | fastNLP_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Albert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_Bert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_BigBird | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_DistilBert | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_GPT2 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | | yolov3 | 2 | pass | pass | fail_to_run | fail_to_run | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | mobilenet_v3_large | 2 | pass | pass | fail_to_run | pass | fail_accuracy | | tts_angular | 2 | pass | pass | pass | pass | 0.0000 | 
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+---------+-----------+----------------+-------------+----------+ | yolov3 | 16 | 3.1052 | 11.2125 | nan | 41.2251 | 906.3816 | | timm_efficientdet | 1 | 20.1053 | 44.9842 | nan | nan | 897.6359 | | hf_T5_large | 2 | 13.096 | 48.6466 | nan | nan | 593.4358 | | densenet121 | 4 | 2.5514 | 17.707 | nan | 130.9259 | 366.322 | | attention_is_all_you_need_pytorch | 256 | 1.2729 | 9.3039 | nan | nan | 266.2822 | | timm_resnest | 32 | 0.645 | 3.6638 | nan | 42.8781 | 233.3464 | | timm_efficientnet | 32 | 1.9662 | 8.6792 | nan | 70.0011 | 207.0271 | | timm_vision_transformer | 8 | 0.9339 | 5.8408 | nan | 13.9729 | 204.3233 | | timm_vision_transformer_large | 8 | 2.9844 | 19.3545 | nan | 38.5976 | 189.3326 | | BERT_pytorch | 16 | 1.7096 | 10.1075 | nan | nan | 177.9434 | | mobilenet_v3_large | 32 | 1.0609 | 6.7853 | nan | 72.766 | 176.0954 | | mnasnet1_0 | 32 | 0.983 | 6.3088 | 41.4888 | 44.2248 | 166.2427 | | hf_T5 | 8 | 2.1692 | 10.9186 | nan | nan | 157.1346 | | mobilenet_v2 | 96 | 0.9477 | 6.7219 | nan | 41.471 | 149.8098 | | fastNLP_Bert | 6 | 1.7424 | 9.0851 | nan | nan | 149.2335 | | pytorch_stargan | 16 | 0.4613 | 2.9484 | 11.3102 | 7.2417 | 148.0754 | | hf_Bart | 4 | 1.7379 | 11.0254 | nan | nan | 147.5629 | | hf_GPT2 | 4 | 1.4723 | 7.7773 | nan | nan | 138.5882 | | timm_vovnet | 32 | 1.6255 | 5.9768 | nan | 31.1606 | 127.3594 | | pytorch_struct | 200 | 0.2725 | 1.134 | 1.789 | 5.3349 | 127.1456 | | timm_regnet | 32 | 2.5147 | 11.0602 | nan | 61.0366 | 123.1854 | | resnext50_32x4d | 8 | 1.0525 | 7.2686 | nan | 36.9586 | 115.97 | | shufflenet_v2_x1_0 | 128 | 1.1291 | 7.4268 
| nan | 38.3192 | 111.0196 | | Super_SloMo | 6 | 1.0535 | 6.0007 | nan | nan | 110.8918 | | timm_nfnet | 128 | 2.0271 | 8.8473 | nan | 37.7776 | 110.1183 | | Background_Matting | 4 | 1.0368 | 6.4563 | nan | 42.5834 | 98.0044 | | resnet50 | 32 | 1.0195 | 6.7185 | nan | 41.8437 | 90.5664 | | resnet18 | 16 | 0.4796 | 2.575 | nan | 23.3127 | 88.3143 | | functorch_dp_cifar10 | 64 | 0.3993 | 2.5417 | nan | 6.4062 | 86.9469 | | hf_Albert | 8 | 1.3041 | 8.14 | nan | nan | 85.9413 | | hf_Reformer | 4 | 2.5032 | 5.3906 | 13.8552 | nan | 78.5392 | | hf_Bert | 4 | 1.6042 | 8.8371 | nan | nan | 78.0438 | | pytorch_unet | 1 | 0.5186 | 2.9036 | nan | 26.2702 | 70.4316 | | hf_BigBird | 2 | 8.0991 | 16.5773 | nan | nan | 69.7388 | | LearningToPaint | 96 | 0.5087 | 2.6689 | nan | 30.5027 | 67.9579 | | hf_DistilBert | 8 | 0.6096 | 4.1638 | nan | nan | 54.722 | | squeezenet1_1 | 32 | 0.264 | 1.4091 | 6.414 | 6.5055 | 46.4889 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4526 | 2.8756 | 11.943 | 4.8348 | 42.8301 | | vgg16 | 64 | 0.1981 | 0.9476 | 4.0166 | 3.5663 | 35.4152 | | alexnet | 128 | 0.1666 | 0.6006 | 1.9004 | 3.1682 | 28.0079 | | drq | 1 | 0.1686 | 0.6502 | nan | 4.3912 | 25.5532 | | dcgan | 32 | 0.1826 | 0.5737 | 1.8794 | 4.284 | 20.5771 | | nvidia_deeprecommender | 256 | 0.2133 | 0.6182 | 0.9681 | 2.9238 | 16.3507 | | soft_actor_critic | 256 | 0.2111 | 0.4295 | 0.739 | 2.043 | 12.9604 | | lennard_jones | 1000 | 0.1575 | 0.4393 | 0.6231 | 1.5084 | 8.2622 | | tts_angular | 64 | 0.2296 | 0.2948 | 0.4243 | 1.0492 | 4.2834 | | demucs | 4 | 0.3536 | 0.3523 | 0.3482 | 0.3553 | 0.2593 | | hf_GPT2_large | 4 | 5.4574 | 24.7197 | nan | nan | nan | | speech_transformer | 32 | 1.9382 | 11.4142 | nan | nan | nan | | dlrm | 2048 | 0.48 | 1.0459 | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | | tacotron2 | 0 | nan | nan | nan | nan | nan | 
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ | timm_efficientnet | 32 | 0.988 | 0.7698 | nan | 0.7887 | 1.2758 | | hf_Albert | 8 | 0.9814 | 0.936 | nan | nan | 1.1576 | | Super_SloMo | 6 | 1.0024 | 0.9645 | nan | nan | 1.0536 | | timm_nfnet | 128 | 0.9693 | 0.8982 | nan | 0.9445 | 1.0337 | | timm_efficientdet | 1 | 1.028 | 0.8404 | nan | nan | 1.0226 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | nan | 0.9117 | 1.0074 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0002 | 0.9895 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | nan | nan | 0.9829 | | BERT_pytorch | 16 | 1.0 | 0.8825 | nan | nan | 0.9721 | | hf_GPT2 | 4 | 0.9706 | 0.8625 | nan | nan | 0.9648 | | hf_T5 | 8 | 0.9678 | 0.9371 | nan | nan | 0.9309 | | timm_regnet | 32 | 0.9953 | 0.8446 | nan | 0.85 | 0.9249 | | Background_Matting | 4 | 1.0138 | 0.9624 | nan | 0.9813 | 0.9245 | | yolov3 | 16 | 0.9908 | 0.8381 | nan | 0.8244 | 0.9059 | | hf_Bert | 4 | 0.9844 | 0.8677 | nan | nan | 0.9017 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.999 | 0.8735 | 0.2638 | 0.8441 | 0.8861 | | timm_vision_transformer_large | 8 | 0.9973 | 0.8357 | nan | 0.8494 | 0.879 | | timm_resnest | 32 | 0.9868 | 0.8711 | nan | 0.8623 | 0.8756 | | densenet121 | 4 | 0.9857 | 0.8678 | nan | 0.8376 | 0.8753 | | pytorch_unet | 1 | 0.9968 | 0.8653 | nan | 0.8496 | 0.8678 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | nan | nan | 0.8661 | | resnet50 | 32 | 0.9907 | 0.8629 | nan | 0.7995 | 0.8652 | | squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.2916 | 0.7589 | 0.8611 | | shufflenet_v2_x1_0 | 128 | 
0.956 | 0.8401 | nan | 0.8503 | 0.856 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | | hf_DistilBert | 8 | 0.9505 | 0.8806 | nan | nan | 0.8387 | | timm_vovnet | 32 | 0.9903 | 0.7678 | nan | 0.7742 | 0.8352 | | dcgan | 32 | 0.9698 | 0.7838 | 0.3394 | 0.7073 | 0.8283 | | hf_Bart | 4 | 0.9102 | 0.8321 | nan | nan | 0.8137 | | hf_BigBird | 2 | 0.9837 | 0.9784 | nan | nan | 0.8098 | | alexnet | 128 | 0.951 | 0.7753 | 0.4257 | 0.7753 | 0.7974 | | mobilenet_v3_large | 32 | 0.9776 | 0.8499 | nan | 0.866 | 0.7918 | | pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.2147 | 0.8882 | 0.7783 | | resnext50_32x4d | 8 | 0.9932 | 0.8549 | nan | 0.8176 | 0.7644 | | mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.1723 | 0.8207 | 0.7541 | | drq | 1 | 0.9877 | 0.8312 | nan | 0.8308 | 0.752 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.2971 | 0.7172 | 0.7491 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4736 | 0.9149 | 0.7295 | | LearningToPaint | 96 | 0.9252 | 0.7196 | nan | 0.6722 | 0.7295 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | nan | 0.8871 | 0.7151 | | resnet18 | 16 | 0.9779 | 0.7727 | nan | 0.7276 | 0.6102 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 1.0967 | 0.564 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5121 | 0.5596 | 0.5596 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | nan | 0.8452 | 0.4478 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5082 | 0.4235 | | hf_Reformer | 4 | 0.3764 | 0.9847 | 0.2529 | nan | 0.3629 | | speech_transformer | 32 | 1.0017 | 0.9174 | nan | nan | nan | | hf_GPT2_large | 4 | 0.9582 | 0.8645 | nan | nan | nan | | dlrm | 2048 | 0.7301 | 0.7306 | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | | tacotron2 | 0 | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+ ~~~

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | ElectraForCausalLM | 1 | 1.0382 | 0.8408 | 0.0 | 0.0 | 7.2622 | | MobileBertForMaskedLM | 16 | 1.0136 | 0.8308 | 0.0 | 0.0 | 5.8275 | | MT5ForConditionalGeneration | 2 | 1.0195 | 0.8543 | 0.0 | 0.0 | 5.6799 | | MobileBertForQuestionAnswering | 32 | 1.018 | 0.8123 | 0.0 | 0.0 | 5.2174 | | YituTechConvBert | 1 | 1.024 | 0.8352 | 0.0 | 0.0 | 5.0508 | | MegatronBertForCausalLM | 2 | 1.0336 | 0.8383 | 0.0 | 0.0 | 4.828 | | OPTForCausalLM | 4 | 1.0147 | 0.8186 | 0.0 | 0.0 | 4.3716 | | RobertaForCausalLM | 4 | 1.0349 | 0.8428 | 0.0 | 0.0 | 4.0103 | | M2M100ForConditionalGeneration | 2 | 1.0062 | 0.8521 | 0.0 | 0.0 | 3.8744 | | CamemBert | 1 | 1.0425 | 0.8424 | 0.0 | 0.0 | 3.54 | | PegasusForConditionalGeneration | 4 | 1.0079 | 0.8177 | 0.0 | 0.0 | 3.2337 | | XGLMForCausalLM | 1 | 1.0148 | 0.8016 | 0.0 | 0.0 | 3.1114 | | DistillGPT2 | 1 | 1.0312 | 0.8664 | 0.0 | 0.0 | 2.9666 | | PLBartForConditionalGeneration | 8 | 1.0149 | 0.8248 | 0.0 | 0.0 | 2.8226 | | MegatronBertForQuestionAnswering | 8 | 1.0341 | 0.8481 | 0.0 | 0.0 | 2.702 | | MBartForConditionalGeneration | 8 | 1.0145 | 0.826 | 0.0 | 0.0 | 2.3644 | | Speech2Text2ForCausalLM | 64 | 1.0102 | 0.8122 | 0.0 | 0.0 | 2.3247 | | DistilBertForMaskedLM | 16 | 1.029 | 0.8565 | 0.0 | 0.0 | 2.171 | | GPT2ForSequenceClassification | 4 | 1.0014 | 0.9751 | 0.0 | 0.0 | 2.1475 | | ElectraForQuestionAnswering | 64 | 1.0003 | 0.9711 | 0.0 | 0.0 | 1.953 | | TrOCRForCausalLM | 8 | 1.0151 | 0.8234 | 0.0 | 0.0 | 1.925 | | DistilBertForQuestionAnswering | 32 | 1.0287 | 0.841 | 0.0 | 0.0 | 1.857 | | PegasusForCausalLM | 8 | 1.0082 | 0.7985 | 0.0 | 0.0 | 1.8519 | | BlenderbotSmallForConditionalGeneration | 32 | 1.0144 | 
0.8904 | 0.0 | 0.0 | 1.8108 | | BartForConditionalGeneration | 1 | 1.0155 | 0.8249 | 0.0 | 0.0 | 1.7844 | | LayoutLMForSequenceClassification | 16 | 1.0004 | 0.9795 | 0.0 | 0.0 | 1.7373 | | AlbertForQuestionAnswering | 2 | 1.0005 | 0.8085 | 0.0 | 0.0 | 1.6598 | | PLBartForCausalLM | 16 | 1.0139 | 0.9442 | 0.0 | 0.0 | 1.6508 | | AlbertForMaskedLM | 2 | 1.0004 | 0.8105 | 0.0 | 0.0 | 1.6425 | | T5Small | 1 | 1.0266 | 0.8772 | 0.0 | 0.0 | 1.6369 | | T5ForConditionalGeneration | 4 | 0.9987 | 0.9338 | 0.0 | 0.0 | 1.6149 | | XLNetLMHeadModel | 4 | 0.9998 | 0.9633 | 0.0 | 0.0 | 1.5997 | | LayoutLMForMaskedLM | 16 | 1.0002 | 0.9711 | 0.0 | 0.0 | 1.5866 | | MBartForCausalLM | 16 | 1.0116 | 0.8169 | 0.0 | 0.0 | 1.5186 | | BartForCausalLM | 2 | 1.0024 | 0.9638 | 0.0 | 0.0 | 1.4734 | | DebertaForQuestionAnswering | 4 | 0.9316 | 0.7278 | 0.9356 | 0.0 | 1.4577 | | RobertaForQuestionAnswering | 64 | 1.0004 | 0.9687 | 0.0 | 0.0 | 1.4464 | | BertForQuestionAnswering | 64 | 1.0008 | 0.9696 | 0.0 | 0.0 | 1.4345 | | BertForMaskedLM | 64 | 1.0 | 0.9584 | 0.0 | 0.0 | 1.3189 | | BlenderbotSmallForCausalLM | 64 | 1.0021 | 0.9266 | 0.0 | 0.0 | 1.3083 | | DebertaForMaskedLM | 4 | 0.9353 | 0.7501 | 0.7988 | 0.0 | 1.222 | | BigBird | 1 | 0.9794 | 0.91 | 0.0 | 0.0 | 1.1384 | | AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+ | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | 
pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BigBird | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | CamemBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DistilBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DistillGPT2 | 1 | pass | pass | fail_to_run | fail_to_run | pass | | ElectraForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | ElectraForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | | LayoutLMForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MegatronBertForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | PLBartForCausalLM | 1 | pass | pass | fail_to_run | 
fail_to_run | pass | | PegasusForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | PegasusForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | RobertaForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | RobertaForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | | T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | | TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | | XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | | YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | | DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | XLNetLMHeadModel | 4 | 3.8112 | 24.3247 | nan | nan | 316.035 | | M2M100ForConditionalGeneration | 2 | 3.5068 | 20.8405 | nan | nan | 206.999 | | MT5ForConditionalGeneration | 2 | 3.3885 | 15.9549 | nan | nan | 197.5786 | | YituTechConvBert | 1 | 2.4628 | 13.4776 | nan | nan | 197.3318 | | 
T5ForConditionalGeneration | 4 | 2.1304 | 10.8851 | nan | nan | 191.296 | | MobileBertForMaskedLM | 16 | 9.0497 | 40.5762 | nan | nan | 188.2942 | | MobileBertForQuestionAnswering | 32 | 9.1376 | 40.6477 | nan | nan | 170.5566 | | DebertaForMaskedLM | 4 | 4.8092 | 12.4543 | 53.8282 | nan | 166.2857 | | PegasusForConditionalGeneration | 4 | 3.3765 | 22.5318 | nan | nan | 164.7196 | | XGLMForCausalLM | 1 | 2.7194 | 17.0177 | nan | nan | 163.8919 | | T5Small | 1 | 2.1227 | 10.9165 | nan | nan | 160.2439 | | MBartForConditionalGeneration | 8 | 3.4787 | 21.8606 | nan | nan | 158.4846 | | BartForConditionalGeneration | 1 | 3.3831 | 21.2971 | nan | nan | 151.772 | | MegatronBertForCausalLM | 2 | 3.4752 | 17.8153 | nan | nan | 149.5003 | | MegatronBertForQuestionAnswering | 8 | 3.4793 | 17.5881 | nan | nan | 144.6857 | | PLBartForConditionalGeneration | 8 | 1.7497 | 11.0365 | nan | nan | 143.4586 | | DebertaForQuestionAnswering | 4 | 4.9493 | 12.5325 | 53.101 | nan | 130.9367 | | BlenderbotSmallForConditionalGeneration | 32 | 2.1805 | 14.2719 | nan | nan | 129.4778 | | RobertaForCausalLM | 4 | 1.6853 | 9.0665 | nan | nan | 105.0382 | | LayoutLMForSequenceClassification | 16 | 1.7694 | 9.1761 | nan | nan | 99.2086 | | PegasusForCausalLM | 8 | 1.3071 | 7.9972 | nan | nan | 93.7895 | | ElectraForQuestionAnswering | 64 | 1.6498 | 8.9657 | nan | nan | 92.0329 | | OPTForCausalLM | 4 | 1.3525 | 8.2425 | nan | nan | 90.2333 | | LayoutLMForMaskedLM | 16 | 1.7927 | 9.2895 | nan | nan | 90.1629 | | MBartForCausalLM | 16 | 1.2233 | 8.0907 | nan | nan | 89.171 | | BertForMaskedLM | 64 | 1.5432 | 8.6537 | nan | nan | 87.2679 | | GPT2ForSequenceClassification | 4 | 1.4535 | 7.926 | nan | nan | 84.9463 | | BartForCausalLM | 2 | 1.2831 | 7.99 | nan | nan | 82.7793 | | AlbertForMaskedLM | 2 | 1.5781 | 8.5461 | nan | nan | 77.6663 | | ElectraForCausalLM | 1 | 1.667 | 8.973 | nan | nan | 76.0295 | | DistilBertForQuestionAnswering | 32 | 0.6709 | 4.26 | nan | nan | 73.8278 | | TrOCRForCausalLM 
| 8 | 1.277 | 8.0625 | nan | nan | 73.0386 | | BigBird | 1 | 7.9688 | 16.8522 | nan | nan | 72.2757 | | PLBartForCausalLM | 16 | 0.6355 | 4.1232 | nan | nan | 70.9095 | | CamemBert | 1 | 1.6402 | 8.8517 | nan | nan | 69.8196 | | BlenderbotSmallForCausalLM | 64 | 0.7938 | 5.6202 | nan | nan | 68.5483 | | RobertaForQuestionAnswering | 64 | 1.5947 | 8.7628 | nan | nan | 68.5105 | | Speech2Text2ForCausalLM | 64 | 0.6924 | 4.1727 | nan | nan | 67.245 | | DistillGPT2 | 1 | 0.7462 | 3.9659 | nan | nan | 65.2046 | | BertForQuestionAnswering | 64 | 1.6715 | 8.6885 | nan | nan | 61.8158 | | DistilBertForMaskedLM | 16 | 0.6112 | 4.2202 | nan | nan | 56.1078 | | AlbertForQuestionAnswering | 2 | 1.4775 | 8.302 | nan | nan | 50.5431 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ | GPT2ForSequenceClassification | 4 | 0.9675 | 0.9164 | nan | nan | 1.0779 | | BartForCausalLM | 2 | 1.0 | 0.8769 | nan | nan | 1.0442 | | XLNetLMHeadModel | 4 | 0.9912 | 0.8791 | nan | nan | 1.0109 | | LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | nan | nan | 1.0056 | | T5ForConditionalGeneration | 4 | 0.9996 | 0.9594 | nan | nan | 0.995 | | RobertaForQuestionAnswering | 64 | 0.9996 | 0.9315 | nan | nan | 0.9946 | | BertForQuestionAnswering | 64 | 0.9995 | 0.9315 | nan | nan | 0.9946 | | ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | nan | nan | 0.9938 | | BartForConditionalGeneration | 1 | 1.0 | 0.8619 | nan | nan | 0.9894 | | T5Small | 1 | 1.0 | 0.9124 | nan | nan | 0.9874 | | LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | nan | 
nan | 0.9871 | | BertForMaskedLM | 64 | 0.9996 | 0.899 | nan | nan | 0.9811 | | MBartForCausalLM | 16 | 1.0 | 0.8398 | nan | nan | 0.9567 | | BlenderbotSmallForConditionalGeneration | 32 | 0.9998 | 0.8996 | nan | nan | 0.9557 | | Speech2Text2ForCausalLM | 64 | 0.969 | 0.8488 | nan | nan | 0.9452 | | PLBartForCausalLM | 16 | 1.0001 | 0.8666 | nan | nan | 0.9395 | | BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | nan | nan | 0.9269 | | DistilBertForMaskedLM | 16 | 0.9986 | 0.8686 | nan | nan | 0.9164 | | AlbertForQuestionAnswering | 2 | 1.0 | 0.6451 | nan | nan | 0.9124 | | AlbertForMaskedLM | 2 | 1.0 | 0.6364 | nan | nan | 0.8977 | | MBartForConditionalGeneration | 8 | 0.9999 | 0.8187 | nan | nan | 0.8861 | | TrOCRForCausalLM | 8 | 1.0 | 0.7955 | nan | nan | 0.8774 | | CamemBert | 1 | 0.9989 | 0.7872 | nan | nan | 0.8654 | | DistilBertForQuestionAnswering | 32 | 0.9992 | 0.8965 | nan | nan | 0.8639 | | YituTechConvBert | 1 | 0.9718 | 0.7819 | nan | nan | 0.8618 | | RobertaForCausalLM | 4 | 0.9237 | 0.7741 | nan | nan | 0.8574 | | OPTForCausalLM | 4 | 0.9974 | 0.75 | nan | nan | 0.8483 | | PegasusForCausalLM | 8 | 0.999 | 0.9444 | nan | nan | 0.8445 | | PLBartForConditionalGeneration | 8 | 0.9975 | 0.8294 | nan | nan | 0.8438 | | MegatronBertForQuestionAnswering | 8 | 0.9051 | 0.8218 | nan | nan | 0.8434 | | BigBird | 1 | 1.0008 | 0.9533 | nan | nan | 0.8348 | | DistillGPT2 | 1 | 0.9963 | 0.7527 | nan | nan | 0.8288 | | XGLMForCausalLM | 1 | 1.0 | 0.999 | nan | nan | 0.7913 | | MegatronBertForCausalLM | 2 | 0.7726 | 0.7726 | nan | nan | 0.7726 | | PegasusForConditionalGeneration | 4 | 0.9994 | 0.9194 | nan | nan | 0.7686 | | M2M100ForConditionalGeneration | 2 | 1.0 | 0.9585 | nan | nan | 0.7175 | | MobileBertForMaskedLM | 16 | 0.9985 | 0.8983 | nan | nan | 0.6948 | | ElectraForCausalLM | 1 | 0.9993 | 0.8955 | nan | nan | 0.6701 | | MobileBertForQuestionAnswering | 32 | 1.0142 | 0.9796 | nan | nan | 0.6265 | | MT5ForConditionalGeneration | 2 | 0.6019 | 0.6019 | 
nan | nan | 0.6019 | | DebertaForMaskedLM | 4 | 0.9982 | 0.9824 | 0.3598 | nan | 0.4498 | | DebertaForQuestionAnswering | 4 | 0.9792 | 1.0574 | 0.3577 | nan | 0.3761 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | +-----------------------------------------+----+--------+-----------+----------------+-------------+----------+ ~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | res2net50_14w_8s | 2 | 1.0025 | 0.8989 | 0.0 | 1.3928 | 5.4582 | | res2next50 | 2 | 1.0036 | 0.9285 | 0.0 | 1.3683 | 5.2409 | | hrnet_w18 | 2 | 1.0062 | 0.957 | 0.0 | 1.3783 | 5.0807 | | twins_pcpvt_base | 32 | 1.0041 | 0.8853 | 0.0 | 1.2699 | 2.615 | | resnest101e | 32 | 1.0044 | 0.9809 | 0.0 | 1.4177 | 2.3441 | | regnety_002 | 128 | 0.9787 | 0.9323 | 0.0 | 1.3798 | 2.1367 | | cait_m36_384 | 2 | 0.9999 | 0.8454 | 0.0 | 1.3838 | 2.0929 | | xcit_large_24_p8_224 | 5 | 0.9988 | 0.0 | 0.0 | 0.0 | 2.0773 | | tnt_s_patch16_224 | 64 | 1.0001 | 0.996 | 0.0 | 1.8886 | 2.0751 | | ghostnet_100 | 128 | 1.0033 | 0.999 | 0.0 | 1.5476 | 2.0386 | | lcnet_050 | 128 | 0.9678 | 0.9549 | 0.0 | 1.5598 | 2.0212 | | nfnet_l0 | 64 | 1.0076 | 0.8344 | 0.0 | 1.1376 | 1.7831 | | gmixer_24_224 | 64 | 0.9996 | 0.884 | 0.6424 | 1.0029 | 1.6677 | | mobilevit_s | 32 | 0.9755 | 0.7952 | 0.0 | 1.2119 | 1.656 | | dla102 | 64 | 1.0 | 0.9914 | 0.0 | 1.3805 | 1.6115 | | volo_d1_224 | 64 | 0.9996 | 0.9946 | 0.0 | 1.1413 | 1.6004 | | crossvit_9_240 | 64 | 1.006 | 0.9545 | 0.0 | 1.1212 | 1.5835 | | res2net101_26w_4s | 64 | 1.0007 | 0.9965 | 0.0 | 1.4297 | 1.5781 | | swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9571 | 0.0 | 1.0453 | 1.521 | | gluon_inception_v3 | 128 | 0.9999 | 0.9963 | 0.0 | 1.1948 | 1.5048 | | adv_inception_v3 | 128 | 0.9999 | 0.9966 | 0.0 | 1.1948 | 1.5042 | | inception_v3 | 128 | 0.9998 | 0.9963 | 0.0 | 1.1947 | 1.5019 | | dm_nfnet_f0 | 128 | 0.9989 | 0.9997 | 0.0 | 1.1778 | 1.4949 | | mobilenetv3_large_100 | 128 | 0.9547 | 0.9449 | 0.0 | 1.3737 | 1.4633 | | resmlp_12_224 | 128 | 1.0 | 0.9988 | 0.7769 | 0.0 | 1.4475 | | selecsls42b | 128 | 0.9998 | 0.9961 | 0.0 | 1.3583 | 1.42 
| | fbnetv3_b | 128 | 0.9523 | 0.9493 | 0.0 | 1.2549 | 1.4082 | | coat_lite_mini | 128 | 1.0002 | 0.9891 | 0.0 | 1.2187 | 1.4047 | | mnasnet_100 | 128 | 0.9535 | 0.944 | 0.6644 | 1.3691 | 1.4015 | | jx_nest_base | 32 | 0.9998 | 0.9932 | 0.0 | 1.2246 | 1.3932 | | gmlp_s16_224 | 64 | 0.9996 | 0.984 | 0.0 | 1.0389 | 1.3825 | | pnasnet5large | 16 | 1.006 | 1.0329 | 0.0 | 1.1815 | 1.3739 | | mobilenetv2_100 | 128 | 0.951 | 0.9414 | 0.0 | 0.8619 | 1.3718 | | spnasnet_100 | 128 | 0.9478 | 0.938 | 0.6462 | 1.3169 | 1.3674 | | pit_b_224 | 64 | 1.0 | 0.995 | 0.0 | 1.0618 | 1.3584 | | ese_vovnet19b_dw | 128 | 0.9692 | 0.9649 | 0.0 | 1.2477 | 1.3565 | | convit_base | 32 | 0.9999 | 0.9925 | 0.0 | 0.0 | 1.3442 | | fbnetc_100 | 128 | 0.9534 | 0.9432 | 0.6707 | 1.3751 | 1.3411 | | tf_efficientnet_b0 | 128 | 0.9654 | 0.8079 | 0.0 | 1.0911 | 1.3307 | | cspdarknet53 | 64 | 0.9421 | 0.9338 | 0.0 | 0.9014 | 1.3303 | | poolformer_m36 | 64 | 1.0001 | 0.9979 | 0.0 | 0.0 | 1.3282 | | botnet26t_256 | 128 | 0.9798 | 0.9749 | 0.0 | 1.3468 | 1.3078 | | tinynet_a | 128 | 0.9599 | 0.7899 | 0.0 | 1.1514 | 1.3006 | | beit_base_patch16_224 | 64 | 1.0 | 0.9783 | 0.0 | 1.0434 | 1.2865 | | deit_base_distilled_patch16_224 | 64 | 1.0 | 0.9908 | 0.0 | 1.0625 | 1.2819 | | rexnet_100 | 128 | 0.9644 | 0.8507 | 0.0 | 1.0373 | 1.276 | | mixnet_l | 64 | 0.9803 | 0.8889 | 0.0 | 1.0862 | 1.2668 | | eca_botnext26ts_256 | 64 | 0.9612 | 0.8004 | 0.0 | 1.1083 | 1.247 | | mixer_b16_224 | 64 | 0.9999 | 0.991 | 0.7133 | 0.9569 | 1.2469 | | visformer_small | 128 | 0.9996 | 1.002 | 0.0 | 1.0833 | 1.2395 | | tf_mixnet_l | 64 | 0.9828 | 0.8978 | 0.0 | 1.0668 | 1.2336 | | sebotnet33ts_256 | 64 | 0.9665 | 0.836 | 0.0 | 1.1165 | 1.2145 | | vit_base_patch16_224 | 64 | 0.9999 | 0.9938 | 0.0 | 0.9929 | 1.1942 | | dpn107 | 32 | 0.9383 | 0.9315 | 0.0 | 0.9919 | 1.1809 | | gluon_xception65 | 32 | 0.9996 | 0.9895 | 0.0 | 1.0644 | 1.1611 | | repvgg_a2 | 128 | 0.9435 | 0.9342 | 0.6562 | 1.1307 | 1.1366 | | swsl_resnext101_32x16d | 32 
| 0.9997 | 0.9824 | 0.0 | 1.076 | 1.1315 | | gernet_l | 128 | 0.9461 | 0.9388 | 0.0 | 1.1424 | 1.0671 | | convmixer_768_32 | 32 | 0.9999 | 0.9982 | 0.0 | 1.0532 | 1.0557 | | convnext_base | 32 | 1.0099 | 0.9298 | 0.0 | 1.2137 | 0.7229 | | eca_halonext26ts | 64 | 0.9638 | 0.806 | 0.0 | 1.0966 | 0.0 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+---------------+---------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+ | fbnetc_100 | 2 | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | | adv_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | botnet26t_256 | 2 | pass | pass | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | pass | pass | | crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass | | cspdarknet53 | 2 | pass | pass | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | dla102 | 2 | pass | pass | fail_to_run | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | fail_to_run | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | fail_to_run | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | fail_to_run | pass | pass | | gernet_l | 2 | pass | pass | fail_to_run | pass | pass | | ghostnet_100 | 2 | pass | pass | fail_to_run | pass | pass | | gluon_inception_v3 | 2 | pass | pass | fail_to_run | pass | pass | | hrnet_w18 | 2 | pass | pass | fail_to_run | pass | pass | | 
inception_v3 | 2 | pass | pass | fail_to_run | pass | pass | | lcnet_050 | 2 | pass | pass | fail_to_run | pass | pass | | mixnet_l | 2 | pass | pass | fail_to_run | pass | pass | | mobilenetv2_100 | 2 | pass | pass | fail_to_run | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | fail_to_run | pass | pass | | nfnet_l0 | 2 | pass | pass | fail_to_run | pass | pass | | pnasnet5large | 2 | pass | pass | fail_to_run | pass | pass | | regnety_002 | 2 | pass | pass | fail_to_run | pass | pass | | res2net101_26w_4s | 2 | pass | pass | fail_to_run | pass | pass | | res2net50_14w_8s | 2 | pass | pass | fail_to_run | pass | pass | | res2next50 | 2 | pass | pass | fail_to_run | pass | pass | | rexnet_100 | 2 | pass | pass | fail_to_run | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | fail_to_run | pass | pass | | selecsls42b | 2 | pass | pass | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | fail_to_run | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | fail_to_run | pass | pass | | tf_mixnet_l | 2 | pass | pass | fail_to_run | pass | pass | | tinynet_a | 2 | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | visformer_small | 2 | pass | pass | fail_to_run | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | | gmixer_24_224 | 2 | pass | pass | pass | fail_accuracy | pass | | gmlp_s16_224 | 2 | pass | pass | pass | fail_accuracy | pass | | mixer_b16_224 | 2 | pass | pass | pass | fail_accuracy | pass | | poolformer_m36 | 2 | 
pass | pass | fail_to_run | fail_accuracy | pass | | resnest101e | 2 | pass | pass | fail_to_run | fail_accuracy | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | jx_nest_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | pit_b_224 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | twins_pcpvt_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | | eca_halonext26ts | 2 | pass | pass | fail_to_run | pass | fail_to_run | | gluon_xception65 | 2 | pass | pass | fail_to_run | pass | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | | fbnetv3_b | 2 | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | twins_pcpvt_base | 32 | 2.8843 | 19.2931 | nan | 72.4687 | 871.1403 | | coat_lite_mini | 128 | 1.1607 | 6.8772 | nan | 33.2129 | 807.967 | | mobilevit_s | 32 | 2.0558 | 9.7982 | nan | 58.1758 | 721.9393 | | eca_botnext26ts_256 | 64 | 1.591 | 6.1393 | nan | 63.5595 | 558.2453 | | swin_base_patch4_window7_224 | 64 | 3.0716 | 16.3405 | nan | 72.1749 | 499.9333 | | convnext_base | 32 | 1.6706 | 8.9108 | nan | 35.867 | 483.2362 | | sebotnet33ts_256 | 64 | 1.8986 | 8.6752 | nan | 68.6287 | 462.4035 | | botnet26t_256 | 128 | 1.4732 | 5.7738 | nan | 50.2327 | 409.7373 | | rexnet_100 | 128 | 2.2029 | 10.3698 | nan | 117.4054 | 387.7299 | | jx_nest_base | 32 | 1.8489 | 11.7585 | nan | 50.5117 | 383.5946 | | mixnet_l | 64 | 5.7531 | 
16.3643 | nan | 81.6749 | 354.5396 | | xcit_large_24_p8_224 | 5 | 3.3798 | nan | nan | nan | 354.253 | | ghostnet_100 | 128 | 3.3561 | 13.1798 | nan | 91.6956 | 340.3388 | | resnest101e | 32 | 3.7222 | 23.4942 | nan | 101.1366 | 327.116 | | dpn107 | 32 | 4.2898 | 18.5481 | nan | 102.7943 | 325.4081 | | tf_mixnet_l | 64 | 6.0093 | 15.9106 | nan | 81.25 | 318.9435 | | pnasnet5large | 16 | 5.3468 | 31.619 | nan | 194.3739 | 316.4984 | | crossvit_9_240 | 64 | 1.8205 | 10.925 | nan | 36.5965 | 301.2294 | | hrnet_w18 | 2 | 7.2856 | 44.0062 | nan | 370.9935 | 290.2174 | | cait_m36_384 | 2 | 3.4372 | 25.239 | nan | 61.2635 | 280.0794 | | fbnetv3_b | 128 | 3.5101 | 14.9782 | nan | 101.1606 | 274.5077 | | visformer_small | 128 | 1.0126 | 5.1658 | nan | 31.0174 | 249.6162 | | volo_d1_224 | 64 | 1.4324 | 9.8537 | nan | 39.6369 | 247.3936 | | inception_v3 | 128 | 2.0488 | 13.4329 | nan | 99.4931 | 234.6812 | | gluon_inception_v3 | 128 | 2.0003 | 12.5288 | nan | 99.4546 | 234.475 | | tinynet_a | 128 | 2.3578 | 10.7604 | nan | 80.8993 | 233.3892 | | adv_inception_v3 | 128 | 2.1544 | 12.5395 | nan | 99.0938 | 233.3628 | | tf_efficientnet_b0 | 128 | 2.1054 | 8.9852 | nan | 78.7914 | 232.4157 | | dla102 | 64 | 2.1102 | 13.6918 | nan | 87.9253 | 226.2379 | | pit_b_224 | 64 | 1.1537 | 6.7464 | nan | 24.7865 | 223.8173 | | mobilenetv3_large_100 | 128 | 1.8808 | 7.692 | nan | 84.2177 | 213.0799 | | res2net50_14w_8s | 2 | 3.3958 | 20.9949 | nan | 104.6199 | 213.0645 | | convit_base | 32 | 1.2775 | 7.936 | nan | nan | 212.6967 | | res2net101_26w_4s | 64 | 3.5635 | 23.5875 | nan | 122.3856 | 202.6938 | | fbnetc_100 | 128 | 2.3018 | 9.7552 | 82.8618 | 61.6258 | 188.447 | | spnasnet_100 | 128 | 2.3247 | 9.0716 | 94.9656 | 57.8461 | 186.491 | | poolformer_m36 | 64 | 1.984 | 11.108 | nan | nan | 185.8602 | | tnt_s_patch16_224 | 64 | 2.0138 | 13.818 | nan | 38.6371 | 178.7083 | | gmlp_s16_224 | 64 | 1.3743 | 9.499 | nan | 22.3145 | 169.6309 | | mnasnet_100 | 128 | 1.9848 | 7.3755 | 59.8699 | 
52.113 | 166.9227 | | cspdarknet53 | 64 | 2.6554 | 10.1594 | nan | 41.6697 | 166.3958 | | res2next50 | 2 | 1.9943 | 11.9601 | nan | 59.6704 | 164.9944 | | mobilenetv2_100 | 128 | 1.9073 | 7.6148 | nan | 41.6747 | 158.2622 | | selecsls42b | 128 | 0.8859 | 5.4934 | nan | 51.9674 | 152.6904 | | regnety_002 | 128 | 1.8571 | 7.9661 | nan | 56.8403 | 148.3253 | | gluon_xception65 | 32 | 2.406 | 15.51 | nan | 64.703 | 146.3288 | | gmixer_24_224 | 64 | 1.5741 | 10.4668 | 55.1733 | 27.6026 | 142.3327 | | dm_nfnet_f0 | 128 | 2.1876 | 8.9314 | nan | 38.4893 | 130.2856 | | ese_vovnet19b_dw | 128 | 1.1165 | 4.2735 | nan | 39.1374 | 129.6364 | | nfnet_l0 | 64 | 1.8907 | 9.0197 | nan | 34.6809 | 123.1067 | | resmlp_12_224 | 128 | 0.6251 | 4.279 | 8.2348 | nan | 119.7689 | | gernet_l | 128 | 2.2817 | 8.4993 | nan | 45.043 | 119.3916 | | lcnet_050 | 128 | 1.1419 | 4.5993 | nan | 39.1844 | 116.7487 | | mixer_b16_224 | 64 | 0.7133 | 4.7267 | 14.377 | 16.1639 | 115.9064 | | repvgg_a2 | 128 | 2.2565 | 8.1301 | 52.0702 | 63.7094 | 113.6453 | | deit_base_distilled_patch16_224 | 64 | 0.9625 | 6.1138 | nan | 14.4419 | 110.7006 | | swsl_resnext101_32x16d | 32 | 2.2174 | 13.4744 | nan | 53.9575 | 104.2414 | | beit_base_patch16_224 | 64 | 1.302 | 7.2463 | nan | 18.6805 | 102.7557 | | vit_base_patch16_224 | 64 | 0.9171 | 6.0165 | nan | 14.1213 | 79.5957 | | convmixer_768_32 | 32 | 1.4656 | 9.238 | nan | 18.6081 | 47.1118 | | eca_halonext26ts | 64 | 1.5518 | 6.5765 | nan | 67.2819 | nan | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+ | tinynet_a | 128 | 0.9889 | 0.7884 | nan | 0.7887 | 1.3707 | | gmixer_24_224 | 64 | 
0.9922 | 0.9494 | 0.2212 | 0.8991 | 1.2577 | | gmlp_s16_224 | 64 | 0.9939 | 0.9623 | nan | 0.92 | 1.2405 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | nan | 0.8392 | 1.173 | | pnasnet5large | 16 | 1.0575 | 0.9913 | nan | 1.1722 | 1.1609 | | rexnet_100 | 128 | 0.9885 | 0.785 | nan | 0.8648 | 1.1475 | | mobilevit_s | 32 | 0.9926 | 0.7681 | nan | 0.787 | 1.1122 | | eca_botnext26ts_256 | 64 | 0.9888 | 0.7708 | nan | 0.7788 | 1.1081 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | nan | nan | 1.1021 | | dla102 | 64 | 0.9931 | 0.9487 | nan | 0.9751 | 1.079 | | tnt_s_patch16_224 | 64 | 0.9948 | 0.9668 | nan | 0.9431 | 1.0469 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | nan | 0.9443 | 1.0336 | | resnest101e | 32 | 0.9955 | 0.9721 | nan | 0.9532 | 1.0272 | | convit_base | 32 | 0.9972 | 0.8582 | nan | nan | 1.0248 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | nan | 0.8587 | 1.0138 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | nan | 0.9129 | 1.0048 | | nfnet_l0 | 64 | 0.9884 | 0.8166 | nan | 0.8207 | 1.0037 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | 0.9298 | 1.0004 | | ghostnet_100 | 128 | 0.9756 | 0.87 | nan | 0.9026 | 0.9897 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | nan | 0.9714 | 0.9746 | | pit_b_224 | 64 | 0.999 | 0.8053 | nan | 0.8179 | 0.9746 | | selecsls42b | 128 | 0.9789 | 0.876 | nan | 0.8772 | 0.9715 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | nan | 0.79 | 0.9645 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | nan | 0.9146 | 0.9605 | | visformer_small | 128 | 0.9899 | 0.9259 | nan | 0.8884 | 0.9382 | | twins_pcpvt_base | 32 | 0.9938 | 0.9046 | nan | 0.8007 | 0.9335 | | tf_mixnet_l | 64 | 0.9903 | 0.8556 | nan | 0.8366 | 0.9291 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9289 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | nan | 0.8487 | 0.9112 | | dpn107 | 32 | 0.997 | 0.9097 | nan | 0.8814 | 0.9078 | | mixer_b16_224 | 64 | 0.9929 | 0.9361 | 0.2528 | 0.7726 | 0.8978 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | nan | 0.8524 | 0.8964 | | 
cait_m36_384 | 2 | 0.9993 | 0.8803 | nan | 0.903 | 0.8949 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | nan | 0.8641 | 0.8948 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | nan | 0.8854 | 0.8924 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | nan | 0.8801 | 0.8916 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | nan | 0.8794 | 0.8911 | | convnext_base | 32 | 1.0034 | 0.9053 | nan | 0.7521 | 0.8848 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | nan | 0.8538 | 0.8845 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | nan | 0.8538 | 0.8845 | | inception_v3 | 128 | 0.9824 | 0.8621 | nan | 0.8538 | 0.8845 | | mixnet_l | 64 | 0.99 | 0.8439 | nan | 0.7742 | 0.8647 | | gernet_l | 128 | 0.9794 | 0.8503 | nan | 0.8158 | 0.8621 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.1645 | 0.8371 | 0.8602 | | cspdarknet53 | 64 | 0.9913 | 0.8405 | nan | 0.7908 | 0.8512 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.1662 | 0.8252 | 0.8503 | | botnet26t_256 | 128 | 0.9849 | 0.864 | nan | 0.7708 | 0.8503 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.162 | 0.7352 | 0.8387 | | hrnet_w18 | 2 | 0.9971 | 0.8333 | nan | 0.8355 | 0.8367 | | lcnet_050 | 128 | 0.9433 | 0.7566 | nan | 0.7559 | 0.8309 | | regnety_002 | 128 | 0.9504 | 0.7948 | nan | 0.7515 | 0.8245 | | res2next50 | 2 | 0.9976 | 0.8277 | nan | 0.8198 | 0.8231 | | res2net50_14w_8s | 2 | 0.9968 | 0.824 | nan | 0.8169 | 0.8228 | | resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | nan | 0.8092 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | nan | 0.6593 | 0.7962 | | crossvit_9_240 | 64 | 0.9874 | 0.8698 | nan | 0.8854 | 0.7934 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.1439 | 0.6789 | 0.7903 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | 0.8451 | 0.7566 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | nan | 0.7354 | 0.7449 | | jx_nest_base | 32 | 0.9983 | 0.8927 | nan | 0.86 | 0.6708 | | eca_halonext26ts | 64 | 0.9885 | 0.775 | nan | 0.7792 | nan | 
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+ ~~~

Performance graphs

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/stcrOIX.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/Kn9tEjS.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/Tv5xGmf.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and the gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.
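
The accuracy check described above can be sketched roughly as follows. This is only an illustration, not the benchmark harness itself: the function name `check_accuracy`, the flat-list representation of tensors, and the tolerance values are all assumptions. Only the status strings (`pass`, `fail_accuracy`, `fail_to_run`) come from the tables in this thread.

```python
import math


def check_accuracy(reference, candidate, rtol=1e-3, atol=1e-3):
    """Classify a compiled run against the native (eager) run.

    `reference` and `candidate` map a name (an output or a gradient)
    to a flat list of floats. `candidate` is None when the compiled
    run crashed. Tolerances are illustrative assumptions.
    """
    if candidate is None:
        return "fail_to_run"
    for name, ref_values in reference.items():
        cand_values = candidate.get(name)
        # A missing tensor or a shape mismatch counts as an accuracy failure.
        if cand_values is None or len(cand_values) != len(ref_values):
            return "fail_accuracy"
        for ref, cand in zip(ref_values, cand_values):
            if not math.isclose(ref, cand, rel_tol=rtol, abs_tol=atol):
                return "fail_accuracy"
    return "pass"
```

Both forward outputs and gradients go through the same comparison, so a backend that gets the forward pass right but produces wrong gradients is still reported as `fail_accuracy`.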

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
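
A minimal sketch of that filtering step, combined with the normalization against native PyTorch, might look like this. The function name, the dictionary layout, and the choice to treat `pass_due_to_skip` as passing are assumptions made for illustration.

```python
def speedups_for_passing_models(accuracy, eager_ms, compiled_ms):
    """Return {model: speedup} for models that passed the accuracy check.

    accuracy:    {model: status string from the Accuracy table}
    eager_ms:    {model: native PyTorch iteration time}
    compiled_ms: {model: compiled iteration time}
    """
    # Assumption: skipped accuracy checks are counted as passing.
    passing = {"pass", "pass_due_to_skip"}
    result = {}
    for name, status in accuracy.items():
        if status not in passing:
            continue  # drop fail_to_run / fail_accuracy models
        if name not in eager_ms or name not in compiled_ms:
            continue  # no timing available for this model
        # Speedup > 1 means the compiled run is faster than eager.
        result[name] = eager_ms[name] / compiled_ms[name]
    return result
```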

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
| inductor_no_cudagraphs | 84%, 47/56 | 91%, 40/44  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
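
For reference, the "84%, 47/56" style of entry can be produced from a list of per-model statuses along these lines. The helper name and the set of statuses counted as passing are assumptions.

```python
def format_passrate(statuses, passing=("pass", "pass_due_to_skip")):
    """Format a pass rate as 'NN%, passed/total' from per-model statuses."""
    total = len(statuses)
    if total == 0:
        return "n/a"
    passed = sum(1 for status in statuses if status in passing)
    return f"{100 * passed / total:.0f}%, {passed}/{total}"
```

With 47 passing models out of 56, this yields `84%, 47/56`, matching the torchbench entry above.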

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
| inductor_no_cudagraphs |   1.16x    |    1.19x    |    1.23x    |
+------------------------+------------+-------------+-------------+
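
A geometric mean over per-model speedups can be computed as in the sketch below. It assumes that rows reported as `0.0` (models that failed) are dropped before averaging; the helper name is hypothetical.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive speedups; NaN if there are none."""
    # Assumption: 0.0 marks a failed model and is excluded.
    valid = [s for s in speedups if s > 0]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(s) for s in valid) / len(valid))
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown cancel out to 1x, which an arithmetic mean would not show.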

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
| inductor_no_cudagraphs |   57.71    |    46.53    |    79.81    |
+------------------------+------------+-------------+-------------+
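
The per-suite mean can be sketched as a plain average that skips the `nan` entries the per-model tables use for models that did not run. The helper name is an assumption.

```python
import math


def mean_compile_time(latencies_sec):
    """Arithmetic mean of compilation latencies, ignoring NaN entries."""
    valid = [t for t in latencies_sec if not math.isnan(t)]
    if not valid:
        return float("nan")
    return sum(valid) / len(valid)
```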

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
| inductor_no_cudagraphs |   0.93x    |    0.94x    |    1.01x    |
+------------------------+------------+-------------+-------------+
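
As a rough sketch, and assuming the ratio is defined as native PyTorch peak memory divided by compiled peak memory (which is consistent with "higher is better"), the per-model number could be computed like this. The function and parameter names are hypothetical.

```python
def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Peak memory of the eager run divided by that of the compiled run.

    Values above 1.0 mean the compiled run used less peak memory.
    """
    if compiled_peak_bytes <= 0:
        raise ValueError("compiled peak memory must be positive")
    return eager_peak_bytes / compiled_peak_bytes
```

Under that assumption, a ratio of 0.93x means the compiled run peaked at about 1.08 times the memory of the eager run.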

torchbench suite with float32 precision

Performance speedup
~~~
+-----------------------------------+------+------------------------+
| name                              | bs   | inductor_no_cudagraphs |
+-----------------------------------+------+------------------------+
| hf_Albert                         | 8    | 1.6654                 |
| hf_T5                             | 8    | 1.5034                 |
| timm_resnest                      | 32   | 1.4518                 |
| timm_nfnet                        | 128  | 1.4214                 |
| mobilenet_v2                      | 96   | 1.4114                 |
| BERT_pytorch                      | 16   | 1.3949                 |
| hf_GPT2                           | 4    | 1.3789                 |
| hf_GPT2_large                     | 4    | 1.3631                 |
| shufflenet_v2_x1_0                | 128  | 1.3536                 |
| timm_efficientdet                 | 1    | 1.3438                 |
| fastNLP_Bert                      | 6    | 1.3427                 |
| hf_T5_large                       | 2    | 1.2981                 |
| mobilenet_v3_large                | 32   | 1.2546                 |
| mnasnet1_0                        | 32   | 1.2175                 |
| timm_vision_transformer           | 8    | 1.1952                 |
| squeezenet1_1                     | 32   | 1.1897                 |
| pytorch_unet                      | 1    | 1.1867                 |
| resnet50                          | 32   | 1.1683                 |
| vgg16                             | 64   | 1.1661                 |
| alexnet                           | 128  | 1.1658                 |
| Super_SloMo                       | 6    | 1.1649                 |
| hf_DistilBert                     | 8    | 1.155                  |
| LearningToPaint                   | 96   | 1.1542                 |
| resnet18                          | 16   | 1.1464                 |
| densenet121                       | 4    | 1.1455                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 1.1455                 |
| hf_Reformer                       | 4    | 1.1294                 |
| resnext50_32x4d                   | 8    | 1.1177                 |
| hf_Bert                           | 4    | 1.1118                 |
| Background_Matting                | 4    | 1.1066                 |
| timm_efficientnet                 | 32   | 1.1036                 |
| hf_Bart                           | 4    | 1.1016                 |
| timm_regnet                       | 32   | 1.094                  |
| functorch_dp_cifar10              | 64   | 1.0896                 |
| pytorch_stargan                   | 16   | 1.0889                 |
| pytorch_struct                    | 200  | 1.0721                 |
| yolov3                            | 16   | 1.0642                 |
| timm_vision_transformer_large     | 8    | 1.0381                 |
| dcgan                             | 32   | 1.0299                 |
| attention_is_all_you_need_pytorch | 256  | 1.0285                 |
| hf_BigBird                        | 2    | 1.0105                 |
| timm_vovnet                       | 32   | 1.0014                 |
| tts_angular                       | 64   | 1.0006                 |
| demucs                            | 4    | 1.0001                 |
| drq                               | 1    | 0.9774                 |
| nvidia_deeprecommender            | 256  | 0.9641                 |
| lennard_jones                     | 1000 | 0.854                  |
| soft_actor_critic                 | 256  | 0.821                  |
| resnet50_quantized_qat            | 0    | 0.0                    |
| dlrm                              | 0    | 0.0                    |
| detectron2_fcos_r_50_fpn          | 0    | 0.0                    |
| tacotron2                         | 0    | 0.0                    |
| hf_Longformer                     | 0    | 0.0                    |
| speech_transformer                | 0    | 0.0                    |
| moco                              | 0    | 0.0                    |
| mobilenet_v2_quantized_qat        | 0    | 0.0                    |
+-----------------------------------+------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------------+
| name                              | bs  | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------------+
| hf_T5_large                       | 2   | pass_due_to_skip       |
| timm_vision_transformer_large     | 2   | pass_due_to_skip       |
| hf_GPT2_large                     | 2   | pass_due_to_skip       |
| BERT_pytorch                      | 2   | pass                   |
| shufflenet_v2_x1_0                | 2   | pass                   |
| mobilenet_v3_large                | 2   | pass                   |
| nvidia_deeprecommender            | 2   | pass                   |
| pytorch_CycleGAN_and_pix2pix      | 1   | pass                   |
| pytorch_stargan                   | 16  | pass                   |
| pytorch_struct                    | 200 | pass                   |
| pytorch_unet                      | 2   | pass                   |
| resnet18                          | 2   | pass                   |
| resnet50                          | 2   | pass                   |
| resnext50_32x4d                   | 2   | pass                   |
| soft_actor_critic                 | 256 | pass                   |
| mobilenet_v2                      | 2   | pass                   |
| squeezenet1_1                     | 2   | pass                   |
| timm_efficientdet                 | 2   | pass                   |
| timm_efficientnet                 | 2   | pass                   |
| timm_nfnet                        | 2   | pass                   |
| timm_regnet                       | 2   | pass                   |
| timm_resnest                      | 2   | pass                   |
| timm_vision_transformer           | 2   | pass                   |
| timm_vovnet                       | 2   | pass                   |
| tts_angular                       | 2   | pass                   |
| vgg16                             | 2   | pass                   |
| Background_Matting                | 4   | pass                   |
| yolov3                            | 2   | pass                   |
| mnasnet1_0                        | 2   | pass                   |
| densenet121                       | 2   | pass                   |
| drq                               | 1   | pass                   |
| demucs                            | 4   | pass                   |
| fastNLP_Bert                      | 2   | pass                   |
| lennard_jones                     | 2   | pass                   |
| functorch_dp_cifar10              | 2   | pass                   |
| hf_Albert                         | 2   | pass                   |
| hf_Bart                           | 2   | pass                   |
| dcgan                             | 2   | pass                   |
| hf_Bert                           | 2   | pass                   |
| hf_BigBird                        | 2   | pass                   |
| hf_DistilBert                     | 2   | pass                   |
| hf_GPT2                           | 2   | pass                   |
| attention_is_all_you_need_pytorch | 2   | pass                   |
| hf_Reformer                       | 2   | pass                   |
| alexnet                           | 2   | pass                   |
| Super_SloMo                       | 2   | pass                   |
| LearningToPaint                   | 2   | pass                   |
| dlrm                              | 2   | pass                   |
| speech_transformer                | 2   | fail_to_run            |
| tacotron2                         | 2   | fail_to_run            |
| resnet50_quantized_qat            | 2   | fail_to_run            |
| hf_Longformer                     | 2   | fail_to_run            |
| moco                              | 2   | fail_to_run            |
| mobilenet_v2_quantized_qat        | 2   | fail_to_run            |
| hf_T5                             | 2   | fail_accuracy          |
| hf_T5_base                        | 2   | fail_accuracy          |
| detectron2_fcos_r_50_fpn          | 0   | 0.0000                 |
| vision_maskrcnn                   | 0   | 0.0000                 |
+-----------------------------------+-----+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+------------------------+
| name                              | bs   | inductor_no_cudagraphs |
+-----------------------------------+------+------------------------+
| timm_efficientdet                 | 1    | 468.7017               |
| yolov3                            | 16   | 412.4742               |
| hf_T5_large                       | 2    | 201.9938               |
| attention_is_all_you_need_pytorch | 256  | 139.2113               |
| hf_GPT2_large                     | 4    | 135.7083               |
| timm_vision_transformer           | 8    | 132.5885               |
| timm_resnest                      | 32   | 130.3164               |
| pytorch_stargan                   | 16   | 109.0045               |
| timm_vision_transformer_large     | 8    | 101.2999               |
| pytorch_struct                    | 200  | 96.8387                |
| BERT_pytorch                      | 16   | 91.374                 |
| fastNLP_Bert                      | 6    | 62.5788                |
| hf_GPT2                           | 4    | 58.9256                |
| hf_Bart                           | 4    | 48.2588                |
| hf_T5                             | 8    | 44.4672                |
| densenet121                       | 4    | 43.6522                |
| hf_Albert                         | 8    | 41.0064                |
| mobilenet_v3_large                | 32   | 30.7343                |
| mnasnet1_0                        | 32   | 29.903                 |
| hf_Bert                           | 4    | 29.4868                |
| resnext50_32x4d                   | 8    | 29.068                 |
| hf_Reformer                       | 4    | 28.857                 |
| timm_nfnet                        | 128  | 27.7243                |
| functorch_dp_cifar10              | 64   | 25.0023                |
| hf_BigBird                        | 2    | 24.174                 |
| resnet18                          | 16   | 22.149                 |
| timm_regnet                       | 32   | 19.9962                |
| timm_efficientnet                 | 32   | 18.8884                |
| hf_DistilBert                     | 8    | 17.2316                |
| shufflenet_v2_x1_0                | 128  | 16.8719                |
| Super_SloMo                       | 6    | 15.9453                |
| Background_Matting                | 4    | 15.7895                |
| mobilenet_v2                      | 96   | 15.5507                |
| timm_vovnet                       | 32   | 14.3515                |
| resnet50                          | 32   | 14.0671                |
| pytorch_unet                      | 1    | 7.6953                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 7.6134                 |
| LearningToPaint                   | 96   | 6.523                  |
| squeezenet1_1                     | 32   | 3.2894                 |
| nvidia_deeprecommender            | 256  | 3.2179                 |
| drq                               | 1    | 2.7713                 |
| vgg16                             | 64   | 2.647                  |
| alexnet                           | 128  | 2.1583                 |
| soft_actor_critic                 | 256  | 2.1351                 |
| dcgan                             | 32   | 2.1126                 |
| lennard_jones                     | 1000 | 1.3283                 |
| tts_angular                       | 64   | 1.1874                 |
| demucs                            | 4    | 0.1992                 |
| detectron2_fcos_r_50_fpn          | 0    | nan                    |
| dlrm                              | 0    | nan                    |
| hf_Longformer                     | 0    | nan                    |
| mobilenet_v2_quantized_qat        | 0    | nan                    |
| moco                              | 0    | nan                    |
| resnet50_quantized_qat            | 0    | nan                    |
| speech_transformer                | 0    | nan                    |
| tacotron2                         | 0    | nan                    |
+-----------------------------------+------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+------------------------+
| name                              | bs   | inductor_no_cudagraphs |
+-----------------------------------+------+------------------------+
| timm_efficientnet                 | 32   | 1.3377                 |
| hf_Albert                         | 8    | 1.1942                 |
| Super_SloMo                       | 6    | 1.1913                 |
| hf_T5                             | 8    | 1.1507                 |
| timm_efficientdet                 | 1    | 1.1428                 |
| squeezenet1_1                     | 32   | 1.1267                 |
| mobilenet_v2                      | 96   | 1.1105                 |
| hf_Bart                           | 4    | 1.0962                 |
| hf_GPT2_large                     | 4    | 1.0941                 |
| hf_GPT2                           | 4    | 1.0819                 |
| fastNLP_Bert                      | 6    | 1.0755                 |
| BERT_pytorch                      | 16   | 1.0689                 |
| timm_nfnet                        | 128  | 1.0495                 |
| hf_BigBird                        | 2    | 1.0404                 |
| shufflenet_v2_x1_0                | 128  | 1.0072                 |
| soft_actor_critic                 | 256  | 0.9991                 |
| lennard_jones                     | 1000 | 0.9989                 |
| pytorch_stargan                   | 16   | 0.9928                 |
| demucs                            | 4    | 0.9886                 |
| tts_angular                       | 64   | 0.9884                 |
| hf_Reformer                       | 4    | 0.9882                 |
| timm_vision_transformer_large     | 8    | 0.9823                 |
| timm_resnest                      | 32   | 0.9688                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.9646                 |
| attention_is_all_you_need_pytorch | 256  | 0.9432                 |
| timm_regnet                       | 32   | 0.9323                 |
| densenet121                       | 4    | 0.9307                 |
| yolov3                            | 16   | 0.9271                 |
| Background_Matting                | 4    | 0.9164                 |
| hf_Bert                           | 4    | 0.9017                 |
| mobilenet_v3_large                | 32   | 0.8964                 |
| resnet50                          | 32   | 0.8913                 |
| drq                               | 1    | 0.8778                 |
| mnasnet1_0                        | 32   | 0.8659                 |
| pytorch_unet                      | 1    | 0.8608                 |
| hf_DistilBert                     | 8    | 0.8605                 |
| resnext50_32x4d                   | 8    | 0.8352                 |
| alexnet                           | 128  | 0.8332                 |
| timm_vovnet                       | 32   | 0.8316                 |
| hf_T5_large                       | 2    | 0.796                  |
| dcgan                             | 32   | 0.7903                 |
| timm_vision_transformer           | 8    | 0.7779                 |
| LearningToPaint                   | 96   | 0.7462                 |
| resnet18                          | 16   | 0.7049                 |
| vgg16                             | 64   | 0.6497                 |
| nvidia_deeprecommender            | 256  | 0.5598                 |
| pytorch_struct                    | 200  | 0.429                  |
| functorch_dp_cifar10              | 64   | 0.4212                 |
| detectron2_fcos_r_50_fpn          | 0    | nan                    |
| dlrm                              | 0    | nan                    |
| hf_Longformer                     | 0    | nan                    |
| mobilenet_v2_quantized_qat        | 0    | nan                    |
| moco                              | 0    | nan                    |
| resnet50_quantized_qat            | 0    | nan                    |
| speech_transformer                | 0    | nan                    |
| tacotron2                         | 0    | nan                    |
+-----------------------------------+------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+----+------------------------+
| name                                    | bs | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------------+
| MT5ForConditionalGeneration             | 2  | 1.7847                 |
| GPT2ForSequenceClassification           | 4  | 1.6428                 |
| XLNetLMHeadModel                        | 4  | 1.4402                 |
| DistillGPT2                             | 1  | 1.4396                 |
| T5ForConditionalGeneration              | 4  | 1.4351                 |
| OPTForCausalLM                          | 4  | 1.3587                 |
| M2M100ForConditionalGeneration          | 2  | 1.3445                 |
| ElectraForQuestionAnswering             | 64 | 1.3409                 |
| CamemBert                               | 1  | 1.3157                 |
| MegatronBertForCausalLM                 | 2  | 1.2953                 |
| AlbertForQuestionAnswering              | 2  | 1.295                  |
| AlbertForMaskedLM                       | 2  | 1.2919                 |
| YituTechConvBert                        | 1  | 1.2885                 |
| ElectraForCausalLM                      | 1  | 1.288                  |
| RobertaForCausalLM                      | 4  | 1.2846                 |
| MobileBertForQuestionAnswering          | 32 | 1.2589                 |
| MobileBertForMaskedLM                   | 16 | 1.2572                 |
| PLBartForConditionalGeneration          | 8  | 1.2468                 |
| XGLMForCausalLM                         | 1  | 1.2433                 |
| LayoutLMForSequenceClassification       | 16 | 1.236                  |
| PegasusForConditionalGeneration         | 4  | 1.2286                 |
| MegatronBertForQuestionAnswering        | 8  | 1.2086                 |
| MBartForConditionalGeneration           | 8  | 1.1866                 |
| LayoutLMForMaskedLM                     | 16 | 1.1736                 |
| TrOCRForCausalLM                        | 8  | 1.1511                 |
| Speech2Text2ForCausalLM                 | 64 | 1.1447                 |
| PegasusForCausalLM                      | 8  | 1.1282                 |
| BartForConditionalGeneration            | 1  | 1.1087                 |
| BartForCausalLM                         | 2  | 1.108                  |
| DistilBertForQuestionAnswering          | 32 | 1.1077                 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0999                 |
| DistilBertForMaskedLM                   | 16 | 1.0853                 |
| PLBartForCausalLM                       | 16 | 1.0841                 |
| RobertaForQuestionAnswering             | 64 | 1.0745                 |
| BertForQuestionAnswering                | 64 | 1.0714                 |
| DebertaForMaskedLM                      | 4  | 1.0529                 |
| T5Small                                 | 1  | 1.0504                 |
| DebertaForQuestionAnswering             | 4  | 1.0466                 |
| BertForMaskedLM                         | 64 | 1.041                  |
| BlenderbotSmallForCausalLM              | 64 | 1.0391                 |
| MBartForCausalLM                        | 16 | 1.0141                 |
| BigBird                                 | 1  | 0.9716                 |
| GoogleFnet                              | 1  | 0.9328                 |
| AllenaiLongformerBase                   | 0  | 0.0                    |
+-----------------------------------------+----+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+------------------------+
| name                                    | bs | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------------+
| AlbertForMaskedLM                       | 1  | pass                   |
| LayoutLMForSequenceClassification       | 1  | pass                   |
| MBartForCausalLM                        | 1  | pass                   |
| MegatronBertForCausalLM                 | 1  | pass                   |
| MegatronBertForQuestionAnswering        | 1  | pass                   |
| MobileBertForMaskedLM                   | 1  | pass                   |
| MobileBertForQuestionAnswering          | 1  | pass                   |
| OPTForCausalLM                          | 1  | pass                   |
| PLBartForCausalLM                       | 1  | pass                   |
| PegasusForCausalLM                      | 1  | pass                   |
| PegasusForConditionalGeneration         | 1  | pass                   |
| RobertaForCausalLM                      | 1  | pass                   |
| RobertaForQuestionAnswering             | 1  | pass                   |
| Speech2Text2ForCausalLM                 | 1  | pass                   |
| T5ForConditionalGeneration              | 1  | pass                   |
| T5Small                                 | 1  | pass                   |
| TrOCRForCausalLM                        | 1  | pass                   |
| XGLMForCausalLM                         | 1  | pass                   |
| XLNetLMHeadModel                        | 1  | pass                   |
| AlbertForQuestionAnswering              | 1  | pass                   |
| M2M100ForConditionalGeneration          | 1  | pass                   |
| LayoutLMForMaskedLM                     | 1  | pass                   |
| GoogleFnet                              | 1  | pass                   |
| BartForCausalLM                         | 1  | pass                   |
| BartForConditionalGeneration            | 1  | pass                   |
| BertForMaskedLM                         | 1  | pass                   |
| BertForQuestionAnswering                | 1  | pass                   |
| BigBird                                 | 1  | pass                   |
| BlenderbotSmallForCausalLM              | 1  | pass                   |
| BlenderbotSmallForConditionalGeneration | 1  | pass                   |
| CamemBert                               | 1  | pass                   |
| DebertaForMaskedLM                      | 1  | pass                   |
| YituTechConvBert                        | 1  | pass                   |
| DebertaForQuestionAnswering             | 1  | pass                   |
| DistilBertForMaskedLM                   | 1  | pass                   |
| DistilBertForQuestionAnswering          | 1  | pass                   |
| DistillGPT2                             | 1  | pass                   |
| ElectraForCausalLM                      | 1  | pass                   |
| ElectraForQuestionAnswering             | 1  | pass                   |
| GPT2ForSequenceClassification           | 1  | pass                   |
| MBartForConditionalGeneration           | 1  | fail_to_run            |
| AllenaiLongformerBase                   | 1  | fail_to_run            |
| PLBartForConditionalGeneration          | 1  | fail_to_run            |
| MT5ForConditionalGeneration             | 1  | fail_accuracy          |
+-----------------------------------------+----+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+----+------------------------+
| name                                    | bs | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------------+
| XLNetLMHeadModel                        | 4  | 291.777                |
| YituTechConvBert                        | 1  | 117.2838               |
| MT5ForConditionalGeneration             | 2  | 104.155                |
| MobileBertForMaskedLM                   | 16 | 92.6544                |
| M2M100ForConditionalGeneration          | 2  | 78.3477                |
| MobileBertForQuestionAnswering          | 32 | 78.1561                |
| PegasusForConditionalGeneration         | 4  | 67.836                 |
| PLBartForConditionalGeneration          | 8  | 66.5782                |
| MBartForConditionalGeneration           | 8  | 62.8985                |
| MegatronBertForCausalLM                 | 2  | 62.8671                |
| T5ForConditionalGeneration              | 4  | 56.6793                |
| XGLMForCausalLM                         | 1  | 55.1411                |
| DebertaForMaskedLM                      | 4  | 54.5502                |
| MegatronBertForQuestionAnswering        | 8  | 54.1915                |
| RobertaForCausalLM                      | 4  | 53.6744                |
| T5Small                                 | 1  | 52.4961                |
| BlenderbotSmallForConditionalGeneration | 32 | 46.6959                |
| BartForConditionalGeneration            | 1  | 46.0636                |
| LayoutLMForSequenceClassification       | 16 | 45.1752                |
| PegasusForCausalLM                      | 8  | 37.3648                |
| MBartForCausalLM                        | 16 | 36.0993                |
| DistillGPT2                             | 1  | 32.4988                |
| OPTForCausalLM                          | 4  | 32.2822                |
| TrOCRForCausalLM                        | 8  | 30.9952                |
| BertForMaskedLM                         | 64 | 30.7106                |
| ElectraForQuestionAnswering             | 64 | 30.3021                |
| LayoutLMForMaskedLM                     | 16 | 29.6613                |
| GPT2ForSequenceClassification           | 4  | 29.4466                |
| DistilBertForQuestionAnswering          | 32 | 28.5283                |
| DebertaForQuestionAnswering             | 4  | 26.7377                |
| BartForCausalLM                         | 2  | 26.6824                |
| AlbertForMaskedLM                       | 2  | 25.7399                |
| BigBird                                 | 1  | 23.9995                |
| PLBartForCausalLM                       | 16 | 22.3645                |
| Speech2Text2ForCausalLM                 | 64 | 21.9017                |
| BlenderbotSmallForCausalLM              | 64 | 21.2607                |
| ElectraForCausalLM                      | 1  | 20.367                 |
| CamemBert                               | 1  | 20.0012                |
| DistilBertForMaskedLM                   | 16 | 19.3193                |
| RobertaForQuestionAnswering             | 64 | 16.5112                |
| BertForQuestionAnswering                | 64 | 16.3605                |
| AlbertForQuestionAnswering              | 2  | 15.2434                |
| GoogleFnet                              | 1  | 13.2948                |
| AllenaiLongformerBase                   | 0  | nan                    |
+-----------------------------------------+----+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+----+------------------------+
| name                                    | bs | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------------+
| DebertaForQuestionAnswering             | 4  | 1.13                   |
| T5ForConditionalGeneration              | 4  | 1.1049                 |
| GPT2ForSequenceClassification           | 4  | 1.0911                 |
| T5Small                                 | 1  | 1.0758                 |
| DebertaForMaskedLM                      | 4  | 1.0346                 |
| BigBird                                 | 1  | 1.0115                 |
| M2M100ForConditionalGeneration          | 2  | 1.005                  |
| BertForQuestionAnswering                | 64 | 1.0032                 |
| RobertaForQuestionAnswering             | 64 | 1.0032                 |
| ElectraForQuestionAnswering             | 64 | 1.0025                 |
| XGLMForCausalLM                         | 1  | 0.9999                 |
| LayoutLMForSequenceClassification       | 16 | 0.9827                 |
| BartForConditionalGeneration            | 1  | 0.9819                 |
| PegasusForConditionalGeneration         | 4  | 0.9769                 |
| XLNetLMHeadModel                        | 4  | 0.9717                 |
| AlbertForQuestionAnswering              | 2  | 0.9674                 |
| TrOCRForCausalLM                        | 8  | 0.9625                 |
| PegasusForCausalLM                      | 8  | 0.9625                 |
| AlbertForMaskedLM                       | 2  | 0.9567                 |
| DistilBertForQuestionAnswering          | 32 | 0.9481                 |
| MBartForConditionalGeneration           | 8  | 0.9416                 |
| LayoutLMForMaskedLM                     | 16 | 0.9409                 |
| GoogleFnet                              | 1  | 0.9366                 |
| PLBartForConditionalGeneration          | 8  | 0.9331                 |
| BartForCausalLM                         | 2  | 0.9329                 |
| DistillGPT2                             | 1  | 0.93                   |
| MegatronBertForQuestionAnswering        | 8  | 0.923                  |
| BertForMaskedLM                         | 64 | 0.922                  |
| MBartForCausalLM                        | 16 | 0.9194                 |
| DistilBertForMaskedLM                   | 16 | 0.9137                 |
| BlenderbotSmallForConditionalGeneration | 32 | 0.913                  |
| YituTechConvBert                        | 1  | 0.9068                 |
| PLBartForCausalLM                       | 16 | 0.903                  |
| OPTForCausalLM                          | 4  | 0.898                  |
| RobertaForCausalLM                      | 4  | 0.8927                 |
| Speech2Text2ForCausalLM                 | 64 | 0.889                  |
| CamemBert                               | 1  | 0.8656                 |
| BlenderbotSmallForCausalLM              | 64 | 0.8452                 |
| MobileBertForMaskedLM                   | 16 | 0.8035                 |
| MegatronBertForCausalLM                 | 2  | 0.7066                 |
| ElectraForCausalLM                      | 1  | 0.7024                 |
| MobileBertForQuestionAnswering          | 32 | 0.6097                 |
| MT5ForConditionalGeneration             | 2  | 0.5416                 |
| AllenaiLongformerBase                   | 0  | nan                    |
+-----------------------------------------+----+------------------------+
~~~

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+------------------------+ | name | bs | inductor_no_cudagraphs | +---------------------------------+-----+------------------------+ | ghostnet_100 | 128 | 1.7407 | | lcnet_050 | 128 | 1.6645 | | coat_lite_mini | 128 | 1.6009 | | tnt_s_patch16_224 | 64 | 1.5205 | | dm_nfnet_f0 | 128 | 1.4216 | | volo_d1_224 | 64 | 1.3664 | | mobilenetv2_100 | 128 | 1.3539 | | mobilenetv3_large_100 | 128 | 1.3468 | | xcit_large_24_p8_224 | 5 | 1.3348 | | dla102 | 64 | 1.3235 | | gmixer_24_224 | 64 | 1.3207 | | nfnet_l0 | 64 | 1.3138 | | adv_inception_v3 | 128 | 1.3083 | | gluon_inception_v3 | 128 | 1.3077 | | inception_v3 | 128 | 1.3074 | | cspdarknet53 | 64 | 1.3038 | | crossvit_9_240 | 64 | 1.3 | | fbnetv3_b | 128 | 1.2955 | | mnasnet_100 | 128 | 1.2817 | | sebotnet33ts_256 | 64 | 1.2795 | | botnet26t_256 | 128 | 1.2675 | | tf_efficientnet_b0 | 128 | 1.264 | | fbnetc_100 | 128 | 1.2639 | | resnest101e | 32 | 1.2607 | | spnasnet_100 | 128 | 1.2556 | | jx_nest_base | 32 | 1.2507 | | regnety_002 | 128 | 1.2382 | | convit_base | 32 | 1.2372 | | selecsls42b | 128 | 1.231 | | ese_vovnet19b_dw | 128 | 1.2247 | | eca_botnext26ts_256 | 64 | 1.222 | | rexnet_100 | 128 | 1.2202 | | eca_halonext26ts | 64 | 1.2158 | | pit_b_224 | 64 | 1.2154 | | tinynet_a | 128 | 1.2002 | | convnext_base | 32 | 1.1952 | | pnasnet5large | 16 | 1.1934 | | dpn107 | 32 | 1.1914 | | res2net101_26w_4s | 64 | 1.191 | | twins_pcpvt_base | 32 | 1.1799 | | repvgg_a2 | 128 | 1.1677 | | cait_m36_384 | 2 | 1.1573 | | tf_mixnet_l | 64 | 1.1473 | | poolformer_m36 | 64 | 1.1472 | | hrnet_w18 | 2 | 1.141 | | swin_base_patch4_window7_224 | 64 | 1.1349 | | gmlp_s16_224 | 64 | 1.1345 | | mobilevit_s | 32 | 1.131 | | mixnet_l | 64 | 1.1293 | | res2net50_14w_8s | 2 | 1.1143 | | beit_base_patch16_224 | 64 | 1.1089 | | res2next50 | 2 | 1.1049 | | deit_base_distilled_patch16_224 | 64 | 1.0897 | | vit_base_patch16_224 | 64 | 1.0779 | | gluon_xception65 | 32 | 1.0753 | | 
convmixer_768_32 | 32 | 1.0737 | | swsl_resnext101_32x16d | 32 | 1.0725 | | gernet_l | 128 | 1.072 | | mixer_b16_224 | 64 | 1.0324 | | visformer_small | 128 | 1.0173 | | resmlp_12_224 | 128 | 0.9751 | +---------------------------------+-----+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+------------------------+ | name | bs | inductor_no_cudagraphs | +---------------------------------+----+------------------------+ | adv_inception_v3 | 2 | pass | | beit_base_patch16_224 | 2 | pass | | mobilenetv2_100 | 2 | pass | | mobilenetv3_large_100 | 2 | pass | | mobilevit_s | 2 | pass | | nfnet_l0 | 2 | pass | | pit_b_224 | 2 | pass | | pnasnet5large | 2 | pass | | poolformer_m36 | 2 | pass | | regnety_002 | 2 | pass | | repvgg_a2 | 2 | pass | | res2net101_26w_4s | 2 | pass | | res2net50_14w_8s | 2 | pass | | res2next50 | 2 | pass | | resmlp_12_224 | 2 | pass | | rexnet_100 | 2 | pass | | sebotnet33ts_256 | 2 | pass | | selecsls42b | 2 | pass | | spnasnet_100 | 2 | pass | | swin_base_patch4_window7_224 | 2 | pass | | swsl_resnext101_32x16d | 2 | pass | | tf_efficientnet_b0 | 2 | pass | | tf_mixnet_l | 2 | pass | | tinynet_a | 2 | pass | | tnt_s_patch16_224 | 2 | pass | | twins_pcpvt_base | 2 | pass | | visformer_small | 2 | pass | | vit_base_patch16_224 | 2 | pass | | volo_d1_224 | 2 | pass | | mnasnet_100 | 2 | pass | | mixnet_l | 2 | pass | | mixer_b16_224 | 2 | pass | | lcnet_050 | 2 | pass | | botnet26t_256 | 2 | pass | | cait_m36_384 | 2 | pass | | coat_lite_mini | 2 | pass | | convit_base | 2 | pass | | convmixer_768_32 | 2 | pass | | convnext_base | 2 | pass | | crossvit_9_240 | 2 | pass | | cspdarknet53 | 2 | pass | | dla102 | 2 | pass | | dm_nfnet_f0 | 2 | pass | | dpn107 | 2 | pass | | eca_botnext26ts_256 | 2 | pass | | eca_halonext26ts | 2 | pass | | ese_vovnet19b_dw | 2 | pass | | fbnetc_100 | 2 | pass | | gernet_l | 2 | pass | | ghostnet_100 | 2 | pass | | gluon_inception_v3 | 2 | pass | | gluon_xception65 | 2 | pass | | 
gmixer_24_224 | 2 | pass | | gmlp_s16_224 | 2 | pass | | hrnet_w18 | 2 | pass | | inception_v3 | 2 | pass | | jx_nest_base | 2 | pass | | xcit_large_24_p8_224 | 2 | pass | | resnest101e | 2 | fail_accuracy | | fbnetv3_b | 2 | fail_accuracy | | deit_base_distilled_patch16_224 | 2 | fail_accuracy | +---------------------------------+----+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+------------------------+ | name | bs | inductor_no_cudagraphs | +---------------------------------+-----+------------------------+ | twins_pcpvt_base | 32 | 555.6571 | | coat_lite_mini | 128 | 359.4672 | | mobilevit_s | 32 | 340.027 | | eca_botnext26ts_256 | 64 | 282.2192 | | eca_halonext26ts | 64 | 259.4407 | | convnext_base | 32 | 228.0653 | | swin_base_patch4_window7_224 | 64 | 172.1811 | | cait_m36_384 | 2 | 164.114 | | xcit_large_24_p8_224 | 5 | 163.5473 | | crossvit_9_240 | 64 | 154.7684 | | jx_nest_base | 32 | 134.0668 | | sebotnet33ts_256 | 64 | 129.4275 | | resnest101e | 32 | 104.9281 | | botnet26t_256 | 128 | 102.8117 | | gmlp_s16_224 | 64 | 92.6649 | | hrnet_w18 | 2 | 84.8777 | | convit_base | 32 | 81.1096 | | volo_d1_224 | 64 | 72.6098 | | gmixer_24_224 | 64 | 69.2979 | | visformer_small | 128 | 68.4914 | | pnasnet5large | 16 | 68.4373 | | pit_b_224 | 64 | 63.5178 | | tnt_s_patch16_224 | 64 | 61.7856 | | res2net101_26w_4s | 64 | 51.385 | | res2net50_14w_8s | 2 | 43.1844 | | poolformer_m36 | 64 | 43.1124 | | mixer_b16_224 | 64 | 38.2836 | | dpn107 | 32 | 36.985 | | resmlp_12_224 | 128 | 36.3702 | | deit_base_distilled_patch16_224 | 64 | 32.9804 | | fbnetv3_b | 128 | 32.3981 | | adv_inception_v3 | 128 | 31.3202 | | gluon_inception_v3 | 128 | 30.7326 | | inception_v3 | 128 | 30.5649 | | tf_mixnet_l | 64 | 30.4171 | | gluon_xception65 | 32 | 30.2167 | | ghostnet_100 | 128 | 29.5224 | | dla102 | 64 | 29.5182 | | mixnet_l | 64 | 29.4533 | | beit_base_patch16_224 | 64 | 29.1199 | | dm_nfnet_f0 | 128 | 25.1402 | | 
swsl_resnext101_32x16d | 32 | 24.8712 | | rexnet_100 | 128 | 24.167 | | res2next50 | 2 | 24.1215 | | tinynet_a | 128 | 23.0467 | | vit_base_patch16_224 | 64 | 22.4942 | | tf_efficientnet_b0 | 128 | 21.1541 | | cspdarknet53 | 64 | 21.0723 | | nfnet_l0 | 64 | 21.023 | | fbnetc_100 | 128 | 19.5932 | | spnasnet_100 | 128 | 19.2389 | | convmixer_768_32 | 32 | 18.1042 | | mobilenetv3_large_100 | 128 | 17.3766 | | regnety_002 | 128 | 16.596 | | mobilenetv2_100 | 128 | 16.3604 | | repvgg_a2 | 128 | 16.16 | | gernet_l | 128 | 16.0605 | | mnasnet_100 | 128 | 15.9917 | | selecsls42b | 128 | 14.7163 | | ese_vovnet19b_dw | 128 | 11.7094 | | lcnet_050 | 128 | 11.0843 | +---------------------------------+-----+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+------------------------+ | name | bs | inductor_no_cudagraphs | +---------------------------------+-----+------------------------+ | gmixer_24_224 | 64 | 1.4405 | | tinynet_a | 128 | 1.3692 | | pnasnet5large | 16 | 1.3282 | | nfnet_l0 | 64 | 1.3209 | | rexnet_100 | 128 | 1.2765 | | convit_base | 32 | 1.2244 | | eca_botnext26ts_256 | 64 | 1.2041 | | eca_halonext26ts | 64 | 1.2034 | | tf_efficientnet_b0 | 128 | 1.199 | | mobilevit_s | 32 | 1.1987 | | mobilenetv2_100 | 128 | 1.1104 | | cait_m36_384 | 2 | 1.0986 | | ghostnet_100 | 128 | 1.0963 | | tf_mixnet_l | 64 | 1.0815 | | poolformer_m36 | 64 | 1.069 | | dla102 | 64 | 1.0544 | | dm_nfnet_f0 | 128 | 1.0495 | | selecsls42b | 128 | 1.0324 | | mixnet_l | 64 | 1.0059 | | xcit_large_24_p8_224 | 5 | 1.0039 | | resnest101e | 32 | 1.002 | | ese_vovnet19b_dw | 128 | 0.9967 | | vit_base_patch16_224 | 64 | 0.9873 | | swin_base_patch4_window7_224 | 64 | 0.9871 | | pit_b_224 | 64 | 0.9866 | | tnt_s_patch16_224 | 64 | 0.986 | | convmixer_768_32 | 32 | 0.9853 | | mixer_b16_224 | 64 | 0.9851 | | coat_lite_mini | 128 | 0.9838 | | deit_base_distilled_patch16_224 | 64 | 0.9831 | | beit_base_patch16_224 | 64 | 0.982 | | fbnetv3_b | 128 | 
0.977 | | jx_nest_base | 32 | 0.9714 | | sebotnet33ts_256 | 64 | 0.9712 | | hrnet_w18 | 2 | 0.9689 | | twins_pcpvt_base | 32 | 0.9634 | | dpn107 | 32 | 0.9562 | | res2net101_26w_4s | 64 | 0.9547 | | visformer_small | 128 | 0.951 | | crossvit_9_240 | 64 | 0.944 | | gluon_xception65 | 32 | 0.9376 | | gmlp_s16_224 | 64 | 0.9324 | | res2net50_14w_8s | 2 | 0.9317 | | res2next50 | 2 | 0.9281 | | swsl_resnext101_32x16d | 32 | 0.9249 | | convnext_base | 32 | 0.9239 | | lcnet_050 | 128 | 0.923 | | volo_d1_224 | 64 | 0.9172 | | spnasnet_100 | 128 | 0.9157 | | mobilenetv3_large_100 | 128 | 0.9126 | | mnasnet_100 | 128 | 0.9077 | | adv_inception_v3 | 128 | 0.9073 | | gluon_inception_v3 | 128 | 0.9073 | | inception_v3 | 128 | 0.9073 | | regnety_002 | 128 | 0.8993 | | cspdarknet53 | 64 | 0.8875 | | botnet26t_256 | 128 | 0.8702 | | fbnetc_100 | 128 | 0.8498 | | resmlp_12_224 | 128 | 0.8253 | | gernet_l | 128 | 0.8234 | | repvgg_a2 | 128 | 0.8011 | +---------------------------------+-----+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/lPUreEG.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/WoDkZ4n.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/L1k7ZaK.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) The experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring performance, compilation latency and memory footprint reduction, we exclude the models that fail the accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 89%, 49/55 | 98%, 43/44  | 100%, 61/61 |
|       aot_eager        | 87%, 48/55 | 98%, 43/44  | 90%, 55/61  |
|     aot_cudagraphs     | 73%, 40/55 | 57%, 25/44  | 56%, 34/61  |
|      aot_nvfuser       | 58%, 32/55 |  2%, 1/44   | 82%, 50/61  |
|        inductor        | 87%, 48/55 | 93%, 41/44  | 97%, 59/61  |
| inductor_no_cudagraphs | 89%, 49/55 | 93%, 41/44  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
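
Each passrate cell above has the form `percent, passing/total`. A minimal sketch of how such a cell could be derived from a per-model accuracy column follows. The helper name `format_passrate` and the choice of which statuses count as passing are assumptions for illustration, not the dashboard's actual code.

```python
def format_passrate(statuses, passing=("pass",)):
    """Format a passrate cell such as '89%, 49/55'.

    `statuses` holds one accuracy status per model, e.g. "pass",
    "fail_to_run" or "fail_accuracy". Which statuses count as passing
    is an assumption of this sketch; adjust `passing` as needed.
    """
    total = len(statuses)
    if total == 0:
        return "n/a"
    n_pass = sum(1 for s in statuses if s in passing)
    percent = round(100 * n_pass / total)
    return f"{percent}%, {n_pass}/{total}"


# Hypothetical status column: 49 passing models out of 55.
example = ["pass"] * 49 + ["fail_to_run"] * 6
print(format_passrate(example))  # 89%, 49/55
```

Whether a status like `pass_due_to_skip` should count as passing can be expressed by adding it to `passing`.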

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.02x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.09x    |    1.14x    |    1.07x    |
|      aot_nvfuser       |   1.13x    |    1.12x    |    1.12x    |
|        inductor        |   1.49x    |    1.64x    |    1.34x    |
| inductor_no_cudagraphs |   1.23x    |    1.32x    |    1.24x    |
+------------------------+------------+-------------+-------------+
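
As a rough illustration of how a geometric mean speedup could be computed from the per-model speedup columns, the sketch below drops models recorded as `0.0` (the value the per-model tables show for models that did not run) before averaging. The function name `geomean_speedup` is hypothetical, and the input is only a handful of `inductor_no_cudagraphs` rows from the torchbench table in this comment, so the result is not expected to match the summary value.

```python
import statistics


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups over native PyTorch.

    Entries of 0.0 are treated as failed models and skipped, which is
    an assumption of this sketch.
    """
    valid = [s for s in speedups.values() if s > 0.0]
    if not valid:
        return float("nan")
    return statistics.geometric_mean(valid)


# A small subset of rows, for illustration only.
subset = {
    "BERT_pytorch": 1.9497,
    "hf_Albert": 1.6536,
    "demucs": 1.0003,
    "dlrm": 0.0,           # skipped
    "hf_Longformer": 0.0,  # skipped
}
print(f"{geomean_speedup(subset):.2f}x")
```

The geometric mean is used rather than the arithmetic mean because speedups are ratios: a 2x gain on one model and a 0.5x slowdown on another then average to 1.0x.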

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.73    |    2.18     |    1.87     |
|       aot_eager        |    6.15    |    9.08     |    8.21     |
|     aot_cudagraphs     |    6.35    |    11.31    |    16.66    |
|      aot_nvfuser       |   20.10    |    9.46     |    48.56    |
|        inductor        |   58.49    |    50.41    |    80.71    |
| inductor_no_cudagraphs |   25.61    |    23.48    |    27.66    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    0.98x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.32x    |
|      aot_nvfuser       |   0.83x    |    1.08x    |    0.84x    |
|        inductor        |   0.84x    |    0.77x    |    0.95x    |
| inductor_no_cudagraphs |   0.98x    |    0.95x    |    1.03x    |
+------------------------+------------+-------------+-------------+
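
The summary does not spell out how the compression ratio is computed. One reading consistent with "higher is better" is the peak memory of native PyTorch divided by the peak memory of the compiled backend, so a value above 1.0x means the backend uses less memory. The sketch below assumes that definition; the function name and the byte counts are hypothetical.

```python
import math


def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Assumed definition: eager peak memory / compiled peak memory.

    Returns nan when the compiled run produced no usable measurement,
    mirroring the nan entries in the per-model tables.
    """
    if math.isnan(compiled_peak_bytes) or compiled_peak_bytes <= 0:
        return float("nan")
    return eager_peak_bytes / compiled_peak_bytes


# Hypothetical measurements: 10 GB in eager mode, 8 GB when compiled.
ratio = compression_ratio(10e9, 8e9)
print(f"{ratio:.2f}x")  # 1.25x
```

Under this reading, the 0.39x entry for aot_cudagraphs on torchbench would mean roughly 2.5 times the peak memory of native PyTorch.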

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | BERT_pytorch | 16 | 1.0194 | 0.879 | 0.0 | 0.0 | 1.9623 | 1.9497 | | hf_Albert | 8 | 1.0014 | 0.9975 | 0.7449 | 0.0 | 1.6638 | 1.6536 | | hf_T5_large | 2 | 1.0247 | 0.898 | 0.0 | 0.0 | 1.622 | 1.5824 | | hf_T5 | 8 | 1.0013 | 0.9935 | 0.0 | 0.0 | 0.0 | 1.5529 | | timm_efficientdet | 1 | 0.9857 | 0.8892 | 0.0 | 0.0 | 4.3437 | 1.5417 | | speech_transformer | 32 | 1.0172 | 0.8978 | 0.0 | 0.0 | 1.5546 | 1.5389 | | timm_resnest | 32 | 0.9993 | 1.0018 | 0.8046 | 1.1825 | 1.522 | 1.4524 | | hf_GPT2 | 4 | 1.0053 | 0.9782 | 0.7229 | 0.0 | 1.4991 | 1.4387 | | timm_nfnet | 128 | 0.9996 | 0.9997 | 0.0 | 1.2121 | 1.4686 | 1.422 | | timm_vision_transformer | 8 | 1.0039 | 0.9315 | 1.5194 | 1.347 | 2.6183 | 1.4119 | | mobilenet_v2 | 96 | 1.0 | 1.0001 | 0.7309 | 1.0396 | 1.4291 | 1.4057 | | hf_GPT2_large | 4 | 1.0003 | 0.9804 | 0.0 | 0.0 | 0.0 | 1.3823 | | mobilenet_v3_large | 32 | 1.0038 | 1.1065 | 1.0272 | 1.3768 | 1.9995 | 1.3618 | | shufflenet_v2_x1_0 | 128 | 1.0008 | 1.0621 | 0.807 | 1.1947 | 1.5531 | 1.3466 | | fastNLP_Bert | 6 | 0.9989 | 0.9768 | 0.7537 | 0.0 | 1.3708 | 1.3464 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9976 | 0.9401 | 1.2727 | 1.1856 | 1.8012 | 1.3174 | | densenet121 | 4 | 1.0008 | 1.0079 | 2.3471 | 1.4375 | 5.2124 | 1.3048 | | mnasnet1_0 | 32 | 1.002 | 1.0959 | 0.8595 | 1.2986 | 1.4694 | 1.2804 | | squeezenet1_1 | 32 | 1.001 | 0.9982 | 1.0557 | 1.1602 | 1.8634 | 1.2722 | | functorch_dp_cifar10 | 64 | 1.0016 | 0.9726 | 2.1894 | 1.1956 | 3.7277 | 1.2509 | | resnet18 | 16 | 1.0067 | 1.0992 | 1.1606 | 1.3949 | 1.9712 | 1.2497 | | LearningToPaint | 96 | 1.0031 | 1.0649 | 0.8611 | 
1.2372 | 1.2607 | 1.2119 | | timm_efficientnet | 32 | 0.9576 | 0.813 | 0.6983 | 1.0842 | 1.3303 | 1.2046 | | resnext50_32x4d | 8 | 1.0011 | 1.0877 | 1.2336 | 1.3743 | 2.1691 | 1.1915 | | pytorch_unet | 1 | 0.9999 | 0.9981 | 0.8445 | 1.0754 | 1.2019 | 1.1862 | | hf_Bart | 4 | 1.0134 | 0.9734 | 0.7282 | 0.0 | 1.1777 | 1.1847 | | hf_Bert | 4 | 1.0249 | 0.9955 | 0.7295 | 0.0 | 1.2511 | 1.1761 | | resnet50 | 32 | 0.9993 | 0.9926 | 0.758 | 1.1617 | 1.2047 | 1.1684 | | vgg16 | 64 | 0.9999 | 0.9991 | 0.8593 | 0.997 | 1.1746 | 1.168 | | Super_SloMo | 6 | 1.0003 | 0.9977 | 0.8666 | 0.0 | 1.1792 | 1.1641 | | alexnet | 128 | 0.9993 | 0.9984 | 0.8022 | 1.0004 | 1.1621 | 1.1636 | | hf_DistilBert | 8 | 1.0007 | 0.9545 | 0.6706 | 0.0 | 1.1563 | 1.1574 | | pytorch_struct | 200 | 0.9892 | 0.7368 | 0.8781 | 0.8824 | 1.8205 | 1.1475 | | hf_Reformer | 4 | 0.9968 | 0.0 | 0.9269 | 0.0 | 1.11 | 1.1335 | | Background_Matting | 4 | 1.0003 | 1.0217 | 0.8652 | 1.0811 | 1.1144 | 1.1065 | | timm_regnet | 32 | 0.9652 | 0.9632 | 0.7803 | 1.0936 | 1.1265 | 1.093 | | pytorch_stargan | 16 | 0.9989 | 0.9838 | 0.866 | 0.9884 | 1.1214 | 1.091 | | dcgan | 32 | 0.9877 | 1.0056 | 1.279 | 1.155 | 1.6411 | 1.0791 | | drq | 1 | 1.0139 | 0.8405 | 1.719 | 1.0412 | 2.4854 | 1.0706 | | yolov3 | 16 | 1.0001 | 0.9952 | 0.7899 | 1.1839 | 1.079 | 1.0643 | | timm_vision_transformer_large | 8 | 1.0001 | 0.99 | 0.0 | 0.9783 | 1.0461 | 1.0377 | | attention_is_all_you_need_pytorch | 256 | 1.0001 | 0.9731 | 0.0 | 0.0 | 1.0437 | 1.032 | | tts_angular | 64 | 0.9883 | 0.9665 | 0.9902 | 0.9945 | 1.0042 | 1.0127 | | hf_BigBird | 2 | 0.9921 | 0.947 | 0.957 | 0.0 | 1.0979 | 1.0012 | | demucs | 4 | 1.0001 | 1.0001 | 0.9991 | 0.9999 | 0.9995 | 1.0003 | | timm_vovnet | 32 | 0.9078 | 0.9041 | 0.7121 | 0.9791 | 0.9903 | 0.9985 | | soft_actor_critic | 256 | 0.9977 | 0.754 | 1.0691 | 0.9931 | 1.452 | 0.974 | | nvidia_deeprecommender | 256 | 0.9989 | 0.9628 | 0.5844 | 0.9429 | 0.9043 | 0.9643 | | lennard_jones | 1000 | 0.9648 | 0.8214 | 
1.038 | 1.0215 | 1.8383 | 0.9574 | | dlrm | 2048 | 0.0 | 0.0 | 0.0 | 0.0 | 0.9587 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilenet_v2_quantized_qat | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | resnet50_quantized_qat | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass 
| pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass 
| | yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | mobilenet_v2_quantized_qat | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | yolov3 | 16 | 2.6946 | 8.0738 | 11.2944 | 43.3593 | 421.2781 | 390.1045 | | hf_T5_large | 2 | 12.8459 | 41.0039 | nan | nan | 208.1634 | 98.7696 | | timm_efficientdet | 1 | 19.5214 | 
36.4621 | nan | nan | 472.193 | 78.5717 | | hf_GPT2_large | 4 | 4.7965 | 18.7494 | nan | nan | nan | 60.3479 | | densenet121 | 4 | 1.863 | 12.3631 | 18.7436 | 88.0004 | 43.8402 | 42.6252 | | timm_vision_transformer_large | 8 | 2.1831 | 13.7783 | nan | 24.2048 | 109.1333 | 37.7403 | | timm_nfnet | 128 | 1.7778 | 7.224 | nan | 29.3511 | 29.2828 | 25.6315 | | BERT_pytorch | 16 | 1.408 | 7.3531 | nan | nan | 93.2451 | 25.0133 | | speech_transformer | 32 | 1.5601 | 8.0467 | nan | nan | 155.8826 | 24.9263 | | hf_BigBird | 2 | 7.1877 | 13.3445 | 28.8858 | nan | 40.5724 | 24.4433 | | hf_Bart | 4 | 1.3634 | 7.7226 | 11.7663 | nan | 49.8071 | 23.1735 | | hf_T5 | 8 | 1.9849 | 8.891 | nan | nan | nan | 22.038 | | pytorch_struct | 200 | 0.2299 | 0.786 | 1.3498 | 4.0654 | 77.2184 | 21.6208 | | fastNLP_Bert | 6 | 1.4528 | 6.6596 | 10.1622 | nan | 65.9835 | 21.0276 | | attention_is_all_you_need_pytorch | 256 | 1.0727 | 7.0829 | nan | nan | 138.0688 | 20.7995 | | timm_regnet | 32 | 2.1438 | 7.87 | 20.7961 | 46.7355 | 20.7534 | 19.5922 | | hf_Bert | 4 | 1.3338 | 6.111 | 8.8726 | nan | 30.6003 | 18.6261 | | timm_efficientnet | 32 | 1.597 | 6.3786 | 15.6253 | 52.3706 | 19.6752 | 18.5284 | | shufflenet_v2_x1_0 | 128 | 0.8135 | 4.952 | 7.0916 | 26.2354 | 17.7916 | 17.8256 | | hf_GPT2 | 4 | 1.2638 | 5.9182 | 9.0291 | nan | 63.8222 | 17.4366 | | mobilenet_v3_large | 32 | 0.7653 | 4.4271 | 6.2051 | 52.4655 | 29.959 | 16.0336 | | Super_SloMo | 6 | 0.9414 | 4.7808 | 6.4957 | nan | 17.1196 | 16.0166 | | Background_Matting | 4 | 0.6134 | 4.1258 | 6.477 | 28.9642 | 16.4429 | 15.4289 | | mobilenet_v2 | 96 | 0.7034 | 4.1268 | 6.2732 | 36.5168 | 16.279 | 15.3145 | | mnasnet1_0 | 32 | 0.7014 | 4.1117 | 5.8817 | 30.2405 | 29.9614 | 14.7027 | | timm_vovnet | 32 | 1.4177 | 4.3209 | 10.1196 | 23.2123 | 15.4898 | 14.3583 | | resnext50_32x4d | 8 | 0.7751 | 4.5259 | 6.3593 | 28.1888 | 26.9254 | 14.1106 | | hf_Albert | 8 | 0.9434 | 5.5883 | 8.2863 | nan | 41.154 | 13.9319 | | resnet50 | 32 | 0.7658 | 
4.5668 | 6.5562 | 31.6857 | 15.0163 | 13.8694 | | hf_Reformer | 4 | 2.3633 | nan | 9.2796 | nan | 35.777 | 13.2993 | | timm_vision_transformer | 8 | 0.7335 | 4.2715 | 5.6705 | 9.1399 | 143.3979 | 11.9762 | | timm_resnest | 32 | 0.4944 | 2.5086 | 3.5361 | 35.1848 | 134.3393 | 10.357 | | hf_DistilBert | 8 | 0.451 | 2.8942 | 5.7847 | nan | 18.879 | 9.3411 | | functorch_dp_cifar10 | 64 | 0.3339 | 1.8771 | 2.7056 | 5.4024 | 25.1969 | 8.6672 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.3516 | 2.1508 | 2.893 | 3.7342 | 8.3594 | 7.7951 | | pytorch_unet | 1 | 0.3872 | 1.8887 | 2.8571 | 19.446 | 8.3426 | 7.6875 | | resnet18 | 16 | 0.367 | 1.7235 | 2.4475 | 17.4241 | 22.5813 | 7.0437 | | LearningToPaint | 96 | 0.3862 | 1.7658 | 2.5783 | 24.1451 | 6.9601 | 6.8803 | | pytorch_stargan | 16 | 0.3661 | 2.2699 | 3.0764 | 3.7532 | 105.5884 | 6.3037 | | squeezenet1_1 | 32 | 0.2253 | 0.9562 | 1.4211 | 4.543 | 4.0946 | 3.705 | | drq | 1 | 0.1383 | 0.447 | 0.7869 | 3.4175 | 3.7932 | 3.2698 | | vgg16 | 64 | 0.1761 | 0.6425 | 1.0353 | 2.4605 | 3.6554 | 3.0491 | | soft_actor_critic | 256 | 0.1967 | 0.3309 | 0.5678 | 1.5231 | 3.3987 | 2.6718 | | alexnet | 128 | 0.1492 | 0.4065 | 0.6674 | 2.3697 | 3.0556 | 2.5189 | | dcgan | 32 | 0.1616 | 0.4168 | 0.62 | 3.7183 | 2.6623 | 2.376 | | nvidia_deeprecommender | 256 | 0.1959 | 0.4069 | 0.6608 | 2.3903 | 4.1141 | 2.0136 | | lennard_jones | 1000 | 0.1369 | 0.2851 | 0.4879 | 1.055 | 2.0382 | 1.6666 | | tts_angular | 64 | 0.2064 | 0.2619 | 0.3892 | 0.9866 | 2.0799 | 1.5386 | | demucs | 4 | 0.2932 | 0.2986 | 0.3119 | 0.2959 | 0.2013 | 0.2004 | | dlrm | 2048 | nan | nan | nan | nan | 3.5672 | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | mobilenet_v2_quantized_qat | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | resnet50_quantized_qat | 0 | nan | nan | nan | nan | nan | nan | | tacotron2 | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | timm_efficientdet | 1 | 1.0111 | 0.823 | nan | nan | 1.1165 | 1.4096 | | timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2635 | 0.7837 | 1.3106 | 1.3377 | | hf_Albert | 8 | 0.9333 | 0.9333 | 0.2822 | nan | 0.8804 | 1.2559 | | Super_SloMo | 6 | 1.0024 | 0.9527 | 0.363 | nan | 1.1857 | 1.1914 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | nan | 0.8343 | 1.1671 | | hf_T5 | 8 | 0.9527 | 0.9445 | nan | nan | nan | 1.1507 | | hf_Bart | 4 | 0.9617 | 0.879 | 0.3244 | nan | 0.853 | 1.1395 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3372 | 0.9742 | 1.0823 | 1.1267 | | hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4302 | nan | 0.8205 | 1.1123 | | mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.7638 | 1.1005 | 1.1105 | | hf_GPT2 | 4 | 0.9548 | 0.887 | 0.353 | nan | 0.9505 | 1.1071 | | BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 | 1.1056 | | hf_GPT2_large | 4 | 0.936 | 0.8768 | nan | nan | nan | 1.0941 | | timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.9478 | 1.0219 | 1.0495 | | speech_transformer | 32 | 0.9982 | 0.9159 | nan | nan | 0.8959 | 1.0442 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.35 | 0.8662 | 0.9791 | 1.0072 | | soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 | 0.9991 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 1.0947 | 0.5646 | 0.9989 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3305 | 0.8104 | 0.712 | 0.9952 | | pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | 1.0085 | 0.9023 | 0.9928 | 
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | 0.801 | 0.8284 | 0.9907 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.299 | 0.9878 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.9692 | | hf_Bert | 4 | 0.9683 | 0.8952 | 0.3395 | nan | 0.8564 | 0.9684 | | timm_resnest | 32 | 0.9935 | 0.88 | 0.3236 | 0.8024 | 0.8974 | 0.9679 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9149 | 0.3919 | 0.9141 | 0.8848 | 0.9646 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | nan | 0.816 | 0.9429 | | timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9284 | 0.9323 | | densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8551 | 0.857 | 0.9307 | | resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8203 | 0.8303 | 0.9303 | | yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8814 | 0.9231 | 0.9271 | | hf_T5_large | 2 | 0.922 | 0.8722 | nan | nan | 0.8737 | 0.922 | | hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.2989 | nan | 0.7841 | 0.9208 | | Background_Matting | 4 | 0.9998 | 0.9492 | 0.3595 | 0.9749 | 0.9139 | 0.9164 | | mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8681 | 0.8829 | 0.9148 | | mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8263 | 0.8531 | 0.9097 | | resnet50 | 32 | 0.9942 | 0.8719 | 0.3367 | 0.797 | 0.8565 | 0.8913 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8772 | 0.7632 | 0.8778 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4456 | 0.8227 | 0.4056 | 0.871 | | pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8496 | 0.859 | 0.8608 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3591 | 0.6971 | 0.6902 | 0.8401 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7455 | 0.743 | 0.8332 | | timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7741 | 0.8251 | 0.8316 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.7903 | | LearningToPaint | 96 | 0.9442 | 0.6896 | 0.3385 | 
0.6515 | 0.6882 | 0.7462 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6639 | 0.6471 | 0.6497 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | dlrm | 2048 | nan | nan | nan | nan | 0.7035 | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | mobilenet_v2_quantized_qat | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | resnet50_quantized_qat | 0 | nan | nan | nan | nan | nan | nan | | tacotron2 | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ ~~~
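
The compilation-latency figures above are easier to compare when each backend is expressed relative to eager. The following is a minimal sketch, not part of the dashboard tooling: the helper names `parse_rows` and `latency_ratio` are hypothetical, the three rows are copied from the torchbench "Compilation latency (sec)" table, and `nan` is assumed to mean that no measurement was recorded for that backend.

```python
import math

# Three rows copied from the torchbench "Compilation latency (sec)" table.
TABLE = """\
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
| yolov3 | 16 | 2.6946 | 8.0738 | 11.2944 | 43.3593 | 421.2781 | 390.1045 |
| hf_T5_large | 2 | 12.8459 | 41.0039 | nan | nan | 208.1634 | 98.7696 |
| densenet121 | 4 | 1.863 | 12.3631 | 18.7436 | 88.0004 | 43.8402 | 42.6252 |
"""


def parse_rows(text):
    """Turn pipe-delimited rows into a list of dicts keyed by the header row."""
    # Border lines start with "+", so only lines starting with "|" are kept.
    lines = [ln for ln in text.splitlines() if ln.startswith("|")]
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    header, body = cells[0], cells[1:]
    return [dict(zip(header, row)) for row in body]


def latency_ratio(row, backend, baseline="eager"):
    """Return backend latency / baseline latency, or None if either is missing."""
    num, den = float(row[backend]), float(row[baseline])
    if math.isnan(num) or math.isnan(den) or den == 0:
        return None  # assumption: nan marks a run with no recorded latency
    return num / den


rows = parse_rows(TABLE)
for row in rows:
    ratio = latency_ratio(row, "inductor")
    print(row["name"], "n/a" if ratio is None else f"{ratio:.1f}x")
```

Returning `None` for missing values keeps failed runs out of any later averaging instead of silently treating them as zero.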

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| MT5ForConditionalGeneration | 2 | 1.0262 | 0.9254 | 0.0 | 0.0 | 4.6504 | 2.2192 |
| DistillGPT2 | 1 | 1.0344 | 0.944 | 1.1266 | 0.0 | 2.0795 | 1.8463 |
| OPTForCausalLM | 4 | 1.0129 | 0.8998 | 1.6026 | 0.0 | 2.7097 | 1.6882 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9784 | 0.0 | 0.0 | 1.6741 | 1.663 |
| RobertaForCausalLM | 4 | 1.0464 | 0.9449 | 1.4924 | 0.0 | 2.7268 | 1.6084 |
| MegatronBertForQuestionAnswering | 8 | 1.0444 | 0.9429 | 1.0235 | 0.0 | 1.6361 | 1.5772 |
| MegatronBertForCausalLM | 2 | 1.0421 | 0.9443 | 1.5697 | 0.0 | 3.0768 | 1.5678 |
| MobileBertForQuestionAnswering | 32 | 1.0187 | 0.9249 | 0.0 | 0.0 | 2.9281 | 1.5577 |
| PLBartForConditionalGeneration | 8 | 1.0116 | 0.9033 | 1.0868 | 0.0 | 1.794 | 1.5538 |
| ElectraForCausalLM | 1 | 1.0438 | 0.9492 | 2.0902 | 0.0 | 4.815 | 1.5503 |
| XGLMForCausalLM | 1 | 1.012 | 0.8772 | 0.0 | 0.0 | 2.5999 | 1.5381 |
| MobileBertForMaskedLM | 16 | 1.0256 | 0.9199 | 0.0 | 0.0 | 2.7678 | 1.5312 |
| CamemBert | 1 | 1.0543 | 0.956 | 1.5293 | 0.0 | 2.4364 | 1.507 |
| MBartForConditionalGeneration | 8 | 1.0158 | 0.922 | 0.9469 | 0.0 | 1.5609 | 1.4932 |
| M2M100ForConditionalGeneration | 2 | 1.0698 | 0.878 | 1.3876 | 0.0 | 2.6513 | 1.4921 |
| PegasusForConditionalGeneration | 4 | 1.0145 | 0.9036 | 1.2401 | 0.0 | 2.298 | 1.4806 |
| YituTechConvBert | 1 | 1.0247 | 0.9368 | 0.0 | 0.0 | 3.7787 | 1.4538 |
| T5ForConditionalGeneration | 4 | 1.0008 | 0.9687 | 0.0 | 0.0 | 1.4298 | 1.4341 |
| XLNetLMHeadModel | 4 | 0.9997 | 0.9662 | 0.0 | 0.0 | 1.4338 | 1.4235 |
| ElectraForQuestionAnswering | 64 | 1.0004 | 0.9856 | 0.0 | 0.0 | 1.3582 | 1.3443 |
| AlbertForQuestionAnswering | 2 | 1.0011 | 1.0024 | 0.0 | 0.0 | 1.2968 | 1.2884 |
| AlbertForMaskedLM | 2 | 0.9984 | 1.0013 | 0.0 | 0.0 | 1.2904 | 1.2863 |
| TrOCRForCausalLM | 8 | 1.0177 | 0.9459 | 0.8088 | 0.0 | 1.2555 | 1.2853 |
| Speech2Text2ForCausalLM | 64 | 1.0042 | 0.9432 | 0.7366 | 0.0 | 1.2531 | 1.2828 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9885 | 0.7382 | 0.0 | 1.2524 | 1.2394 |
| PegasusForCausalLM | 8 | 1.014 | 0.9212 | 0.8235 | 0.0 | 1.2405 | 1.2246 |
| GoogleFnet | 1 | 1.0018 | 0.817 | 0.9941 | 1.117 | 1.9193 | 1.2034 |
| BartForConditionalGeneration | 1 | 1.016 | 0.9945 | 0.0 | 0.0 | 1.2881 | 1.1963 |
| DistilBertForQuestionAnswering | 32 | 1.032 | 0.9885 | 0.722 | 0.0 | 1.192 | 1.1848 |
| DebertaForQuestionAnswering | 4 | 0.9263 | 0.7387 | 0.9323 | 0.0 | 1.2979 | 1.1844 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0135 | 0.9441 | 0.7582 | 0.0 | 1.1961 | 1.1774 |
| DistilBertForMaskedLM | 16 | 1.0236 | 0.9824 | 0.7573 | 0.0 | 1.1816 | 1.1765 |
| LayoutLMForMaskedLM | 16 | 1.0001 | 0.9693 | 0.0 | 0.0 | 1.1699 | 1.1763 |
| T5Small | 1 | 1.0266 | 0.9533 | 0.0 | 0.0 | 1.2799 | 1.1665 |
| PLBartForCausalLM | 16 | 1.0127 | 0.9512 | 0.7907 | 0.0 | 1.1304 | 1.1497 |
| BartForCausalLM | 2 | 0.9992 | 0.9665 | 0.7293 | 0.0 | 1.1075 | 1.1121 |
| RobertaForQuestionAnswering | 64 | 1.0001 | 0.9838 | 0.7409 | 0.0 | 1.089 | 1.0798 |
| BertForQuestionAnswering | 64 | 1.0004 | 0.9719 | 0.736 | 0.0 | 1.0932 | 1.0748 |
| DebertaForMaskedLM | 4 | 0.9354 | 0.8021 | 0.7343 | 0.0 | 1.0867 | 1.0674 |
| MBartForCausalLM | 16 | 1.0087 | 0.9641 | 0.7246 | 0.0 | 1.0582 | 1.064 |
| BertForMaskedLM | 64 | 1.0 | 0.9636 | 0.7181 | 0.0 | 1.0357 | 1.0402 |
| BlenderbotSmallForCausalLM | 64 | 1.001 | 0.9101 | 0.6521 | 0.0 | 1.0062 | 1.0381 |
| BigBird | 1 | 0.9927 | 0.9363 | 1.0063 | 0.0 | 1.1008 | 1.0055 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
| GoogleFnet | 1 | pass | pass | pass | pass | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_accuracy | fail_to_run | pass | pass |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| XLNetLMHeadModel | 4 | 3.8563 | 20.1291 | nan | nan | 154.9396 | 55.0385 |
| MobileBertForMaskedLM | 16 | 7.9926 | 27.645 | nan | nan | 95.1586 | 53.2011 |
| MobileBertForQuestionAnswering | 32 | 7.778 | 27.6918 | nan | nan | 80.211 | 53.144 |
| M2M100ForConditionalGeneration | 2 | 2.6347 | 15.4058 | 25.0478 | nan | 88.6459 | 41.5232 |
| MBartForConditionalGeneration | 8 | 2.842 | 14.9887 | 24.0733 | nan | 57.7763 | 38.9562 |
| PegasusForConditionalGeneration | 4 | 2.7425 | 14.5211 | 23.2205 | nan | 74.3495 | 38.7677 |
| MegatronBertForCausalLM | 2 | 3.3214 | 12.9395 | 18.8926 | nan | 66.4161 | 37.599 |
| MegatronBertForQuestionAnswering | 8 | 3.1717 | 12.6607 | 18.6026 | nan | 55.4472 | 36.7751 |
| XGLMForCausalLM | 1 | 2.2243 | 11.9095 | nan | nan | 62.0082 | 36.4739 |
| BartForConditionalGeneration | 1 | 2.8137 | 14.8852 | nan | nan | 46.9336 | 36.3428 |
| MT5ForConditionalGeneration | 2 | 3.2756 | 13.4151 | nan | nan | 104.5949 | 32.0014 |
| DebertaForMaskedLM | 4 | 4.7627 | 10.9781 | 44.9229 | nan | 120.725 | 31.1942 |
| DebertaForQuestionAnswering | 4 | 4.7076 | 11.2242 | 44.558 | nan | 94.8567 | 29.4266 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.7185 | 9.5893 | 14.6735 | nan | 50.7147 | 25.7547 |
| YituTechConvBert | 1 | 2.0965 | 9.6719 | nan | nan | 115.6858 | 25.4287 |
| T5ForConditionalGeneration | 4 | 1.9998 | 8.9302 | nan | nan | 58.5377 | 24.9269 |
| PLBartForConditionalGeneration | 8 | 1.3877 | 7.8161 | 11.2448 | nan | 57.5267 | 24.1229 |
| BigBird | 1 | 7.1363 | 13.4667 | 28.6926 | nan | 41.151 | 24.0747 |
| T5Small | 1 | 2.0038 | 9.0658 | nan | nan | 54.3517 | 20.5314 |
| RobertaForCausalLM | 4 | 1.4215 | 6.2233 | 8.733 | nan | 59.7399 | 19.1079 |
| ElectraForQuestionAnswering | 64 | 1.3196 | 6.2054 | nan | nan | 32.0659 | 18.9681 |
| LayoutLMForMaskedLM | 16 | 1.4595 | 6.4635 | nan | nan | 32.0362 | 18.9255 |
| BertForMaskedLM | 64 | 1.3182 | 6.1585 | 9.3256 | nan | 32.4413 | 18.8293 |
| ElectraForCausalLM | 1 | 1.4112 | 6.3391 | 8.6323 | nan | 21.4781 | 18.4255 |
| CamemBert | 1 | 1.3998 | 6.083 | 8.8757 | nan | 21.526 | 17.9811 |
| LayoutLMForSequenceClassification | 16 | 1.4734 | 6.4139 | 9.7085 | nan | 49.8072 | 17.9259 |
| GPT2ForSequenceClassification | 4 | 1.2966 | 5.902 | nan | nan | 31.4584 | 17.5774 |
| RobertaForQuestionAnswering | 64 | 1.361 | 6.2194 | 8.907 | nan | 19.7272 | 16.9387 |
| BertForQuestionAnswering | 64 | 1.3724 | 6.1077 | 9.0325 | nan | 19.9996 | 16.7182 |
| PegasusForCausalLM | 8 | 1.0348 | 5.5356 | 8.5779 | nan | 40.7253 | 16.453 |
| MBartForCausalLM | 16 | 0.9629 | 5.5287 | 8.2172 | nan | 29.3925 | 16.1161 |
| OPTForCausalLM | 4 | 1.0786 | 5.867 | 13.239 | nan | 35.6128 | 15.8746 |
| TrOCRForCausalLM | 8 | 1.0393 | 5.7252 | 7.8765 | nan | 21.9189 | 15.303 |
| BartForCausalLM | 2 | 1.0117 | 5.5461 | 8.2662 | nan | 28.6964 | 15.2956 |
| AlbertForQuestionAnswering | 2 | 1.2416 | 5.819 | nan | nan | 16.702 | 13.6368 |
| AlbertForMaskedLM | 2 | 1.1347 | 5.7649 | nan | nan | 28.2303 | 13.5234 |
| GoogleFnet | 1 | 0.7819 | 3.1961 | 10.0079 | 9.4636 | 23.6876 | 11.9908 |
| BlenderbotSmallForCausalLM | 64 | 0.6073 | 3.708 | 5.7262 | nan | 23.1824 | 11.4275 |
| Speech2Text2ForCausalLM | 64 | 0.5623 | 3.0034 | 4.5045 | nan | 24.8957 | 10.2719 |
| PLBartForCausalLM | 16 | 0.5053 | 2.9211 | 4.2612 | nan | 23.9939 | 9.922 |
| DistilBertForQuestionAnswering | 32 | 0.4751 | 2.9444 | 6.1878 | nan | 30.3674 | 9.8744 |
| DistilBertForMaskedLM | 16 | 0.4653 | 2.9409 | 5.7567 | nan | 20.4187 | 9.7413 |
| DistillGPT2 | 1 | 0.6851 | 2.9742 | 4.3883 | nan | 34.166 | 9.4494 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| DebertaForQuestionAnswering | 4 | 0.9845 | 1.0525 | 0.3309 | nan | 0.3569 | 1.13 |
| GoogleFnet | 1 | 0.9983 | 0.9453 | 0.3714 | 1.0813 | 0.7687 | 1.1065 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.8215 | 1.1049 |
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | nan | 1.0318 | 1.0912 |
| T5Small | 1 | 1.0 | 0.9325 | nan | nan | 0.8564 | 1.087 |
| BigBird | 1 | 0.999 | 0.9542 | 0.4213 | nan | 0.822 | 1.062 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3554 | nan | 0.4265 | 1.0346 |
| DistilBertForQuestionAnswering | 32 | 1.0 | 0.9046 | 0.3328 | nan | 0.8394 | 1.0048 |
| M2M100ForConditionalGeneration | 2 | 0.9977 | 0.9857 | 0.4249 | nan | 0.7197 | 1.0045 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | nan | 0.9339 | 1.004 |
| BertForQuestionAnswering | 64 | 1.0 | 0.9467 | 0.332 | nan | 0.9354 | 1.0032 |
| RobertaForQuestionAnswering | 64 | 1.0 | 0.9467 | 0.3319 | nan | 0.9354 | 1.0032 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | 0.9361 | 1.0025 |
| XGLMForCausalLM | 1 | 0.9974 | 0.9999 | nan | nan | 0.8528 | 0.9999 |
| DistillGPT2 | 1 | 0.9984 | 0.7704 | 0.3571 | nan | 0.8184 | 0.9933 |
| PegasusForConditionalGeneration | 4 | 0.9993 | 0.9002 | 0.3809 | nan | 0.7318 | 0.9895 |
| BartForConditionalGeneration | 1 | 1.0 | 0.8465 | nan | nan | 0.8244 | 0.9819 |
| XLNetLMHeadModel | 4 | 1.0001 | 0.8976 | nan | nan | 0.9717 | 0.9807 |
| YituTechConvBert | 1 | 0.9858 | 0.7923 | nan | nan | 0.8025 | 0.9784 |
| CamemBert | 1 | 0.998 | 0.7977 | 0.3504 | nan | 0.8088 | 0.9708 |
| AlbertForQuestionAnswering | 2 | 1.0 | 0.9369 | nan | nan | 0.6763 | 0.9674 |
| PegasusForCausalLM | 8 | 0.9778 | 0.9323 | 0.4075 | nan | 0.802 | 0.9625 |
| PLBartForConditionalGeneration | 8 | 1.0 | 0.8221 | 0.3314 | nan | 0.7548 | 0.9608 |
| AlbertForMaskedLM | 2 | 0.9999 | 0.9172 | nan | nan | 0.6633 | 0.9567 |
| TrOCRForCausalLM | 8 | 1.0 | 0.8048 | 0.3624 | nan | 0.7873 | 0.9427 |
| MBartForConditionalGeneration | 8 | 1.0 | 0.8136 | 0.342 | nan | 0.7949 | 0.9411 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | 0.888 | 0.9409 |
| BartForCausalLM | 2 | 1.0 | 0.8847 | 0.3484 | nan | 0.8389 | 0.9329 |
| MegatronBertForQuestionAnswering | 8 | 0.923 | 0.8265 | 0.3609 | nan | 0.7975 | 0.923 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3433 | nan | 0.8321 | 0.922 |
| MBartForCausalLM | 16 | 1.0 | 0.8629 | 0.352 | nan | 0.8181 | 0.9194 |
| DistilBertForMaskedLM | 16 | 0.9998 | 0.9138 | 0.3377 | nan | 0.8055 | 0.9137 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0 | 0.9036 | 0.3443 | nan | 0.7612 | 0.913 |
| OPTForCausalLM | 4 | 0.9979 | 0.7508 | 0.3322 | nan | 0.763 | 0.9125 |
| ElectraForCausalLM | 1 | 1.0 | 0.9107 | 0.3556 | nan | 0.6123 | 0.9107 |
| RobertaForCausalLM | 4 | 0.9058 | 0.7778 | 0.3513 | nan | 0.7882 | 0.9058 |
| PLBartForCausalLM | 16 | 1.0 | 0.8805 | 0.3568 | nan | 0.8028 | 0.9029 |
| Speech2Text2ForCausalLM | 64 | 0.9565 | 0.8462 | 0.3538 | nan | 0.7768 | 0.889 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3578 | nan | 0.7277 | 0.8452 |
| MobileBertForMaskedLM | 16 | 0.9997 | 0.9179 | nan | nan | 0.5861 | 0.8035 |
| MegatronBertForCausalLM | 2 | 0.7066 | 0.7066 | 0.3654 | nan | 0.7066 | 0.7066 |
| MT5ForConditionalGeneration | 2 | 0.6173 | 0.6173 | nan | nan | 0.6173 | 0.6173 |
| MobileBertForQuestionAnswering | 32 | 1.0 | 0.9716 | nan | nan | 0.4668 | 0.6097 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
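
A per-backend summary of the huggingface speedup table can be sketched as a geometric mean over the models that produced a timing. This is only an illustration under stated assumptions: the comment does not say how the dashboard aggregates results, the function name `geomean_speedup` is hypothetical, and a speedup of `0.0` is assumed to mean "no measurement" (for example a failed run) and is therefore skipped. The four rows are copied from the "Performance speedup" table above.

```python
from statistics import geometric_mean

# Four rows copied from the huggingface "Performance speedup" table.
TABLE = """\
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
| MT5ForConditionalGeneration | 2 | 1.0262 | 0.9254 | 0.0 | 0.0 | 4.6504 | 2.2192 |
| DistillGPT2 | 1 | 1.0344 | 0.944 | 1.1266 | 0.0 | 2.0795 | 1.8463 |
| BigBird | 1 | 0.9927 | 0.9363 | 1.0063 | 0.0 | 1.1008 | 1.0055 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"""


def geomean_speedup(text, backend):
    """Geometric mean of one backend column, ignoring 0.0 entries.

    Returns None when the backend has no non-zero entry at all.
    """
    lines = [ln for ln in text.splitlines() if ln.startswith("|")]
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    col = cells[0].index(backend)  # locate the column by its header name
    values = [float(row[col]) for row in cells[1:]]
    measured = [v for v in values if v > 0.0]  # assumption: 0.0 == not measured
    return geometric_mean(measured) if measured else None


for name in ("eager", "aot_cudagraphs", "aot_nvfuser", "inductor"):
    print(name, geomean_speedup(TABLE, name))
```

A geometric mean is used here because the values are ratios; an arithmetic mean would let a single large speedup dominate the summary. Skipping `0.0` entries also means the summary says nothing about coverage, so it is best read alongside the Accuracy table.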

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | ghostnet_100 | 128 | 0.9994 | 0.9955 | 0.8411 | 1.2485 | 1.7846 | 1.7371 | | lcnet_050 | 128 | 0.9569 | 0.9502 | 0.7668 | 1.4959 | 1.6591 | 1.6188 | | coat_lite_mini | 128 | 0.9999 | 0.9967 | 0.8443 | 1.0555 | 1.6093 | 1.6033 | | tnt_s_patch16_224 | 64 | 0.9999 | 0.9984 | 0.0 | 1.5989 | 1.5295 | 1.5135 | | dm_nfnet_f0 | 128 | 0.9994 | 1.0 | 0.0 | 1.2117 | 1.4719 | 1.4226 | | twins_pcpvt_base | 32 | 1.0074 | 0.9883 | 0.9602 | 1.3673 | 1.4916 | 1.4143 | | xcit_large_24_p8_224 | 5 | 1.0026 | 0.9913 | 0.0 | 0.0 | 1.4488 | 1.4108 | | crossvit_9_240 | 64 | 1.0061 | 1.0046 | 0.0 | 1.0509 | 1.4149 | 1.3907 | | volo_d1_224 | 64 | 1.0 | 0.9962 | 0.0 | 1.1313 | 1.3925 | 1.3679 | | mobilenetv2_100 | 128 | 0.9662 | 0.964 | 0.7057 | 1.0098 | 1.3132 | 1.3536 | | mobilenetv3_large_100 | 128 | 0.9665 | 0.963 | 0.7637 | 1.1669 | 1.3366 | 1.3474 | | regnety_002 | 128 | 0.9796 | 0.9859 | 0.8561 | 1.363 | 1.4971 | 1.3317 | | dla102 | 64 | 0.9998 | 0.9962 | 0.8017 | 1.2856 | 1.3449 | 1.3235 | | gmixer_24_224 | 64 | 1.0 | 0.8414 | 0.0 | 0.9868 | 1.3497 | 1.3184 | | resnest101e | 32 | 1.0041 | 1.0368 | 0.78 | 1.2022 | 1.3703 | 1.3125 | | nfnet_l0 | 64 | 0.9994 | 0.7985 | 0.6964 | 1.049 | 1.3728 | 1.3091 | | adv_inception_v3 | 128 | 0.9999 | 0.9974 | 0.0 | 1.1251 | 1.3265 | 1.3083 | | inception_v3 | 128 | 1.0 | 0.9989 | 0.0 | 1.1253 | 1.3272 | 1.3063 | | gluon_inception_v3 | 128 | 0.9999 | 0.9992 | 0.0 | 1.1249 | 1.327 | 1.3057 | | hrnet_w18 | 2 | 1.0089 | 1.1011 | 2.0045 | 1.482 | 4.8677 | 1.3036 | | fbnetv3_b | 128 | 0.965 | 0.9579 | 0.7605 | 1.1304 | 1.2831 | 1.2958 | | mnasnet_100 | 128 | 0.9671 | 0.9616 | 
0.7862 | 1.1562 | 1.2632 | 1.2813 | | sebotnet33ts_256 | 64 | 0.9765 | 0.8072 | 0.0 | 1.0534 | 1.2639 | 1.2697 | | tf_efficientnet_b0 | 128 | 0.9771 | 0.7837 | 0.0 | 0.9852 | 1.2599 | 1.2668 | | botnet26t_256 | 128 | 0.9859 | 0.9854 | 0.7842 | 1.2265 | 1.2641 | 1.2629 | | fbnetc_100 | 128 | 0.9665 | 0.9632 | 0.7903 | 1.1875 | 1.2504 | 1.2627 | | spnasnet_100 | 128 | 0.9622 | 0.9551 | 0.7732 | 1.132 | 1.2355 | 1.2529 | | res2net50_14w_8s | 2 | 1.0019 | 1.0164 | 2.0615 | 1.4393 | 5.4203 | 1.2519 | | jx_nest_base | 32 | 0.9999 | 0.9947 | 0.0 | 1.2103 | 1.2754 | 1.2519 | | cspdarknet53 | 64 | 0.9579 | 0.9539 | 0.7351 | 1.1851 | 1.2264 | 1.2348 | | res2next50 | 2 | 1.004 | 1.0434 | 2.2471 | 1.3813 | 4.6815 | 1.2326 | | selecsls42b | 128 | 1.0001 | 0.9967 | 0.8147 | 1.2083 | 1.2453 | 1.231 | | ese_vovnet19b_dw | 128 | 0.9795 | 0.9766 | 0.7421 | 1.145 | 1.2239 | 1.2267 | | rexnet_100 | 128 | 0.9731 | 0.8163 | 0.0 | 0.9831 | 1.2133 | 1.219 | | pit_b_224 | 64 | 1.0001 | 0.9987 | 0.0 | 1.0539 | 1.2269 | 1.2159 | | eca_botnext26ts_256 | 64 | 0.9745 | 0.7701 | 0.6216 | 1.0176 | 1.2397 | 1.2148 | | eca_halonext26ts | 64 | 0.9737 | 0.7751 | 0.6286 | 1.017 | 1.2316 | 1.2133 | | tinynet_a | 128 | 0.9663 | 0.7751 | 0.6201 | 0.9716 | 1.1908 | 1.1995 | | mobilevit_s | 32 | 0.9761 | 0.7692 | 0.5952 | 0.9701 | 1.1937 | 1.1987 | | pnasnet5large | 16 | 0.9996 | 0.9984 | 0.0 | 1.0835 | 1.2099 | 1.1943 | | dpn107 | 32 | 0.9588 | 0.9512 | 0.7798 | 1.0297 | 1.1798 | 1.1934 | | res2net101_26w_4s | 64 | 0.9998 | 0.9971 | 0.7704 | 1.1677 | 1.2288 | 1.1893 | | convit_base | 32 | 0.9996 | 0.9957 | 0.0 | 1.1925 | 1.2487 | 1.1868 | | repvgg_a2 | 128 | 0.9644 | 0.9625 | 0.8269 | 1.1224 | 1.1697 | 1.1663 | | cait_m36_384 | 2 | 0.9998 | 0.9945 | 0.0 | 1.0967 | 1.205 | 1.1567 | | convnext_base | 32 | 0.9999 | 0.9978 | 0.0 | 1.0457 | 1.1874 | 1.1521 | | poolformer_m36 | 64 | 1.0 | 0.999 | 0.0 | 0.0 | 1.1679 | 1.1469 | | tf_mixnet_l | 64 | 0.9719 | 0.8764 | 0.7242 | 1.0063 | 1.143 | 1.1456 | | 
swin_base_patch4_window7_224 | 64 | 1.0 | 0.9795 | 0.0 | 0.9928 | 1.1439 | 1.1346 | | mixnet_l | 64 | 0.9714 | 0.8725 | 0.7118 | 1.0065 | 1.1316 | 1.1289 | | beit_base_patch16_224 | 64 | 1.0 | 0.9821 | 0.0 | 0.9529 | 1.1189 | 1.1074 | | gmlp_s16_224 | 64 | 1.0 | 0.9921 | 0.0 | 0.9955 | 1.1041 | 1.09 | | deit_base_distilled_patch16_224 | 64 | 0.9996 | 0.9987 | 0.7703 | 1.0129 | 1.0985 | 1.0865 | | vit_base_patch16_224 | 64 | 1.0 | 0.9985 | 0.7705 | 0.9738 | 1.0899 | 1.0785 | | convmixer_768_32 | 32 | 0.9999 | 1.0001 | 0.0 | 1.0621 | 1.0772 | 1.0747 | | gluon_xception65 | 32 | 0.9999 | 0.9965 | 0.0 | 1.0408 | 1.0876 | 1.0725 | | swsl_resnext101_32x16d | 32 | 0.9998 | 1.0001 | 0.0 | 1.1075 | 1.1072 | 1.0717 | | gernet_l | 128 | 0.9747 | 0.9727 | 0.8215 | 1.0985 | 1.0764 | 1.0706 | | mixer_b16_224 | 64 | 1.0001 | 0.9983 | 0.7607 | 0.9786 | 1.0637 | 1.0319 | | visformer_small | 128 | 1.0001 | 1.0027 | 0.7976 | 1.0215 | 1.0485 | 1.0136 | | resmlp_12_224 | 128 | 1.0005 | 1.0001 | 0.6958 | 0.0 | 1.0159 | 0.9953 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | 
eca_halonext26ts | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilevit_s | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 
| 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | pass | | jx_nest_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_accuracy | pass | pass | | pit_b_224 | 2 | pass | fail_accuracy | fail_accuracy | fail_accuracy | pass | pass | | twins_pcpvt_base | 2 | pass | fail_accuracy | fail_accuracy | fail_accuracy | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | fbnetv3_b | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ 
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | hrnet_w18 | 2 | 5.5514 | 31.4987 | 52.5308 | 194.6411 | 99.5082 | 81.9802 | | pnasnet5large | 16 | 4.2069 | 21.6242 | nan | 122.4201 | 71.2949 | 67.2041 | | mobilevit_s | 32 | 1.6977 | 7.171 | 14.752 | 41.4298 | 341.3652 | 62.3666 | | xcit_large_24_p8_224 | 5 | 2.5437 | 16.6772 | nan | nan | 172.8777 | 57.0318 | | twins_pcpvt_base | 32 | 2.1953 | 13.2558 | 22.4707 | 45.1882 | 507.7757 | 54.2039 | | cait_m36_384 | 2 | 2.7258 | 17.8566 | nan | 45.7956 | 164.8171 | 50.2666 | | res2net101_26w_4s | 64 | 2.6877 | 16.0726 | 26.8113 | 80.9679 | 54.4775 | 49.5062 | | swin_base_patch4_window7_224 | 64 | 2.5935 | 12.3844 | nan | 58.733 | 177.294 | 45.0007 | | resnest101e | 32 | 2.8543 | 15.8323 | 25.573 | 74.2882 | 111.9496 | 44.8647 | | poolformer_m36 | 64 | 1.7189 | 9.1069 | nan | nan | 46.021 | 43.0803 | | res2net50_14w_8s | 2 | 2.5392 | 14.5878 | 23.6502 | 68.3297 | 45.8749 | 41.8925 | | convnext_base | 32 | 1.2758 | 6.1953 | nan | 20.9357 | 202.7457 | 38.1635 | | dpn107 | 32 | 3.9591 | 13.8539 | 45.5921 | 76.0043 | 40.3708 | 36.3608 | | jx_nest_base | 32 | 1.6763 | 9.356 | nan | 58.9394 | 135.9959 | 33.881 | | fbnetv3_b | 128 | 2.9517 | 10.5374 | 30.2312 | 77.0478 | 35.3708 | 31.9534 | | adv_inception_v3 | 128 | 1.4958 | 8.6965 | nan | 67.2085 | 34.0825 | 31.6169 | | gluon_inception_v3 | 128 | 1.5314 | 8.5064 | nan | 66.9439 | 33.61 | 30.8318 | | gluon_xception65 | 32 | 1.5286 | 10.1092 | nan | 40.8043 | 32.654 | 30.0885 | | inception_v3 | 128 | 1.4983 | 8.2858 | nan | 66.6613 | 33.5144 | 29.7959 | | tf_mixnet_l | 64 | 5.767 | 12.5234 | 27.2068 | 61.5578 | 33.8544 | 29.6373 | | ghostnet_100 | 128 | 2.4536 | 8.8736 
| 13.0483 | 58.6961 | 32.3006 | 29.3573 | | dla102 | 64 | 1.5262 | 9.7746 | 14.3492 | 61.3674 | 32.7156 | 29.0289 | | mixnet_l | 64 | 5.297 | 12.0806 | 27.5547 | 60.6602 | 31.97 | 28.5199 | | tnt_s_patch16_224 | 64 | 1.6033 | 10.3956 | nan | 23.6034 | 63.9747 | 28.1925 | | dm_nfnet_f0 | 128 | 1.9599 | 7.3416 | nan | 29.5113 | 28.1901 | 25.49 | | volo_d1_224 | 64 | 1.2527 | 7.5688 | nan | 28.5353 | 74.6738 | 25.3803 | | gmlp_s16_224 | 64 | 1.0091 | 6.239 | nan | 13.523 | 96.9474 | 25.2203 | | res2next50 | 2 | 1.5169 | 8.0974 | 12.088 | 41.4751 | 26.7041 | 24.3022 | | eca_halonext26ts | 64 | 1.3714 | 5.2827 | 11.2345 | 50.3344 | 262.43 | 24.1671 | | swsl_resnext101_32x16d | 32 | 1.6673 | 9.166 | nan | 38.5961 | 26.624 | 24.1502 | | rexnet_100 | 128 | 1.8333 | 7.0955 | nan | 101.6305 | 26.3318 | 23.9861 | | sebotnet33ts_256 | 64 | 1.6193 | 6.2319 | nan | 50.6361 | 132.0687 | 23.882 | | coat_lite_mini | 128 | 0.9143 | 5.3518 | 7.7361 | 14.5681 | 364.9581 | 23.8478 | | crossvit_9_240 | 64 | 1.3631 | 8.165 | nan | 26.4481 | 159.6489 | 23.2843 | | tinynet_a | 128 | 2.0295 | 7.5769 | 19.9529 | 60.7375 | 25.5819 | 22.6985 | | tf_efficientnet_b0 | 128 | 1.7698 | 6.7346 | nan | 60.6858 | 22.6338 | 20.872 | | cspdarknet53 | 64 | 2.2005 | 7.5986 | 19.8213 | 48.6604 | 22.7589 | 20.778 | | gmixer_24_224 | 64 | 1.2353 | 7.2435 | nan | 16.3341 | 71.6232 | 20.4959 | | eca_botnext26ts_256 | 64 | 1.3287 | 4.7948 | 10.7476 | 48.2999 | 290.2904 | 19.8298 | | fbnetc_100 | 128 | 1.9158 | 6.5018 | 18.4877 | 45.035 | 21.5277 | 19.3838 | | spnasnet_100 | 128 | 1.9538 | 6.5388 | 17.2195 | 44.0026 | 21.5531 | 18.809 | | nfnet_l0 | 64 | 1.6932 | 7.1748 | 10.8573 | 27.1068 | 24.2819 | 18.6725 | | botnet26t_256 | 128 | 1.3044 | 4.5699 | 10.9116 | 40.9092 | 105.3566 | 18.3653 | | convit_base | 32 | 0.9673 | 6.1172 | nan | 18.0735 | 84.0672 | 18.0342 | | mobilenetv3_large_100 | 128 | 1.5015 | 5.219 | 12.9894 | 64.1074 | 19.7471 | 17.3543 | | mobilenetv2_100 | 128 | 1.6053 | 5.0752 | 13.1196 | 
37.684 | 19.109 | 16.1893 | | mnasnet_100 | 128 | 1.5679 | 5.5386 | 13.3421 | 37.6931 | 18.4296 | 15.7992 | | gernet_l | 128 | 1.9688 | 6.1371 | 16.0066 | 35.7462 | 18.0897 | 15.7917 | | regnety_002 | 128 | 1.5589 | 5.5672 | 13.4735 | 47.1773 | 17.6702 | 15.7589 | | repvgg_a2 | 128 | 1.9613 | 5.9791 | 15.6637 | 43.6358 | 17.5441 | 15.7292 | | convmixer_768_32 | 32 | 1.1394 | 5.9151 | nan | 13.3328 | 19.8959 | 15.7286 | | beit_base_patch16_224 | 64 | 1.0517 | 5.1695 | nan | 13.9188 | 32.1445 | 15.4759 | | resmlp_12_224 | 128 | 0.5144 | 2.7602 | 5.4746 | nan | 38.6184 | 14.8683 | | visformer_small | 128 | 0.8595 | 4.1596 | 5.8685 | 23.9823 | 70.7521 | 14.6137 | | selecsls42b | 128 | 0.6354 | 3.8347 | 5.5048 | 39.064 | 16.4175 | 14.4533 | | pit_b_224 | 64 | 0.9357 | 4.9349 | nan | 12.4917 | 67.5584 | 14.3303 | | deit_base_distilled_patch16_224 | 64 | 0.7137 | 4.6555 | 6.4036 | 10.2768 | 36.0881 | 13.538 | | vit_base_patch16_224 | 64 | 0.7021 | 4.1004 | 6.3604 | 9.4905 | 25.0169 | 12.8291 | | mixer_b16_224 | 64 | 0.5101 | 3.0639 | 5.353 | 10.5805 | 41.5714 | 12.5959 | | ese_vovnet19b_dw | 128 | 0.9625 | 3.1605 | 7.5135 | 30.8903 | 13.0066 | 11.8343 | | lcnet_050 | 128 | 0.9262 | 3.1307 | 6.8576 | 30.9579 | 13.0753 | 11.0715 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | gmixer_24_224 | 64 | 0.9952 | 0.9645 | nan | 0.9825 | 1.3808 | 1.5001 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.7823 | 1.351 | 1.3692 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.2062 | 1.1774 | 1.3282 | | 
nfnet_l0 | 64 | 0.9948 | 0.8256 | 0.2664 | 0.813 | 1.2558 | 1.3209 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 0.8682 | 1.2619 | 1.2765 | | convit_base | 32 | 0.9977 | 0.8861 | nan | 0.9501 | 1.068 | 1.2569 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | 0.6831 | 0.841 | 1.2472 | | eca_botnext26ts_256 | 64 | 0.9938 | 0.7669 | 0.258 | 0.7642 | 1.1318 | 1.2041 | | eca_halonext26ts | 64 | 0.9938 | 0.768 | 0.2589 | 0.7694 | 1.1317 | 1.2034 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.8401 | 1.1889 | 1.199 | | mobilevit_s | 32 | 0.9959 | 0.7668 | 0.258 | 0.741 | 1.141 | 1.1989 | | cait_m36_384 | 2 | 0.9998 | 0.902 | nan | 0.9203 | 1.011 | 1.139 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.7635 | 1.1003 | 1.1104 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9345 | 1.0353 | 1.0963 | | tf_mixnet_l | 64 | 0.9956 | 0.8577 | 0.2851 | 0.8572 | 0.9695 | 1.0815 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.069 | | dla102 | 64 | 0.9841 | 0.9148 | 0.3339 | 0.9504 | 1.0492 | 1.0544 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.9479 | 1.0219 | 1.0495 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8954 | 0.9913 | 1.0324 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | nan | nan | 0.9124 | 1.0084 | | mixnet_l | 64 | 0.995 | 0.8449 | 0.2684 | 0.7907 | 0.8995 | 1.0059 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.8606 | 0.9272 | 1.0027 | | tnt_s_patch16_224 | 64 | 0.9963 | 0.9715 | nan | 0.8518 | 0.9131 | 1.0027 | | resnest101e | 32 | 0.9972 | 0.9434 | 0.3271 | 0.9425 | 0.9914 | 1.002 | | mixer_b16_224 | 64 | 0.9956 | 0.9574 | 0.3405 | 0.8644 | 0.9357 | 0.997 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9302 | 0.9886 | 0.9967 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 0.8229 | 0.915 | 0.9937 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 0.6417 | 0.792 | 0.9913 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | nan | 0.8169 | 0.9911 | | deit_base_distilled_patch16_224 | 64 | 
0.9964 | 0.9442 | 0.3138 | 0.8242 | 0.9095 | 0.9895 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | 0.83 | 0.8585 | 0.989 | | crossvit_9_240 | 64 | 0.9886 | 0.8633 | nan | 0.729 | 0.8063 | 0.9877 | | twins_pcpvt_base | 32 | 0.9971 | 0.9101 | 0.3178 | 0.8351 | 0.8722 | 0.9876 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | 0.9793 | 0.9836 | 0.9853 | | gmlp_s16_224 | 64 | 0.9958 | 0.9727 | nan | 0.966 | 0.9267 | 0.9838 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.7873 | 0.7899 | 0.9836 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.784 | 0.9696 | 0.977 | | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | 0.7112 | 0.8575 | 0.9712 | | hrnet_w18 | 2 | 0.9947 | 0.8779 | 0.4003 | 0.8833 | 0.6657 | 0.9689 | | convnext_base | 32 | 0.998 | 0.9059 | nan | 0.7678 | 0.8761 | 0.9606 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.8941 | 0.9056 | 0.9562 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9269 | 0.9548 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | 0.9475 | 0.9005 | 0.951 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8982 | 0.9351 | 0.9376 | | res2net50_14w_8s | 2 | 0.9976 | 0.837 | 0.3866 | 0.8458 | 0.8293 | 0.9317 | | res2next50 | 2 | 0.9972 | 0.8331 | 0.3813 | 0.841 | 0.82 | 0.9281 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | 0.9249 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7524 | 0.8921 | 0.923 | | volo_d1_224 | 64 | 0.996 | 0.9213 | nan | 0.7472 | 0.9124 | 0.9171 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9047 | 0.9157 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8745 | 0.9007 | 0.9126 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.8961 | 0.9077 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | regnety_002 | 128 | 0.9717 | 
0.8104 | 0.3283 | 0.7599 | 0.8617 | 0.8993 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8762 | 0.8835 | 0.8875 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | 0.745 | 0.8605 | 0.8702 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.8416 | 0.8498 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.8234 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6573 | 0.7684 | 0.8011 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/wruVVSV.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/KXCHw8L.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/pPyVlcI.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.
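As a rough illustration of the speedup normalization described in the summary, here is a minimal sketch. The function name `speedup`, the use of the median over repeated timings, and the sample numbers are assumptions made for this example; they are not taken from the benchmark harness.

```python
from statistics import median


def speedup(baseline_times, compiled_times):
    """Return how many times faster the compiled run is than the baseline.

    Both arguments are lists of wall-clock latencies in the same unit.
    Taking the median of repeated timings is an assumption of this sketch.
    """
    if not baseline_times or not compiled_times:
        raise ValueError("need at least one timing for each configuration")
    return median(baseline_times) / median(compiled_times)


# Hypothetical latencies in milliseconds, for illustration only.
eager_ms = [10.2, 10.0, 10.4]
compiled_ms = [5.1, 5.0, 5.3]
print(f"{speedup(eager_ms, compiled_ms):.2f}x")
```

A value above 1.0 means the backend is faster than native PyTorch, which matches how the speedup tables read.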

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
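The filtering step could look roughly like the sketch below. The status strings are the ones that appear in the accuracy tables; treating `pass_due_to_skip` as passing, and the helper names `models_passing` and `filter_metric`, are assumptions for illustration.

```python
# Statuses counted as passing. Including "pass_due_to_skip" is an assumption.
PASSING = {"pass", "pass_due_to_skip"}


def models_passing(accuracy, backend):
    """Return the set of model names whose accuracy status passes for `backend`."""
    return {
        name
        for name, per_backend in accuracy.items()
        if per_backend.get(backend) in PASSING
    }


def filter_metric(metric, accuracy, backend):
    """Keep only metric entries for models that pass accuracy on `backend`."""
    keep = models_passing(accuracy, backend)
    return {
        name: row[backend]
        for name, row in metric.items()
        if name in keep and backend in row
    }


# A few inductor entries from the torchbench amp tables.
accuracy = {
    "resnet50": {"inductor": "pass"},
    "mobilenet_v3_large": {"inductor": "fail_accuracy"},
    "dlrm": {"inductor": "fail_to_run"},
}
speedups = {
    "resnet50": {"inductor": 1.7778},
    "mobilenet_v3_large": {"inductor": 3.029},
    "dlrm": {"inductor": 0.0},
}
print(filter_metric(speedups, accuracy, "inductor"))
```

With this filter, a model such as mobilenet_v3_large does not contribute its 3.029x inductor speedup to the aggregate, because its output failed the accuracy check.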

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 94%, 50/53 | 98%, 42/43  | 100%, 61/61 |
|       aot_eager        | 94%, 50/53 | 98%, 42/43  | 90%, 55/61  |
|     aot_cudagraphs     | 74%, 39/53 | 53%, 23/43  | 75%, 46/61  |
|      aot_nvfuser       | 60%, 32/53 |  0%, 0/43   | 75%, 46/61  |
|        inductor        | 85%, 45/53 | 93%, 40/43  | 93%, 57/61  |
| inductor_no_cudagraphs | 87%, 46/53 | 93%, 40/43  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
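The `94%, 50/53` style cells above can be reproduced from a list of per-model statuses. This is a minimal sketch; the function name `pass_rate` and the choice to count `pass_due_to_skip` as passing are assumptions.

```python
def pass_rate(statuses, passing=("pass", "pass_due_to_skip")):
    """Format a pass rate as 'PCT%, passed/total' from per-model statuses."""
    total = len(statuses)
    passed = sum(1 for status in statuses if status in passing)
    if total == 0:
        return "0%, 0/0"
    # The ".0%" format multiplies by 100 and rounds to a whole percent.
    return f"{passed / total:.0%}, {passed}/{total}"


# 50 passing models and 3 that fail to run.
statuses = ["pass"] * 50 + ["fail_to_run"] * 3
print(pass_rate(statuses))
```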

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.01x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.19x    |    1.29x    |    1.06x    |
|      aot_nvfuser       |   1.16x    |    0.0x     |    1.20x    |
|        inductor        |   1.84x    |    2.30x    |    1.56x    |
| inductor_no_cudagraphs |   1.37x    |    1.64x    |    1.36x    |
+------------------------+------------+-------------+-------------+
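For reference, a geometric mean over per-model speedups can be computed as in the sketch below. Dropping non-positive entries (the per-model tables report failed runs as `0.0`) and returning `0.0` when nothing remains are assumptions chosen to mirror how the table reads, not a confirmed description of the dashboard script.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of positive speedups; failed runs (<= 0 or NaN) are skipped."""
    valid = [s for s in speedups if s > 0 and not math.isnan(s)]
    if not valid:
        # Assumption: report 0.0 when no model ran successfully.
        return 0.0
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Two successful runs and one failed run reported as 0.0.
print(f"{geomean_speedup([2.0, 8.0, 0.0]):.2f}x")
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown average to 1.0x, which an arithmetic mean would not give.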

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.86    |    2.54     |    2.09     |
|       aot_eager        |    7.63    |    12.68    |    10.35    |
|     aot_cudagraphs     |    7.76    |    16.14    |    19.75    |
|      aot_nvfuser       |   26.75    |     0.0     |    69.96    |
|        inductor        |   56.04    |    56.96    |    94.80    |
| inductor_no_cudagraphs |   28.04    |    29.28    |    32.82    |
+------------------------+------------+-------------+-------------+
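The per-model latency tables report `nan` where a backend did not run, so any mean has to skip those entries. A minimal sketch follows, assuming an arithmetic mean and a `0.0` result when no model ran (the helper name `mean_compile_time` is hypothetical):

```python
import math


def mean_compile_time(latencies):
    """Arithmetic mean of compilation latencies in seconds, ignoring NaN entries."""
    valid = [t for t in latencies if not math.isnan(t)]
    if not valid:
        # Assumption: report 0.0 when no model compiled on this backend.
        return 0.0
    return sum(valid) / len(valid)


# aot_cudagraphs latencies for alexnet, dlrm (nan), and dcgan from the torchbench amp table.
print(round(mean_compile_time([0.8834, float("nan"), 0.7625]), 4))
```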

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    0.98x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.87x    |    0.87x    |
|     aot_cudagraphs     |   0.42x    |    0.40x    |    0.33x    |
|      aot_nvfuser       |   0.83x    |    0.0x     |    0.85x    |
|        inductor        |   0.83x    |    0.86x    |    0.94x    |
| inductor_no_cudagraphs |   1.00x    |    1.05x    |    1.03x    |
+------------------------+------------+-------------+-------------+
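One way to read the compression ratio is baseline peak memory divided by the backend's peak memory, so values above 1.0 mean the backend uses less memory. That definition is an assumption consistent with "higher is better"; the sketch below uses hypothetical byte counts and a hypothetical helper name.

```python
def compression_ratio(baseline_peak_bytes, compiled_peak_bytes):
    """Baseline peak memory divided by compiled peak memory (assumed definition).

    In a real run the peak values might come from something like
    torch.cuda.max_memory_allocated(); here they are plain integers.
    """
    if compiled_peak_bytes <= 0:
        raise ValueError("compiled peak memory must be positive")
    return baseline_peak_bytes / compiled_peak_bytes


GIB = 1024 ** 3
# Hypothetical example: native run peaks at 8 GiB, compiled run at 10 GiB.
print(f"{compression_ratio(8 * GIB, 10 * GIB):.2f}x")
```

Under this reading, the 0.33x to 0.42x values for aot_cudagraphs would mean its peak footprint is roughly two and a half to three times that of native PyTorch.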

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | BERT_pytorch | 16 | 1.0084 | 0.8302 | 0.0 | 0.0 | 3.2325 | 2.4027 | | hf_Albert | 8 | 1.0011 | 0.9532 | 0.7722 | 0.0 | 2.3807 | 2.3221 | | hf_T5_large | 2 | 1.0191 | 0.8479 | 0.0 | 0.0 | 2.478 | 2.1536 | | hf_GPT2 | 4 | 1.0209 | 0.9839 | 0.8075 | 0.0 | 1.8627 | 1.847 | | hf_T5 | 8 | 0.9992 | 0.945 | 0.0 | 0.0 | 1.8351 | 1.8404 | | hf_Bert | 4 | 1.0396 | 0.8545 | 0.928 | 0.0 | 2.0098 | 1.7521 | | hf_GPT2_large | 4 | 1.0002 | 0.9903 | 0.0 | 0.0 | 0.0 | 1.7504 | | hf_Bart | 4 | 1.009 | 0.8248 | 0.0 | 0.0 | 1.8018 | 1.7338 | | speech_transformer | 32 | 1.002 | 0.8354 | 0.0 | 0.0 | 1.7114 | 1.6954 | | timm_resnest | 32 | 1.0045 | 1.0195 | 0.8318 | 1.3124 | 1.9303 | 1.6815 | | timm_vision_transformer | 8 | 1.01 | 0.8464 | 1.7343 | 1.3458 | 3.2299 | 1.562 | | timm_efficientdet | 1 | 0.9799 | 0.8051 | 0.0 | 0.0 | 4.7029 | 1.5402 | | mobilenet_v2 | 96 | 0.9994 | 0.9885 | 0.7643 | 0.9252 | 1.5653 | 1.5189 | | attention_is_all_you_need_pytorch | 256 | 1.0066 | 0.9055 | 0.0 | 0.0 | 1.5082 | 1.4711 | | hf_DistilBert | 8 | 1.0024 | 0.973 | 0.7311 | 0.0 | 1.4674 | 1.4463 | | shufflenet_v2_x1_0 | 128 | 1.0002 | 1.0086 | 0.9708 | 1.3407 | 1.6819 | 1.4435 | | mobilenet_v3_large | 32 | 1.0035 | 1.0114 | 1.6301 | 1.418 | 3.029 | 1.4347 | | timm_nfnet | 128 | 0.9991 | 1.0006 | 0.0 | 1.1738 | 1.501 | 1.4297 | | fastNLP_Bert | 6 | 0.9989 | 0.8938 | 0.7673 | 0.0 | 1.4755 | 1.4176 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9983 | 0.9198 | 1.4546 | 1.2123 | 2.1261 | 1.3981 | | functorch_dp_cifar10 | 64 | 1.0013 | 0.9154 | 2.3988 | 1.1866 | 4.9158 | 1.3896 | | mnasnet1_0 | 32 | 0.9979 | 1.0035 | 1.2502 | 
1.3985 | 2.6088 | 1.3719 | | resnet50 | 32 | 1.0026 | 1.006 | 1.0265 | 1.3597 | 1.7778 | 1.3676 | | densenet121 | 4 | 1.0025 | 0.9068 | 2.5239 | 1.3635 | 6.634 | 1.3397 | | pytorch_unet | 1 | 0.9995 | 0.9927 | 0.863 | 1.1545 | 1.3442 | 1.3154 | | LearningToPaint | 96 | 1.0019 | 0.9981 | 1.1493 | 1.359 | 1.8583 | 1.309 | | squeezenet1_1 | 32 | 1.003 | 0.9426 | 1.4482 | 1.1759 | 2.4444 | 1.29 | | timm_efficientnet | 32 | 0.9584 | 0.8099 | 1.0738 | 1.1802 | 2.1108 | 1.2825 | | resnext50_32x4d | 8 | 1.0019 | 0.9491 | 1.8428 | 1.3362 | 3.5086 | 1.269 | | vgg16 | 64 | 0.9996 | 0.9981 | 0.8582 | 0.9949 | 1.2709 | 1.2657 | | pytorch_stargan | 16 | 0.9944 | 1.0237 | 0.9666 | 1.0868 | 1.3488 | 1.2632 | | pytorch_struct | 200 | 0.9891 | 0.7371 | 1.0074 | 1.051 | 2.1098 | 1.2601 | | Super_SloMo | 6 | 1.0001 | 0.9954 | 0.887 | 0.0 | 1.2879 | 1.2577 | | resnet18 | 16 | 1.0014 | 0.9898 | 1.5783 | 1.3409 | 2.954 | 1.2486 | | timm_regnet | 32 | 0.9833 | 0.9338 | 0.8864 | 1.1838 | 1.2927 | 1.2336 | | alexnet | 128 | 0.9987 | 0.9981 | 0.8156 | 1.0033 | 1.2128 | 1.2096 | | Background_Matting | 4 | 0.9995 | 1.016 | 0.895 | 1.1126 | 1.2216 | 1.2055 | | drq | 1 | 1.0052 | 0.8044 | 1.6843 | 1.1378 | 3.0048 | 1.1715 | | hf_Reformer | 4 | 0.9954 | 0.999 | 0.9446 | 0.0 | 1.1587 | 1.1538 | | timm_vovnet | 32 | 0.9189 | 0.8867 | 0.8556 | 1.1162 | 1.2846 | 1.1406 | | timm_vision_transformer_large | 8 | 1.0 | 0.9905 | 0.0 | 0.9928 | 1.1564 | 1.1334 | | yolov3 | 16 | 1.0004 | 0.9903 | 0.8038 | 0.9306 | 1.1044 | 1.0799 | | lennard_jones | 1000 | 0.9714 | 0.7434 | 1.2671 | 1.0511 | 2.088 | 1.0736 | | dcgan | 32 | 0.9851 | 0.9098 | 1.6229 | 0.727 | 2.5611 | 1.0728 | | soft_actor_critic | 256 | 0.9908 | 0.7429 | 1.3292 | 1.0647 | 1.7305 | 1.0396 | | hf_BigBird | 2 | 0.9936 | 0.9157 | 1.0555 | 0.0 | 1.1515 | 1.0361 | | nvidia_deeprecommender | 256 | 0.9988 | 0.9962 | 0.6971 | 0.979 | 0.9895 | 1.0298 | | tts_angular | 64 | 0.9591 | 0.9356 | 0.9872 | 0.9955 | 1.0077 | 1.0252 | | demucs | 4 | 1.0001 | 
1.0008 | 1.0017 | 1.0039 | 1.0004 | 0.9973 | | dlrm | 2048 | 1.1136 | 1.0706 | 0.0 | 0.0 | 0.0 | 0.9776 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass 
| pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 
| pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | mobilenet_v3_large | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | pass | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | yolov3 | 16 | 2.923 | 9.7089 | 13.516 | 39.5674 | 442.3717 | 402.8973 | | hf_T5_large | 2 | 12.9718 | 48.602 | nan | nan | 228.4119 | 115.6282 | | timm_efficientdet | 1 | 19.6178 | 42.7358 | nan | nan | 465.5501 | 92.4418 | | hf_GPT2_large | 4 | 5.3147 | 24.5289 | nan | nan | nan | 69.9105 | | timm_vision_transformer_large | 8 | 2.7963 | 19.039 | nan | 38.1465 | 127.5698 | 52.2793 | | densenet121 | 4 | 2.1156 | 15.3847 | 22.9485 | 124.2562 | 51.3158 | 49.6844 | | timm_nfnet 
| 128 | 1.9394 | 8.3928 | nan | 37.4136 | 31.9275 | 30.3265 | | speech_transformer | 32 | 1.8686 | 11.1471 | nan | nan | 153.5341 | 29.0602 | | hf_BigBird | 2 | 7.995 | 16.5311 | 37.0042 | nan | 48.5517 | 28.6978 | | BERT_pytorch | 16 | 1.681 | 9.8188 | nan | nan | 104.329 | 28.6499 | | hf_Bart | 4 | 1.7165 | 11.0239 | nan | nan | 56.2143 | 26.5941 | | hf_T5 | 8 | 2.1584 | 10.6262 | nan | nan | 52.43 | 25.8487 | | fastNLP_Bert | 6 | 1.7279 | 8.9622 | 13.5539 | nan | 72.7087 | 24.9241 | | timm_regnet | 32 | 2.2887 | 10.0729 | 24.6471 | 59.6018 | 26.1219 | 24.8306 | | attention_is_all_you_need_pytorch | 256 | 1.2898 | 9.299 | nan | nan | 151.7044 | 23.7492 | | timm_efficientnet | 32 | 1.7559 | 7.8059 | 18.1243 | 68.4276 | 24.1677 | 22.8523 | | hf_Bert | 4 | 1.575 | 8.7208 | 12.1287 | nan | 34.2626 | 22.2991 | | hf_GPT2 | 4 | 1.4288 | 7.8781 | 11.3481 | nan | 67.9973 | 20.003 | | shufflenet_v2_x1_0 | 128 | 0.9449 | 6.2344 | 8.8631 | 37.1219 | 20.981 | 19.7286 | | mobilenet_v3_large | 32 | 0.8985 | 6.0447 | 8.2574 | 71.7018 | 33.405 | 19.7087 | | Background_Matting | 4 | 0.8973 | 5.7766 | 7.9597 | 42.6476 | 19.9451 | 18.947 | | Super_SloMo | 6 | 1.0495 | 5.9627 | 7.8978 | nan | 19.6864 | 18.8643 | | mobilenet_v2 | 96 | 0.7945 | 5.3465 | 7.8946 | 40.0719 | 19.8713 | 18.229 | | hf_Albert | 8 | 1.3084 | 8.1272 | 11.9952 | nan | 48.3702 | 17.2142 | | mnasnet1_0 | 32 | 0.8019 | 5.3211 | 7.4705 | 42.8673 | 31.4894 | 17.2027 | | timm_vovnet | 32 | 1.4771 | 5.3249 | 11.6258 | 30.1357 | 18.0646 | 17.0872 | | resnet50 | 32 | 0.8485 | 5.7615 | 8.0429 | 40.4635 | 17.6785 | 16.6779 | | resnext50_32x4d | 8 | 0.8793 | 5.7376 | 7.9595 | 35.6892 | 30.5601 | 16.3652 | | timm_vision_transformer | 8 | 0.9322 | 5.829 | 7.7951 | 13.8935 | 132.5076 | 16.2786 | | hf_Reformer | 4 | 2.4679 | 5.2911 | 9.7317 | nan | 36.0017 | 13.9363 | | timm_resnest | 32 | 0.5771 | 3.2137 | 4.4248 | 42.3999 | 122.2678 | 11.9753 | | hf_DistilBert | 8 | 0.5984 | 4.0834 | 8.17 | nan | 20.8504 | 11.6312 | | 
functorch_dp_cifar10 | 64 | 0.373 | 2.3811 | 3.266 | 6.1885 | 27.488 | 9.6611 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4183 | 2.7507 | 3.5918 | 4.6509 | 9.2553 | 8.877 | | pytorch_unet | 1 | 0.4465 | 2.5453 | 3.5729 | 25.8944 | 9.434 | 8.7277 | | resnet18 | 16 | 0.4111 | 2.2097 | 3.3 | 22.8777 | 22.4984 | 7.833 | | LearningToPaint | 96 | 0.4329 | 2.2982 | 3.2991 | 30.2571 | 8.2768 | 7.6915 | | pytorch_stargan | 16 | 0.3986 | 2.634 | 3.5008 | 6.7975 | 102.4642 | 6.8949 | | squeezenet1_1 | 32 | 0.2615 | 1.3857 | 1.9036 | 6.4523 | 4.9767 | 4.7825 | | vgg16 | 64 | 0.2087 | 0.9465 | 1.4011 | 3.4915 | 4.3809 | 3.6518 | | drq | 1 | 0.1568 | 0.6445 | 1.0458 | 4.3191 | 4.1117 | 3.4548 | | pytorch_struct | 200 | 0.2711 | 1.1176 | 1.7622 | 5.3086 | 99.742 | 3.3771 | | dlrm | 2048 | 0.471 | 1.018 | nan | nan | nan | 3.1767 | | alexnet | 128 | 0.1657 | 0.6086 | 0.8834 | 3.1333 | 3.4741 | 2.9053 | | soft_actor_critic | 256 | 0.2101 | 0.4298 | 0.6681 | 1.9993 | 3.4273 | 2.7294 | | dcgan | 32 | 0.17 | 0.5153 | 0.7625 | 4.1727 | 3.03 | 2.5 | | nvidia_deeprecommender | 256 | 0.2071 | 0.6086 | 0.8996 | 2.8956 | 4.6821 | 2.2187 | | lennard_jones | 1000 | 0.1599 | 0.4314 | 0.6112 | 1.4535 | 2.1934 | 1.9312 | | tts_angular | 64 | 0.2324 | 0.2957 | 0.4182 | 1.0314 | 1.9835 | 1.4594 | | demucs | 4 | 0.3603 | 0.3485 | 0.3495 | 0.3509 | 0.2726 | 0.2585 | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | tacotron2 | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | 
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | hf_Albert | 8 | 0.9814 | 0.936 | 0.3273 | nan | 1.1576 | 1.5688 | | timm_efficientdet | 1 | 1.028 | 0.8404 | nan | nan | 1.0226 | 1.4663 | | timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2716 | 0.7887 | 1.2758 | 1.3156 | | hf_T5 | 8 | 0.9678 | 0.9371 | nan | nan | 0.9309 | 1.2564 | | BERT_pytorch | 16 | 1.0003 | 0.8822 | nan | nan | 0.9728 | 1.2033 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | nan | nan | 0.9829 | 1.1666 | | hf_GPT2 | 4 | 0.9706 | 0.8625 | 0.3688 | nan | 0.9648 | 1.153 | | hf_BigBird | 2 | 0.9837 | 0.9784 | 0.4541 | nan | 0.8098 | 1.1522 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8661 | 1.1505 | | Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3843 | nan | 1.0536 | 1.1475 | | hf_GPT2_large | 4 | 0.9582 | 0.8645 | nan | nan | nan | 1.1364 | | timm_nfnet | 128 | 0.9693 | 0.8982 | nan | 0.9445 | 1.0337 | 1.1245 | | speech_transformer | 32 | 1.0048 | 0.9174 | nan | nan | 0.9066 | 1.118 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3921 | 0.8871 | 0.7151 | 1.0538 | | timm_resnest | 32 | 0.9868 | 0.8711 | 0.3483 | 0.8623 | 0.8756 | 1.053 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4736 | 0.9149 | 0.7295 | 1.0367 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | 0.8452 | 0.4478 | 1.0327 | | timm_vision_transformer_large | 8 | 0.9973 | 0.8358 | nan | 0.8494 | 0.879 | 1.0239 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9117 | 1.0074 | 1.0232 | | hf_Bert | 4 | 0.9844 | 0.8677 | 0.3806 | nan | 0.9017 | 1.0046 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0002 | 0.9895 | 1.0002 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 1.0967 | 0.564 | 0.9991 | | hf_Reformer | 4 | 0.3764 | 0.9847 | 0.3481 | nan | 0.3629 | 0.9874 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | hf_Bart | 4 | 0.9102 | 0.8321 | nan | nan | 0.8137 | 0.9871 | | 
pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4253 | 0.8882 | 0.7783 | 0.9807 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5082 | 0.4235 | 0.9726 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8735 | 0.4234 | 0.8441 | 0.8863 | 0.9721 |
| hf_DistilBert | 8 | 0.9505 | 0.8806 | 0.3229 | nan | 0.8387 | 0.9717 |
| densenet121 | 4 | 0.9857 | 0.8678 | 0.3667 | 0.8376 | 0.8753 | 0.9535 |
| resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3882 | 0.8176 | 0.7644 | 0.9518 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3446 | 0.866 | 0.7918 | 0.9343 |
| timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.85 | 0.9249 | 0.9292 |
| Background_Matting | 4 | 1.0146 | 0.9624 | 0.3723 | 0.9813 | 0.9245 | 0.9292 |
| drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8308 | 0.752 | 0.9256 |
| yolov3 | 16 | 0.9908 | 0.8381 | 0.3536 | 0.8244 | 0.9059 | 0.9109 |
| alexnet | 128 | 0.951 | 0.7753 | 0.4793 | 0.7753 | 0.7974 | 0.9099 |
| resnet50 | 32 | 0.9907 | 0.8629 | 0.3563 | 0.7995 | 0.865 | 0.9026 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3459 | 0.7589 | 0.8611 | 0.8951 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3573 | 0.8503 | 0.856 | 0.8927 |
| mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.341 | 0.8207 | 0.7541 | 0.8749 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.4994 | 0.7073 | 0.8283 | 0.8738 |
| pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3572 | 0.8496 | 0.8678 | 0.8715 |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.3941 | 0.7276 | 0.6102 | 0.8568 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 |
| timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3407 | 0.7742 | 0.8352 | 0.8469 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3827 | 0.6722 | 0.7295 | 0.8017 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.3776 | 0.7172 | 0.7491 | 0.7534 |
| dlrm | 2048 | 0.7302 | 0.7306 | nan | nan | nan | 0.7306 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| MT5ForConditionalGeneration | 2 | 1.0212 | 0.8537 | 0.0 | 0.0 | 5.8401 | 2.4169 |
| GPT2ForSequenceClassification | 4 | 1.0005 | 0.977 | 0.0 | 0.0 | 2.1528 | 2.1163 |
| DistillGPT2 | 1 | 1.0292 | 0.8786 | 1.2513 | 0.0 | 2.8593 | 2.0686 |
| OPTForCausalLM | 4 | 1.0041 | 0.8281 | 1.801 | 0.0 | 4.0683 | 1.9383 |
| ElectraForQuestionAnswering | 64 | 1.0004 | 0.98 | 0.7683 | 0.0 | 1.9566 | 1.9104 |
| PLBartForConditionalGeneration | 8 | 1.0152 | 0.831 | 1.485 | 0.0 | 2.9117 | 1.8791 |
| MegatronBertForQuestionAnswering | 8 | 1.0325 | 0.8479 | 1.4804 | 0.0 | 2.7143 | 1.8681 |
| MBartForConditionalGeneration | 8 | 1.0133 | 0.833 | 1.29 | 0.0 | 2.425 | 1.8204 |
| TrOCRForCausalLM | 8 | 1.0122 | 0.8299 | 0.0 | 0.0 | 2.0315 | 1.7879 |
| Speech2Text2ForCausalLM | 64 | 1.0089 | 0.8016 | 0.9604 | 0.0 | 2.2801 | 1.7876 |
| CamemBert | 1 | 1.0359 | 0.8456 | 1.7367 | 0.0 | 3.5883 | 1.7784 |
| BartForConditionalGeneration | 1 | 1.0127 | 0.8451 | 0.0 | 0.0 | 1.835 | 1.7714 |
| XGLMForCausalLM | 1 | 1.0128 | 0.8061 | 0.0 | 0.0 | 3.2516 | 1.7645 |
| RobertaForCausalLM | 4 | 1.037 | 0.8408 | 1.9854 | 0.0 | 4.0998 | 1.7565 |
| MegatronBertForCausalLM | 2 | 1.0316 | 0.851 | 2.1788 | 0.0 | 4.2892 | 1.7409 |
| MobileBertForQuestionAnswering | 32 | 1.0206 | 0.8216 | 0.0 | 0.0 | 5.4787 | 1.7407 |
| DistilBertForMaskedLM | 16 | 1.0314 | 0.8586 | 1.035 | 0.0 | 2.3368 | 1.7378 |
| MobileBertForMaskedLM | 16 | 1.017 | 0.8202 | 0.0 | 0.0 | 5.9124 | 1.7318 |
| DistilBertForQuestionAnswering | 32 | 1.0318 | 0.8387 | 0.8926 | 0.0 | 1.8261 | 1.7123 |
| ElectraForCausalLM | 1 | 1.0333 | 0.8518 | 2.6494 | 0.0 | 6.9167 | 1.7048 |
| LayoutLMForSequenceClassification | 16 | 1.0003 | 0.9807 | 0.7766 | 0.0 | 1.7372 | 1.6987 |
| PegasusForConditionalGeneration | 4 | 1.0149 | 0.8276 | 1.6544 | 0.0 | 3.3423 | 1.6961 |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0129 | 0.8855 | 0.0 | 0.0 | 1.8344 | 1.6928 |
| PegasusForCausalLM | 8 | 1.0125 | 0.8143 | 1.0511 | 0.0 | 1.9151 | 1.6918 |
| YituTechConvBert | 1 | 1.0227 | 0.8902 | 0.0 | 0.0 | 5.5641 | 1.6775 |
| AlbertForQuestionAnswering | 2 | 1.0007 | 0.8084 | 0.0 | 0.0 | 1.6676 | 1.6491 |
| AlbertForMaskedLM | 2 | 1.0004 | 0.8083 | 0.0 | 0.0 | 1.6609 | 1.6461 |
| XLNetLMHeadModel | 4 | 0.9982 | 0.9679 | 0.0 | 0.0 | 1.6195 | 1.6285 |
| M2M100ForConditionalGeneration | 2 | 1.0099 | 0.8836 | 2.0714 | 0.0 | 3.9054 | 1.615 |
| T5ForConditionalGeneration | 4 | 1.0029 | 0.9399 | 0.0 | 0.0 | 1.6013 | 1.5996 |
| LayoutLMForMaskedLM | 16 | 1.0004 | 0.9711 | 0.7533 | 0.0 | 1.5855 | 1.5656 |
| PLBartForCausalLM | 16 | 1.0135 | 0.9475 | 0.927 | 0.0 | 1.5381 | 1.5062 |
| BartForCausalLM | 2 | 1.0018 | 0.9647 | 0.7407 | 0.0 | 1.499 | 1.4516 |
| T5Small | 1 | 1.0245 | 0.8784 | 0.0 | 0.0 | 1.7858 | 1.4468 |
| MBartForCausalLM | 16 | 1.0152 | 0.9072 | 0.0 | 0.0 | 1.4077 | 1.4182 |
| BertForQuestionAnswering | 64 | 1.0015 | 0.9675 | 0.7679 | 0.0 | 1.4346 | 1.3999 |
| RobertaForQuestionAnswering | 64 | 1.0 | 0.9709 | 0.77 | 0.0 | 1.4356 | 1.3965 |
| BertForMaskedLM | 64 | 0.9999 | 0.9578 | 0.7351 | 0.0 | 1.3295 | 1.317 |
| BlenderbotSmallForCausalLM | 64 | 1.0016 | 0.9217 | 0.7015 | 0.0 | 1.3074 | 1.312 |
| DebertaForMaskedLM | 4 | 0.9364 | 0.737 | 0.8416 | 0.0 | 1.245 | 1.2378 |
| DebertaForQuestionAnswering | 4 | 0.9309 | 0.7348 | 0.9413 | 0.0 | 1.4976 | 1.2299 |
| BigBird | 1 | 0.9913 | 0.9116 | 1.0399 | 0.0 | 1.1587 | 1.0251 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| MobileBertForMaskedLM | 16 | 9.0195 | 42.6022 | nan | nan | 121.1753 | 76.0038 |
| MobileBertForQuestionAnswering | 32 | 8.9885 | 39.986 | nan | nan | 103.8671 | 74.0658 |
| XLNetLMHeadModel | 4 | 3.7491 | 24.4324 | nan | nan | 173.4133 | 59.9689 |
| MBartForConditionalGeneration | 8 | 3.4406 | 21.6843 | 33.6119 | nan | 69.3267 | 51.0072 |
| PegasusForConditionalGeneration | 4 | 3.2658 | 21.0318 | 33.0349 | nan | 84.6794 | 50.1654 |
| M2M100ForConditionalGeneration | 2 | 3.4775 | 20.0169 | 27.9012 | nan | 88.9362 | 50.0348 |
| MegatronBertForCausalLM | 2 | 3.4485 | 18.0977 | 26.8718 | nan | 74.6988 | 47.3342 |
| BartForConditionalGeneration | 1 | 3.364 | 21.4216 | nan | nan | 57.7421 | 46.8986 |
| MegatronBertForQuestionAnswering | 8 | 3.5021 | 17.8618 | 26.1531 | nan | 67.3498 | 44.8143 |
| XGLMForCausalLM | 1 | 2.7292 | 16.5557 | nan | nan | 69.2599 | 41.0403 |
| MT5ForConditionalGeneration | 2 | 3.3471 | 16.2423 | nan | nan | 104.9205 | 37.5933 |
| DebertaForMaskedLM | 4 | 4.8862 | 12.3834 | 54.0419 | nan | 130.5489 | 35.1018 |
| DebertaForQuestionAnswering | 4 | 4.9971 | 12.3246 | 47.0074 | nan | 99.3878 | 34.2262 |
| BlenderbotSmallForConditionalGeneration | 32 | 2.1737 | 14.0085 | nan | nan | 57.6438 | 33.8027 |
| YituTechConvBert | 1 | 2.4988 | 13.456 | nan | nan | 119.0895 | 31.7395 |
| BigBird | 1 | 7.9649 | 16.4584 | 37.3619 | nan | 48.9328 | 29.3896 |
| PLBartForConditionalGeneration | 8 | 1.7594 | 11.1994 | 15.7536 | nan | 63.1955 | 26.6755 |
| T5ForConditionalGeneration | 4 | 2.0978 | 10.6908 | nan | nan | 63.019 | 25.8627 |
| T5Small | 1 | 2.1081 | 10.7492 | nan | nan | 60.4685 | 25.1827 |
| LayoutLMForSequenceClassification | 16 | 1.7653 | 9.3441 | 12.9331 | nan | 47.9953 | 23.7189 |
| LayoutLMForMaskedLM | 16 | 1.8121 | 9.0929 | 13.1247 | nan | 37.3037 | 22.7907 |
| RobertaForCausalLM | 4 | 1.6561 | 8.9255 | 12.2233 | nan | 62.9074 | 22.65 |
| ElectraForQuestionAnswering | 64 | 1.6063 | 8.8847 | 12.3916 | nan | 37.068 | 22.3866 |
| BertForMaskedLM | 64 | 1.5424 | 8.6723 | 12.2396 | nan | 39.9349 | 22.0051 |
| ElectraForCausalLM | 1 | 1.6779 | 8.8162 | 11.9966 | nan | 25.5091 | 21.8899 |
| CamemBert | 1 | 1.6319 | 8.7693 | 12.0634 | nan | 25.0396 | 21.4368 |
| BertForQuestionAnswering | 64 | 1.6921 | 8.6988 | 12.1721 | nan | 23.0439 | 21.2348 |
| PegasusForCausalLM | 8 | 1.2966 | 7.9649 | 11.9687 | nan | 44.3083 | 21.1057 |
| MBartForCausalLM | 16 | 1.2299 | 7.8781 | nan | nan | 33.5959 | 20.9377 |
| RobertaForQuestionAnswering | 64 | 1.6109 | 8.9447 | 12.3986 | nan | 23.1424 | 20.6543 |
| GPT2ForSequenceClassification | 4 | 1.4262 | 8.0483 | nan | nan | 34.5897 | 20.232 |
| AlbertForMaskedLM | 2 | 1.4638 | 8.4092 | nan | nan | 34.9008 | 19.8435 |
| BartForCausalLM | 2 | 1.2531 | 7.9796 | 11.8826 | nan | 32.8441 | 19.6195 |
| TrOCRForCausalLM | 8 | 1.2602 | 8.0204 | nan | nan | 25.7557 | 19.4415 |
| OPTForCausalLM | 4 | 1.3439 | 8.1321 | 20.022 | nan | 39.4153 | 19.0927 |
| AlbertForQuestionAnswering | 2 | 1.4795 | 8.3655 | nan | nan | 18.0915 | 17.1831 |
| BlenderbotSmallForCausalLM | 64 | 0.7727 | 5.7176 | 7.7626 | nan | 27.7887 | 14.1211 |
| DistilBertForMaskedLM | 16 | 0.6189 | 4.172 | 8.2951 | nan | 22.6199 | 12.0578 |
| DistilBertForQuestionAnswering | 32 | 0.6611 | 4.2243 | 8.2315 | nan | 34.9369 | 11.7694 |
| Speech2Text2ForCausalLM | 64 | 0.6848 | 4.1527 | 6.7412 | nan | 27.6725 | 11.7505 |
| PLBartForCausalLM | 16 | 0.6324 | 4.1172 | 6.1244 | nan | 26.6818 | 11.4105 |
| DistillGPT2 | 1 | 0.7388 | 3.9171 | 5.6422 | nan | 28.1078 | 10.5911 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| AlbertForQuestionAnswering | 2 | 1.0 | 0.6451 | nan | nan | 0.9124 | 1.44 |
| AlbertForMaskedLM | 2 | 1.0 | 0.6364 | nan | nan | 0.8977 | 1.4235 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9594 | nan | nan | 0.995 | 1.2292 |
| T5Small | 1 | 1.0 | 0.9124 | nan | nan | 0.9874 | 1.1703 |
| GPT2ForSequenceClassification | 4 | 0.9675 | 0.9164 | nan | nan | 1.0779 | 1.1635 |
| DebertaForQuestionAnswering | 4 | 0.9792 | 1.0574 | 0.3598 | nan | 0.3761 | 1.1472 |
| BartForConditionalGeneration | 1 | 1.0 | 0.8619 | nan | nan | 0.9894 | 1.1415 |
| BartForCausalLM | 2 | 1.0 | 0.8769 | 0.3797 | nan | 1.0442 | 1.1204 |
| BigBird | 1 | 1.0008 | 0.9547 | 0.4481 | nan | 0.835 | 1.1192 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9824 | 0.3623 | nan | 0.4498 | 1.1125 |
| BlenderbotSmallForConditionalGeneration | 32 | 0.9998 | 0.8996 | nan | nan | 0.9557 | 1.1008 |
| Speech2Text2ForCausalLM | 64 | 0.969 | 0.8488 | 0.3578 | nan | 0.9452 | 1.075 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 0.9938 | 1.0704 |
| MobileBertForMaskedLM | 16 | 0.9985 | 0.8983 | nan | nan | 0.6948 | 1.0683 |
| MBartForConditionalGeneration | 8 | 0.9999 | 0.8187 | 0.4121 | nan | 0.8861 | 1.0626 |
| RobertaForQuestionAnswering | 64 | 0.9996 | 0.9315 | 0.3686 | nan | 0.9946 | 1.0621 |
| BertForQuestionAnswering | 64 | 0.9995 | 0.9315 | 0.3686 | nan | 0.9946 | 1.0621 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0056 | 1.0614 |
| DistilBertForQuestionAnswering | 32 | 0.9992 | 0.8965 | 0.376 | nan | 0.8639 | 1.0584 |
| DistillGPT2 | 1 | 0.9963 | 0.7527 | 0.3884 | nan | 0.8288 | 1.0545 |
| MBartForCausalLM | 16 | 1.0 | 0.8398 | nan | nan | 0.9567 | 1.0451 |
| BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | 0.3597 | nan | 0.9269 | 1.0441 |
| PegasusForCausalLM | 8 | 0.999 | 0.9444 | 0.4647 | nan | 0.8445 | 1.0404 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3629 | nan | 0.9811 | 1.0365 |
| PegasusForConditionalGeneration | 4 | 0.9994 | 0.9194 | 0.4621 | nan | 0.7686 | 1.0358 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3549 | nan | 0.9871 | 1.0264 |
| PLBartForConditionalGeneration | 8 | 0.9975 | 0.8294 | 0.3984 | nan | 0.8438 | 1.0221 |
| OPTForCausalLM | 4 | 0.9974 | 0.75 | 0.3898 | nan | 0.8483 | 1.019 |
| TrOCRForCausalLM | 8 | 1.0 | 0.7955 | nan | nan | 0.8774 | 1.0171 |
| DistilBertForMaskedLM | 16 | 0.9986 | 0.8686 | 0.3662 | nan | 0.9164 | 1.0168 |
| PLBartForCausalLM | 16 | 1.0001 | 0.8666 | 0.3854 | nan | 0.9395 | 1.013 |
| ElectraForCausalLM | 1 | 0.9993 | 0.8955 | 0.3766 | nan | 0.6701 | 1.011 |
| XLNetLMHeadModel | 4 | 0.9912 | 0.8791 | nan | nan | 1.0109 | 1.0109 |
| CamemBert | 1 | 0.9989 | 0.7872 | 0.4083 | nan | 0.8654 | 1.0095 |
| M2M100ForConditionalGeneration | 2 | 0.9997 | 0.9659 | 0.5099 | nan | 0.7118 | 1.0048 |
| XGLMForCausalLM | 1 | 1.0 | 0.999 | nan | nan | 0.7913 | 1.0 |
| YituTechConvBert | 1 | 0.9718 | 0.7819 | nan | nan | 0.8618 | 0.9718 |
| RobertaForCausalLM | 4 | 0.9237 | 0.7741 | 0.4183 | nan | 0.8574 | 0.9237 |
| MegatronBertForQuestionAnswering | 8 | 0.9051 | 0.8218 | 0.4331 | nan | 0.8434 | 0.9051 |
| MobileBertForQuestionAnswering | 32 | 1.0142 | 0.9796 | nan | nan | 0.6265 | 0.8395 |
| MegatronBertForCausalLM | 2 | 0.7726 | 0.7726 | 0.4464 | nan | 0.7726 | 0.7726 |
| MT5ForConditionalGeneration | 2 | 0.6019 | 0.6019 | nan | nan | 0.6019 | 0.6019 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
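
For anyone who wants to summarize these tables offline, here is a minimal sketch (not part of the dashboard tooling) that parses a pipe-delimited table like the ones above and takes a geometric mean of one backend's speedup column. The helper names `parse_table` and `geomean_speedup` are hypothetical, and skipping `0.0` and `nan` entries as "did not run" is my assumption, not something the dashboard states. The sample rows are copied from the huggingface speedup table.

```python
import math

def parse_table(text):
    """Parse a '|'-delimited ASCII table into a list of dicts keyed by header."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders (+---+), fences (~~~) and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' line is the header row
        else:
            rows.append(dict(zip(header, cells)))
    return rows

def geomean_speedup(rows, backend):
    """Geometric mean of one backend column.

    Assumption: 0.0 and nan mean the model did not run, so they are skipped.
    """
    vals = []
    for row in rows:
        v = float(row[backend])
        if v > 0 and not math.isnan(v):
            vals.append(v)
    if not vals:
        return float("nan")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

sample = """
+-----------------------------+----+--------+----------+
| name | bs | eager | inductor |
+-----------------------------+----+--------+----------+
| MT5ForConditionalGeneration | 2 | 1.0212 | 5.8401 |
| BigBird | 1 | 0.9913 | 1.1587 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 |
+-----------------------------+----+--------+----------+
"""

rows = parse_table(sample)
print(len(rows), "rows parsed")
print("inductor geomean:", geomean_speedup(rows, "inductor"))
```

A geometric mean is used here only because the values are ratios; whether failed models should be dropped or counted as 1.0x is a judgment call that changes the summary noticeably.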

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| tnt_s_patch16_224 | 64 | 0.9998 | 0.996 | 0.0 | 1.8927 | 2.0669 | 2.0166 |
| xcit_large_24_p8_224 | 5 | 1.0001 | 0.0 | 0.0 | 0.0 | 2.1732 | 1.7693 |
| cait_m36_384 | 2 | 1.0031 | 0.8444 | 0.0 | 1.3666 | 2.5088 | 1.7267 |
| ghostnet_100 | 128 | 1.0023 | 0.9967 | 0.8918 | 1.5678 | 2.0488 | 1.7165 |
| lcnet_050 | 128 | 0.9697 | 0.9507 | 0.8689 | 1.6016 | 2.0068 | 1.6265 |
| nfnet_l0 | 64 | 1.008 | 0.8316 | 0.8087 | 1.1327 | 1.7737 | 1.6092 |
| gmixer_24_224 | 64 | 0.9991 | 0.8834 | 0.0 | 1.0025 | 1.665 | 1.6079 |
| resnest101e | 32 | 1.0029 | 0.9792 | 1.1035 | 1.427 | 2.3567 | 1.5816 |
| volo_d1_224 | 64 | 0.9999 | 0.9942 | 0.0 | 1.1417 | 1.6044 | 1.5643 |
| twins_pcpvt_base | 32 | 1.0042 | 0.8954 | 1.3418 | 1.342 | 2.4281 | 1.5564 |
| crossvit_9_240 | 64 | 1.0069 | 0.961 | 0.8776 | 1.1151 | 1.5854 | 1.5195 |
| swin_base_patch4_window7_224 | 64 | 1.0001 | 0.961 | 0.0 | 1.0368 | 1.5142 | 1.5033 |
| dla102 | 64 | 1.0002 | 0.9865 | 0.8131 | 1.3833 | 1.5367 | 1.4816 |
| gluon_inception_v3 | 128 | 0.9999 | 0.9963 | 0.8525 | 1.1958 | 1.5059 | 1.473 |
| inception_v3 | 128 | 1.0 | 0.9966 | 0.8527 | 1.1968 | 1.4976 | 1.4686 |
| adv_inception_v3 | 128 | 1.0 | 0.9966 | 0.8534 | 1.1962 | 1.5057 | 1.4672 |
| mnasnet_100 | 128 | 0.9546 | 0.9419 | 0.787 | 1.3716 | 1.4338 | 1.4598 |
| regnety_002 | 128 | 0.9774 | 0.9311 | 1.1255 | 1.3889 | 2.0652 | 1.4516 |
| mobilenetv3_large_100 | 128 | 0.9562 | 0.945 | 0.7833 | 1.3447 | 1.4623 | 1.4415 |
| mobilenetv2_100 | 128 | 0.9514 | 0.9415 | 0.7209 | 0.8657 | 1.3996 | 1.4347 |
| dm_nfnet_f0 | 128 | 0.9979 | 0.9994 | 0.0 | 1.1787 | 1.5001 | 1.425 |
| mobilevit_s | 32 | 0.9757 | 0.8152 | 0.7906 | 1.2165 | 1.6643 | 1.4127 |
| fbnetv3_b | 128 | 0.9531 | 0.9407 | 0.8011 | 1.2581 | 1.3993 | 1.4007 |
| spnasnet_100 | 128 | 0.9474 | 0.9363 | 0.7762 | 1.3173 | 1.3748 | 1.3962 |
| selecsls42b | 128 | 0.9998 | 0.9957 | 0.8407 | 1.3536 | 1.4213 | 1.3928 |
| coat_lite_mini | 128 | 0.9999 | 0.9887 | 0.842 | 1.2191 | 1.4288 | 1.3914 |
| fbnetc_100 | 128 | 0.9535 | 0.9438 | 0.7916 | 1.3691 | 1.3581 | 1.3752 |
| resmlp_12_224 | 128 | 1.0001 | 0.9978 | 0.7826 | 0.0 | 1.4146 | 1.3686 |
| jx_nest_base | 32 | 0.9998 | 0.9932 | 0.0 | 1.2254 | 1.4003 | 1.3651 |
| ese_vovnet19b_dw | 128 | 0.9703 | 0.9622 | 0.7669 | 1.2466 | 1.3586 | 1.3599 |
| tf_efficientnet_b0 | 128 | 0.966 | 0.8081 | 0.6673 | 1.0976 | 1.3531 | 1.3554 |
| pit_b_224 | 64 | 0.9998 | 0.9949 | 0.8216 | 1.0623 | 1.3594 | 1.3516 |
| res2net101_26w_4s | 64 | 1.0038 | 0.9889 | 0.9597 | 1.4095 | 1.6442 | 1.3485 |
| cspdarknet53 | 64 | 0.9431 | 0.9342 | 0.7564 | 0.9036 | 1.3312 | 1.3469 |
| botnet26t_256 | 128 | 0.98 | 0.9733 | 0.8127 | 1.3475 | 1.316 | 1.3277 |
| hrnet_w18 | 2 | 1.0054 | 0.9669 | 2.3282 | 1.3989 | 5.0848 | 1.3048 |
| res2next50 | 2 | 1.0025 | 0.9057 | 2.2905 | 1.3259 | 5.4981 | 1.3024 |
| res2net50_14w_8s | 2 | 1.0013 | 0.9122 | 2.3278 | 1.3861 | 5.8804 | 1.2998 |
| poolformer_m36 | 64 | 0.9999 | 0.9981 | 0.8065 | 0.0 | 1.3293 | 1.2964 |
| convit_base | 32 | 0.9997 | 0.9929 | 0.0 | 0.0 | 1.34 | 1.2899 |
| rexnet_100 | 128 | 0.9651 | 0.8514 | 0.6903 | 1.0315 | 1.2792 | 1.2774 |
| tinynet_a | 128 | 0.9716 | 0.8026 | 0.6506 | 1.0894 | 1.2557 | 1.2721 |
| pnasnet5large | 16 | 1.0049 | 1.0268 | 0.8387 | 1.1277 | 1.2934 | 1.2655 |
| beit_base_patch16_224 | 64 | 1.0001 | 0.9792 | 0.0 | 1.0445 | 1.2859 | 1.2652 |
| deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9926 | 0.7957 | 1.0626 | 1.2828 | 1.2638 |
| mixer_b16_224 | 64 | 1.0002 | 0.9921 | 0.7967 | 0.9572 | 1.285 | 1.2435 |
| eca_botnext26ts_256 | 64 | 0.962 | 0.8009 | 0.658 | 1.1046 | 1.2516 | 1.2415 |
| sebotnet33ts_256 | 64 | 0.9666 | 0.8371 | 0.6809 | 1.1174 | 1.2055 | 1.2087 |
| mixnet_l | 64 | 0.9815 | 0.8872 | 0.8056 | 1.1028 | 1.2378 | 1.1824 |
| tf_mixnet_l | 64 | 0.9839 | 0.9022 | 0.7936 | 1.0722 | 1.2388 | 1.1788 |
| vit_base_patch16_224 | 64 | 1.0 | 0.994 | 0.8351 | 0.9939 | 1.196 | 1.1779 |
| visformer_small | 128 | 0.9999 | 1.0019 | 0.8429 | 1.0838 | 1.2356 | 1.1765 |
| eca_halonext26ts | 64 | 0.9638 | 0.8036 | 0.6645 | 1.102 | 0.0 | 1.1754 |
| dpn107 | 32 | 0.9366 | 0.9289 | 0.7469 | 0.9873 | 1.1501 | 1.17 |
| repvgg_a2 | 128 | 0.9439 | 0.9357 | 0.7981 | 1.1328 | 1.142 | 1.1575 |
| gmlp_s16_224 | 64 | 1.0 | 0.9848 | 0.0 | 1.0463 | 1.3405 | 1.1304 |
| gluon_xception65 | 32 | 1.0001 | 0.9898 | 0.7537 | 1.065 | 1.1607 | 1.1273 |
| gernet_l | 128 | 0.947 | 0.936 | 0.7687 | 1.143 | 1.0697 | 1.0773 |
| swsl_resnext101_32x16d | 32 | 0.9996 | 0.9811 | 0.8072 | 1.0762 | 1.1379 | 1.0588 |
| convmixer_768_32 | 32 | 0.9999 | 0.9983 | 0.923 | 1.0533 | 1.0553 | 1.051 |
| convnext_base | 32 | 1.0087 | 0.9445 | 0.0 | 1.3707 | 0.7366 | 0.7146 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | pass |
| jx_nest_base | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_accuracy | pass | pass |
| pit_b_224 | 2 | pass | fail_accuracy | fail_accuracy | fail_accuracy | pass | pass |
| twins_pcpvt_base | 2 | pass | fail_accuracy | fail_accuracy | fail_accuracy | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+
~~~
Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| eca_halonext26ts | 64 | 1.435 | 6.1509 | 12.4327 | 65.9608 | nan | 220.5118 |
| hrnet_w18 | 2 | 6.1168 | 38.0167 | 66.361 | 371.0315 | 107.8157 | 116.4564 |
| pnasnet5large | 16 | 5.0919 | 27.3988 | 48.2146 | 189.0787 | 85.2776 | 80.574 |
| xcit_large_24_p8_224 | 5 | 3.0668 | nan | nan | nan | 193.9676 | 68.8234 |
| cait_m36_384 | 2 | 3.4055 | 24.6768 | nan | 61.1007 | 178.1442 | 62.3835 |
| res2net101_26w_4s | 64 | 2.9449 | 20.561 | 33.6122 | 118.2269 | 68.2382 | 60.6978 |
| resnest101e | 32 | 3.1821 | 20.4609 | 32.7293 | 101.181 | 142.8688 | 55.1254 |
| twins_pcpvt_base | 32 | 2.9751 | 19.4392 | 30.1407 | 72.0773 | 570.865 | 54.9535 |
| swin_base_patch4_window7_224 | 64 | 2.9795 | 16.0702 | nan | 71.9607 | 184.5472 | 51.8449 |
| res2net50_14w_8s | 2 | 2.7966 | 18.1464 | 28.7877 | 103.2297 | 53.9375 | 50.1289 |
| poolformer_m36 | 64 | 1.9704 | 10.502 | 16.1098 | nan | 50.2077 | 46.5752 |
| dpn107 | 32 | 3.8719 | 16.8405 | 51.1831 | 99.5596 | 46.5371 | 43.1415 |
| mobilevit_s | 32 | 1.8834 | 9.0856 | 17.8444 | 57.3713 | 435.559 | 42.5975 |
| convnext_base | 32 | 1.5402 | 8.918 | nan | 36.8594 | 233.7388 | 42.2959 |
| jx_nest_base | 32 | 1.8299 | 11.9719 | nan | 50.7765 | 154.2262 | 40.9813 |
| fbnetv3_b | 128 | 3.1286 | 13.1082 | 35.9453 | 99.8172 | 42.4427 | 39.2793 |
| adv_inception_v3 | 128 | 1.6951 | 10.7702 | 16.0534 | 98.6053 | 38.8197 | 36.9292 |
| gluon_xception65 | 32 | 1.9051 | 13.0636 | 20.5942 | 63.092 | 39.6838 | 36.6068 |
| tnt_s_patch16_224 | 64 | 1.9765 | 13.7438 | nan | 38.108 | 79.3178 | 36.349 |
| tf_mixnet_l | 64 | 6.0485 | 15.4409 | 31.3271 | 80.4989 | 39.2934 | 35.3401 |
| gluon_inception_v3 | 128 | 1.6463 | 10.6789 | 16.1246 | 99.5953 | 38.7047 | 35.3094 |
| inception_v3 | 128 | 1.6443 | 10.7318 | 16.0762 | 100.0787 | 40.7121 | 35.1945 |
| mixnet_l | 64 | 5.4554 | 14.3232 | 31.9029 | 80.7908 | 37.772 | 34.7109 |
| ghostnet_100 | 128 | 2.8074 | 11.4277 | 16.0482 | 90.9446 | 37.1004 | 34.594 |
| dla102 | 64 | 1.7372 | 11.9857 | 18.0077 | 85.2083 | 37.8686 | 34.4964 |
| gmlp_s16_224 | 64 | 1.358 | 9.3349 | nan | 22.0471 | 94.7536 | 33.1031 |
| volo_d1_224 | 64 | 1.4035 | 9.6409 | nan | 39.0205 | 99.4593 | 31.3828 |
| swsl_resnext101_32x16d | 32 | 1.8603 | 12.0047 | 17.7091 | 51.8964 | 33.0098 | 30.0827 |
| dm_nfnet_f0 | 128 | 2.0123 | 8.5217 | nan | 37.1719 | 33.4232 | 29.788 |
| res2next50 | 2 | 1.6902 | 10.5555 | 14.9018 | 58.4687 | 31.3378 | 29.1112 |
| crossvit_9_240 | 64 | 1.7674 | 10.7968 | 15.9577 | 36.4229 | 177.7272 | 28.7388 |
| rexnet_100 | 128 | 1.9778 | 8.8454 | 20.415 | 116.8786 | 30.9752 | 28.5034 |
| tinynet_a | 128 | 2.1854 | 9.7501 | 23.1673 | 78.7599 | 30.6411 | 27.887 |
| sebotnet33ts_256 | 64 | 1.8121 | 7.2766 | 16.3418 | 67.8501 | 200.1795 | 27.2786 |
| gmixer_24_224 | 64 | 1.543 | 10.4071 | nan | 28.2553 | 78.5994 | 24.9947 |
| cspdarknet53 | 64 | 2.4964 | 8.9171 | 22.4576 | 39.8907 | 27.2623 | 24.7776 |
| tf_efficientnet_b0 | 128 | 1.9292 | 8.2259 | 18.8205 | 77.8078 | 26.1743 | 23.675 |
| fbnetc_100 | 128 | 2.1137 | 8.3903 | 20.2628 | 59.9796 | 25.3075 | 22.7453 |
| coat_lite_mini | 128 | 1.101 | 6.8 | 10.0239 | 32.4766 | 402.4657 | 22.6502 |
| spnasnet_100 | 128 | 2.065 | 7.8044 | 20.1666 | 56.6263 | 24.9484 | 22.0166 |
| convit_base | 32 | 1.2693 | 7.6327 | nan | nan | 96.6373 | 21.9034 |
| eca_botnext26ts_256 | 64 | 1.4337 | 5.6354 | 12.0384 | 63.0366 | 365.2177 | 21.8795 |
| mobilenetv3_large_100 | 128 | 1.644 | 6.7536 | 15.1964 | 82.4359 | 23.6085 | 21.0345 |
| nfnet_l0 | 64 | 1.7975 | 8.4924 | 12.6507 | 34.4242 | 26.6382 | 20.7187 |
| botnet26t_256 | 128 | 1.3584 | 5.2753 | 11.1036 | 49.2952 | 189.7464 | 20.3718 |
| convmixer_768_32 | 32 | 1.3199 | 7.9233 | 11.8415 | 17.8022 | 25.4867 | 19.8285 |
| regnety_002 | 128 | 1.6717 | 7.1936 | 16.0403 | 56.1349 | 20.9255 | 19.1329 |
| mobilenetv2_100 | 128 | 1.7791 | 6.384 | 15.2861 | 40.5512 | 21.9773 | 19.0758 |
| gernet_l | 128 | 2.0374 | 7.463 | 18.659 | 44.8025 | 20.9234 | 19.0706 |
| mnasnet_100 | 128 | 1.6803 | 6.5951 | 15.5774 | 50.207 | 20.9103 | 18.743 |
| beit_base_patch16_224 | 64 | 1.2479 | 7.0291 | nan | 18.1129 | 38.3457 | 18.6593 |
| repvgg_a2 | 128 | 2.0414 | 7.0386 | 17.5579 | 61.292 | 20.4612 | 18.3748 |
| pit_b_224 | 64 | 1.0952 | 6.9745 | 10.0976 | 25.4079 | 87.9129 | 18.2073 |
| visformer_small | 128 | 0.922 | 4.9434 | 7.4202 | 30.5163 | 93.8021 | 17.8298 |
| deit_base_distilled_patch16_224 | 64 | 0.8452 | 6.0886 | 8.8188 | 14.258 | 40.8733 | 17.5893 |
| resmlp_12_224 | 128 | 0.6117 | 3.9529 | 7.3898 | nan | 41.942 | 17.1676 |
| selecsls42b | 128 | 0.7102 | 4.7117 | 6.9379 | 50.2083 | 19.065 | 16.9864 |
| vit_base_patch16_224 | 64 | 0.9454 | 5.9644 | 8.425 | 13.8959 | 29.1978 | 16.8973 |
| mixer_b16_224 | 64 | 0.6924 | 4.6145 | 7.8052 | 16.1793 | 43.3383 | 16.4071 |
| lcnet_050 | 128 | 1.0362 | 3.9688 | 8.4529 | 38.2322 | 14.9745 | 13.6927 |
| ese_vovnet19b_dw | 128 | 1.026 | 3.9109 | 7.9181 | 38.5395 | 14.9488 | 13.1984 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| tinynet_a | 128 | 0.9889 | 0.7884 | 0.2766 | 0.7887 | 1.3707 | 1.4015 |
| gmixer_24_224 | 64 | 0.9922 | 0.9494 | nan | 0.8991 | 1.2587 | 1.3627 |
| gmlp_s16_224 | 64 | 0.9939 | 0.9623 | nan | 0.92 | 1.2405 | 1.3565 |
| pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3633 | 1.1722 | 1.1607 | 1.2789 |
| sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3213 | 0.7354 | 0.745 | 1.2732 |
| pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 0.8179 | 0.9746 | 1.2244 |
| mobilevit_s | 32 | 0.9926 | 0.7681 | 0.2757 | 0.787 | 1.1122 | 1.2217 |
| eca_botnext26ts_256 | 64 | 0.989 | 0.7706 | 0.2697 | 0.7788 | 1.1084 | 1.2042 |
| eca_halonext26ts | 64 | 0.9885 | 0.775 | 0.2697 | 0.7792 | nan | 1.1991 |
| tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.8392 | 1.173 | 1.1918 |
| convit_base | 32 | 0.9972 | 0.8582 | nan | nan | 1.0248 | 1.1823 |
| rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | 0.8648 | 1.1475 | 1.1687 |
| tnt_s_patch16_224 | 64 | 0.9948 | 0.9668 | nan | 0.9431 | 1.0469 | 1.16 |
| dm_nfnet_f0 | 128 | 0.969 | 0.898 | nan | 0.9443 | 1.0336 | 1.124 |
| poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1022 | 1.1162 |
| crossvit_9_240 | 64 | 0.9874 | 0.8698 | 0.3378 | 0.8854 | 0.7934 | 1.0957 |
| beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | 0.9298 | 1.0004 | 1.0937 |
| twins_pcpvt_base | 32 | 0.9938 | 0.9046 | 0.3492 | 0.8007 | 0.9337 | 1.0923 |
| dla102 | 64 | 0.9931 | 0.9487 | 0.3592 | 0.9751 | 1.079 | 1.0867 |
| deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 0.8794 | 0.8911 | 1.0785 |
| vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 0.8801 | 0.8916 | 1.0772 |
| volo_d1_224 | 64 | 0.9965 | 0.9475 | nan | 0.8587 | 1.0138 | 1.0716 |
| nfnet_l0 | 64 | 0.9884 | 0.8166 | 0.2786 | 0.8207 | 1.0034 | 1.0713 |
| visformer_small | 128 | 0.9899 | 0.9259 | 0.3469 | 0.8884 | 0.9382 | 1.0646 |
| resnest101e | 32 | 0.9955 | 0.9721 | 0.3558 | 0.9532 | 1.0272 | 1.058 |
| ghostnet_100 | 128 | 0.9756 | 0.87 | 0.3371 | 0.9026 | 0.9897 | 1.0571 |
| coat_lite_mini
| 128 | 1.0338 | 0.9202 | 0.3514 | 0.6593 | 0.7962 | 1.0496 | | tf_mixnet_l | 64 | 0.9903 | 0.8556 | 0.2894 | 0.8366 | 0.9291 | 1.0459 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9129 | 1.0048 | 1.021 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8772 | 0.9715 | 1.0173 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9289 | 1.014 | | resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | nan | 0.8092 | 1.0011 | | cait_m36_384 | 2 | 0.9993 | 0.8803 | nan | 0.903 | 0.8949 | 0.9997 | | mixer_b16_224 | 64 | 0.9929 | 0.9361 | 0.3571 | 0.7726 | 0.8978 | 0.9895 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | 0.9714 | 0.9746 | 0.9846 | | res2net50_14w_8s | 2 | 0.9968 | 0.824 | 0.4257 | 0.8169 | 0.8228 | 0.9804 | | res2next50 | 2 | 0.9976 | 0.8277 | 0.4221 | 0.8198 | 0.8231 | 0.979 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | 0.79 | 0.9645 | 0.9776 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.9146 | 0.9605 | 0.9746 | | mixnet_l | 64 | 0.99 | 0.8439 | 0.2738 | 0.7742 | 0.8647 | 0.9708 | | convnext_base | 32 | 1.0034 | 0.9053 | nan | 0.7521 | 0.8848 | 0.9666 | | dpn107 | 32 | 0.997 | 0.9097 | 0.3531 | 0.8814 | 0.9075 | 0.9593 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | 0.8487 | 0.9112 | 0.9354 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | 0.8451 | 0.7566 | 0.9238 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | 0.8524 | 0.8964 | 0.9224 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3301 | 0.8641 | 0.8948 | 0.916 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8538 | 0.8845 | 0.8998 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8538 | 0.8845 | 0.8998 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8538 | 0.8845 | 0.8998 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7708 | 0.8503 | 0.898 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | 0.8854 | 0.8924 | 0.8971 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8158 | 0.8621 | 0.8897 | | 
spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3344 | 0.8371 | 0.8602 | 0.8784 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3361 | 0.7559 | 0.8309 | 0.8769 | | jx_nest_base | 32 | 0.9983 | 0.8927 | nan | 0.86 | 0.6708 | 0.8749 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3348 | 0.8252 | 0.8503 | 0.8698 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7515 | 0.8245 | 0.8627 | | cspdarknet53 | 64 | 0.9915 | 0.8407 | 0.3241 | 0.7908 | 0.8512 | 0.8583 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7352 | 0.8387 | 0.8542 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3406 | 0.6789 | 0.7905 | 0.8278 | | hrnet_w18 | 2 | 0.9971 | 0.8333 | 0.4258 | 0.8355 | 0.8367 | 0.6644 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/48gqIxr.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/rkQ67e9.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/KjBc7VX.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring speedup, compilation latency and memory footprint reduction, we exclude the models that fail the accuracy check.
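As a rough illustration of this filtering step, here is a minimal sketch. It is not the dashboard's actual code: the helper name `drop_failing` is hypothetical, and treating only the literal status `pass` as passing is an assumption. The sample values are the aot_nvfuser entries for three torchbench models from the tables in this comment.

```python
from typing import Dict


def drop_failing(metric: Dict[str, float], accuracy: Dict[str, str]) -> Dict[str, float]:
    """Keep only the models whose accuracy status is exactly "pass".

    Assumption: any other status (fail_to_run, fail_accuracy, ...) or a
    missing status excludes the model from the aggregate metrics.
    """
    return {
        name: value
        for name, value in metric.items()
        if accuracy.get(name) == "pass"
    }


# aot_nvfuser speedups and accuracy statuses copied from the torchbench tables.
speedup = {"hf_Bert": 0.0, "timm_vovnet": 0.9779, "vgg16": 0.9973}
status = {"hf_Bert": "fail_to_run", "timm_vovnet": "pass", "vgg16": "pass"}

kept = drop_failing(speedup, status)
print(kept)
```

With this filter, hf_Bert does not contribute to any aot_nvfuser aggregate.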

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 89%, 49/55 | 98%, 42/43  | 100%, 61/61 |
|       aot_eager        | 89%, 49/55 | 98%, 42/43  | 97%, 59/61  |
|     aot_cudagraphs     | 73%, 40/55 | 49%, 21/43  | 38%, 23/61  |
|      aot_nvfuser       | 58%, 32/55 |  2%, 1/43   | 87%, 53/61  |
|        inductor        | 85%, 47/55 | 93%, 40/43  | 97%, 59/61  |
| inductor_no_cudagraphs | 91%, 50/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
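The "percent, passed/total" cells above could be produced by something like the sketch below. The function name `passrate` and the rule that only the literal status `pass` counts are assumptions made for illustration; the real script may count statuses differently.

```python
from typing import Iterable


def passrate(statuses: Iterable[str]) -> str:
    """Format a pass rate as "<percent>%, <passed>/<total>"."""
    statuses = list(statuses)
    total = len(statuses)
    # Assumption: only the exact status "pass" counts as a passing model.
    passed = sum(1 for s in statuses if s == "pass")
    return f"{passed / total:.0%}, {passed}/{total}"


# 49 passing models out of 55, as in the torchbench/eager cell.
cell = passrate(["pass"] * 49 + ["fail_to_run"] * 6)
print(cell)
```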

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.09x    |    1.02x    |    1.00x    |
|      aot_nvfuser       |   1.13x    |    1.12x    |    1.11x    |
|        inductor        |   1.50x    |    1.31x    |    1.26x    |
| inductor_no_cudagraphs |   1.23x    |    1.21x    |    1.25x    |
+------------------------+------------+-------------+-------------+
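For reference, a geometric mean over per-model speedups can be sketched as follows. Skipping non-positive entries is an assumption: the per-model tables use 0.0 for runs that failed, and the summary says failing models are excluded. The name `geomean_speedup` is hypothetical.

```python
import math
from typing import Iterable


def geomean_speedup(speedups: Iterable[float]) -> float:
    """Geometric mean of the positive speedups; NaN if none remain."""
    # Assumption: 0.0 (or negative) entries mark failed runs and are skipped.
    valid = [s for s in speedups if s > 0]
    if not valid:
        return float("nan")
    # exp(mean(log(x))) avoids overflow from multiplying many values.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Toy values, not taken from the dashboard.
g = geomean_speedup([2.0, 8.0, 0.0])
print(f"{g:.2f}x")
```

A geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown then average to 1x.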

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.78    |    2.15     |    1.94     |
|       aot_eager        |    6.47    |    9.29     |    9.22     |
|     aot_cudagraphs     |    6.79    |    12.09    |    16.48    |
|      aot_nvfuser       |   20.61    |    9.84     |    51.45    |
|        inductor        |   62.14    |    53.79    |    73.53    |
| inductor_no_cudagraphs |   61.41    |    48.85    |    72.55    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.35x    |    0.32x    |
|      aot_nvfuser       |   0.83x    |    1.08x    |    0.84x    |
|        inductor        |   0.84x    |    0.79x    |    0.96x    |
| inductor_no_cudagraphs |   0.93x    |    0.96x    |    1.01x    |
+------------------------+------------+-------------+-------------+
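The comment does not spell out how the compression ratio is computed. The sketch below assumes it is the native PyTorch peak memory divided by the backend's peak memory, which is consistent with "higher is better" but is an assumption. The helper name and the byte counts are hypothetical.

```python
import math
from typing import Optional


def compression_ratio(baseline_peak: float, backend_peak: Optional[float]) -> float:
    """Baseline peak memory divided by backend peak memory.

    Assumption: a value above 1.0 means the backend used less peak memory
    than native PyTorch. A missing or zero measurement yields NaN, matching
    the "nan" cells in the per-model tables.
    """
    if backend_peak is None or backend_peak <= 0:
        return float("nan")
    return baseline_peak / backend_peak


# Hypothetical peak memory readings in bytes.
ratio = compression_ratio(10_000_000, 8_000_000)
print(f"{ratio:.2f}x")
```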

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | densenet121 | 4 | 1.002 | 0.9978 | 2.3409 | 1.4467 | 5.1622 | 1.2963 | | timm_efficientdet | 1 | 0.979 | 0.8852 | 0.0 | 0.0 | 4.3858 | 1.5765 | | functorch_dp_cifar10 | 64 | 1.0026 | 0.9744 | 1.9559 | 1.1928 | 3.7282 | 1.2617 | | timm_vision_transformer | 8 | 1.0086 | 0.9257 | 1.4644 | 1.3478 | 2.6512 | 1.4226 | | drq | 1 | 1.0126 | 0.8387 | 1.6455 | 1.0639 | 2.4863 | 1.0927 | | mobilenet_v3_large | 32 | 1.0056 | 1.107 | 1.0133 | 1.377 | 2.0995 | 1.3533 | | resnext50_32x4d | 8 | 1.0026 | 1.0844 | 1.1689 | 1.3748 | 2.0143 | 1.2126 | | BERT_pytorch | 16 | 1.0102 | 0.877 | 0.0 | 0.0 | 1.9299 | 1.8929 | | pytorch_struct | 200 | 0.9985 | 0.7879 | 0.8584 | 0.8896 | 1.836 | 1.1984 | | lennard_jones | 1000 | 0.9749 | 0.8189 | 1.075 | 1.0247 | 1.8158 | 0.9447 | | resnet18 | 16 | 1.0028 | 1.0909 | 1.2494 | 1.3811 | 1.8013 | 1.2601 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9988 | 0.9368 | 1.2228 | 1.1906 | 1.7624 | 1.3194 | | squeezenet1_1 | 32 | 0.9955 | 0.9905 | 1.0378 | 1.1552 | 1.7426 | 1.2754 | | hf_Albert | 8 | 1.0015 | 0.9973 | 0.7452 | 0.0 | 1.6602 | 1.6555 | | dcgan | 32 | 0.9837 | 1.008 | 1.2562 | 1.1377 | 1.6441 | 1.0771 | | hf_T5_large | 2 | 1.0249 | 0.8892 | 0.0 | 0.0 | 1.6282 | 1.5773 | | speech_transformer | 32 | 1.0064 | 0.8979 | 0.0 | 0.0 | 1.5722 | 1.5391 | | soft_actor_critic | 256 | 0.9761 | 0.7772 | 1.0678 | 0.9922 | 1.5653 | 0.9507 | | shufflenet_v2_x1_0 | 128 | 1.0006 | 1.0507 | 0.813 | 1.1896 | 1.5405 | 1.3842 | | timm_resnest | 32 | 0.9995 | 1.0025 | 0.8052 | 1.1829 | 1.5228 | 1.4543 | | timm_nfnet | 128 | 0.9994 | 1.0003 | 0.0 | 1.2123 | 1.4689 | 1.4212 | | 
mnasnet1_0 | 32 | 1.0006 | 1.088 | 0.8592 | 1.3001 | 1.4623 | 1.2718 | | hf_GPT2 | 4 | 1.0115 | 0.9802 | 0.7295 | 0.0 | 1.4388 | 1.4355 | | mobilenet_v2 | 96 | 0.9999 | 0.9955 | 0.7297 | 1.0445 | 1.4283 | 1.408 | | fastNLP_Bert | 6 | 0.999 | 0.9684 | 0.7536 | 0.0 | 1.3722 | 1.3446 | | timm_efficientnet | 32 | 0.9556 | 0.8062 | 0.6823 | 1.0603 | 1.2967 | 1.2046 | | LearningToPaint | 96 | 1.0004 | 1.0523 | 0.8579 | 1.2275 | 1.251 | 1.208 | | hf_Bart | 4 | 1.0139 | 0.9742 | 0.726 | 0.0 | 1.2383 | 1.1758 | | resnet50 | 32 | 0.9994 | 0.9942 | 0.7614 | 1.163 | 1.2065 | 1.1694 | | pytorch_unet | 1 | 0.9996 | 0.9976 | 0.8458 | 1.076 | 1.2001 | 1.1857 | | Super_SloMo | 6 | 1.0001 | 0.9973 | 0.8672 | 0.0 | 1.1792 | 1.1644 | | vgg16 | 64 | 0.9998 | 0.9989 | 0.8582 | 0.9973 | 1.1729 | 1.1659 | | alexnet | 128 | 0.9993 | 0.998 | 0.8026 | 1.0005 | 1.162 | 1.1631 | | hf_Bert | 4 | 1.0263 | 0.9917 | 0.7132 | 0.0 | 1.1598 | 1.1563 | | hf_DistilBert | 8 | 1.0008 | 0.9562 | 0.6702 | 0.0 | 1.1516 | 1.1614 | | timm_regnet | 32 | 0.9654 | 0.9637 | 0.7808 | 1.095 | 1.129 | 1.0942 | | pytorch_stargan | 16 | 0.9988 | 0.9833 | 0.8651 | 0.9889 | 1.1239 | 1.0911 | | Background_Matting | 4 | 1.0004 | 1.0228 | 0.8681 | 1.0822 | 1.1156 | 1.1072 | | hf_Reformer | 4 | 0.9966 | 0.0 | 0.9273 | 0.0 | 1.1099 | 1.1339 | | hf_BigBird | 2 | 0.9955 | 0.9411 | 0.9488 | 0.0 | 1.1021 | 1.0022 | | yolov3 | 16 | 0.9999 | 0.9947 | 0.7916 | 1.1843 | 1.079 | 1.0653 | | timm_vision_transformer_large | 8 | 1.0 | 0.993 | 0.0 | 0.9826 | 1.0544 | 1.0394 | | attention_is_all_you_need_pytorch | 256 | 1.0001 | 0.9736 | 0.0 | 0.0 | 1.0438 | 1.0293 | | timm_vovnet | 32 | 0.9113 | 0.9041 | 0.715 | 0.9779 | 1.0066 | 1.0299 | | demucs | 4 | 1.0003 | 1.0 | 1.0001 | 0.9995 | 1.0004 | 1.0006 | | tts_angular | 64 | 0.9795 | 0.9612 | 0.9842 | 0.993 | 0.9987 | 1.009 | | nvidia_deeprecommender | 256 | 0.9997 | 0.9632 | 0.5851 | 0.9435 | 0.904 | 0.964 | | dlrm | 2048 | 0.0 | 1.0787 | 0.0 | 0.0 | 0.0 | 1.2115 | | hf_GPT2_large | 4 | 
1.0005 | 0.9803 | 0.0 | 0.0 | 0.0 | 1.3857 | | hf_T5 | 8 | 1.0017 | 0.9911 | 0.0 | 0.0 | 0.0 | 1.5038 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilenet_v2_quantized_qat | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | resnet50_quantized_qat | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | 
pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | 
pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | mobilenet_v2_quantized_qat | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | timm_efficientdet | 1 | 19.5002 | 38.2722 | nan | nan | 459.5236 | 476.2338 | | yolov3 | 16 | 2.8733 | 8.9076 | 12.2564 | 44.9771 | 413.9688 | 410.5355 | | hf_T5_large 
| 2 | 12.6314 | 41.5885 | nan | nan | 205.4873 | 201.7729 | | timm_vision_transformer | 8 | 0.7856 | 4.5736 | 5.9599 | 9.5398 | 153.4416 | 160.3076 | | speech_transformer | 32 | 1.6279 | 8.4704 | nan | nan | 152.9591 | 148.9265 | | timm_resnest | 32 | 0.5492 | 2.6896 | 3.9212 | 35.7153 | 147.2225 | 142.5265 | | attention_is_all_you_need_pytorch | 256 | 1.1225 | 7.5134 | nan | nan | 136.3311 | 137.5095 | | timm_vision_transformer_large | 8 | 2.2943 | 14.4897 | nan | 25.9253 | 117.3628 | 123.3241 | | pytorch_stargan | 16 | 0.3849 | 2.4361 | 3.2232 | 4.0227 | 92.3776 | 102.2742 | | pytorch_struct | 200 | 0.2404 | 0.8247 | 1.3498 | 4.1599 | 92.3395 | 73.867 | | BERT_pytorch | 16 | 1.4589 | 7.7754 | nan | nan | 92.0246 | 93.1061 | | fastNLP_Bert | 6 | 1.4803 | 7.0115 | 10.6166 | nan | 66.1196 | 64.1 | | hf_GPT2 | 4 | 1.2849 | 6.4734 | 9.581 | nan | 63.4356 | 62.7736 | | hf_Bart | 4 | 1.4354 | 8.3888 | 12.4238 | nan | 50.8848 | 49.4684 | | mobilenet_v3_large | 32 | 0.8558 | 5.0971 | 7.0187 | 53.4283 | 46.4169 | 46.0411 | | densenet121 | 4 | 2.0854 | 13.8001 | 21.0823 | 89.7884 | 45.582 | 43.524 | | hf_Albert | 8 | 1.0142 | 5.9834 | 8.9882 | nan | 43.2885 | 40.8113 | | hf_BigBird | 2 | 7.3116 | 13.8529 | 30.4412 | nan | 41.7364 | 26.9154 | | hf_Bert | 4 | 1.3986 | 6.4943 | 9.2672 | nan | 40.2879 | 39.072 | | hf_Reformer | 4 | 2.4077 | nan | 9.3638 | nan | 36.1775 | 34.6912 | | timm_regnet | 32 | 2.2098 | 8.7135 | 21.5731 | 48.0162 | 35.5577 | 32.6271 | | timm_efficientnet | 32 | 1.7224 | 6.9002 | 16.2455 | 53.1702 | 35.5396 | 33.1986 | | timm_nfnet | 128 | 1.9232 | 7.9641 | nan | 30.4729 | 31.4204 | 29.4342 | | hf_DistilBert | 8 | 0.4805 | 3.1646 | 5.9828 | nan | 30.6968 | 30.7352 | | resnet50 | 32 | 0.8401 | 5.1077 | 7.1067 | 32.6952 | 30.61 | 28.9547 | | timm_vovnet | 32 | 1.4643 | 4.7686 | 10.593 | 23.8195 | 30.2593 | 27.0792 | | resnext50_32x4d | 8 | 0.8621 | 5.0762 | 7.0093 | 28.8407 | 29.692 | 29.1772 | | mnasnet1_0 | 32 | 0.7748 | 4.7415 | 6.5712 | 31.1452 | 
28.9074 | 28.2481 | | functorch_dp_cifar10 | 64 | 0.3534 | 2.0922 | 2.9489 | 5.6444 | 26.0975 | 25.5771 | | resnet18 | 16 | 0.399 | 1.9641 | 2.8601 | 17.4993 | 22.8969 | 21.9235 | | shufflenet_v2_x1_0 | 128 | 0.9024 | 5.6571 | 8.0459 | 27.0932 | 18.3878 | 17.7721 | | Super_SloMo | 6 | 1.013 | 5.334 | 7.0228 | nan | 17.2736 | 16.7825 | | Background_Matting | 4 | 0.7505 | 4.7428 | 6.8589 | 30.3629 | 17.0207 | 15.968 | | mobilenet_v2 | 96 | 0.7851 | 4.8185 | 6.9951 | 37.068 | 16.9112 | 16.395 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.379 | 2.3162 | 3.0731 | 3.9214 | 8.3507 | 8.0143 | | pytorch_unet | 1 | 0.4315 | 2.2389 | 3.0989 | 19.953 | 8.3006 | 7.9089 | | LearningToPaint | 96 | 0.4277 | 2.0612 | 2.9069 | 24.0299 | 7.2519 | 6.9993 | | squeezenet1_1 | 32 | 0.2226 | 1.006 | 1.4333 | 4.577 | 4.0949 | 3.7854 | | nvidia_deeprecommender | 256 | 0.1907 | 0.4349 | 0.6856 | 2.4495 | 4.0058 | 3.71 | | drq | 1 | 0.1368 | 0.4539 | 0.7737 | 3.5104 | 3.8337 | 3.2633 | | vgg16 | 64 | 0.1795 | 0.6625 | 1.0525 | 2.493 | 3.5342 | 3.3473 | | soft_actor_critic | 256 | 0.1968 | 0.3518 | 0.5498 | 1.5108 | 3.4266 | 2.7146 | | alexnet | 128 | 0.1445 | 0.4185 | 0.6973 | 2.3862 | 2.9512 | 2.6905 | | dcgan | 32 | 0.1677 | 0.4586 | 0.6723 | 3.767 | 2.6567 | 2.4781 | | lennard_jones | 1000 | 0.1388 | 0.292 | 0.4421 | 1.0721 | 1.9645 | 1.7635 | | tts_angular | 64 | 0.2095 | 0.2684 | 0.4085 | 0.9997 | 1.8827 | 1.6681 | | demucs | 4 | 0.3029 | 0.2973 | 0.3102 | 0.3061 | 0.2077 | 0.2076 | | hf_GPT2_large | 4 | 4.9466 | 19.8968 | nan | nan | nan | 142.6081 | | hf_T5 | 8 | 2.055 | 9.4609 | nan | nan | nan | 44.7663 | | dlrm | 2048 | nan | 0.8311 | nan | nan | nan | 2.9874 | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | mobilenet_v2_quantized_qat | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | resnet50_quantized_qat | 0 | nan | nan | nan | nan | nan | nan | | tacotron2 | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2636 | 0.7837 | 1.3107 | 1.3377 | | Super_SloMo | 6 | 1.0024 | 0.9525 | 0.363 | nan | 1.1857 | 1.1913 | | timm_efficientdet | 1 | 1.0111 | 0.823 | nan | nan | 1.1165 | 1.1428 | | mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.7638 | 1.1005 | 1.1105 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3374 | 0.9742 | 1.0823 | 1.1267 | | timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.9478 | 1.0219 | 1.0495 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.35 | 0.8662 | 0.9791 | 1.0072 | | hf_GPT2 | 4 | 0.9548 | 0.887 | 0.353 | nan | 0.9505 | 1.0819 | | timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9284 | 0.9323 | | Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | 0.9749 | 0.9212 | 0.9238 | | yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8814 | 0.9151 | 0.919 | | pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | 1.0085 | 0.9023 | 0.9928 | | timm_resnest | 32 | 0.9935 | 0.88 | 0.3236 | 0.8024 | 0.8982 | 0.9697 | | speech_transformer | 32 | 0.9982 | 0.9159 | nan | nan | 0.8959 | 0.8996 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9149 | 0.3919 | 0.9141 | 0.8862 | 0.9646 | | mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3278 | 0.8681 | 0.8829 | 0.8964 | | hf_Albert | 8 | 0.9333 | 0.9333 | 0.2822 | nan | 0.8804 | 1.1942 | | hf_T5_large | 2 | 0.922 | 
0.8722 | nan | nan | 0.8737 | 0.922 | | pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8496 | 0.859 | 0.8608 | | densenet121 | 4 | 0.9904 | 0.8812 | 0.3437 | 0.8551 | 0.857 | 0.9307 | | resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.797 | 0.8564 | 0.8913 | | hf_Bert | 4 | 0.9683 | 0.8952 | 0.3395 | nan | 0.8564 | 0.9017 | | hf_Bart | 4 | 0.9618 | 0.879 | 0.3245 | nan | 0.8531 | 1.0964 | | mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.3331 | 0.8263 | 0.8531 | 0.8659 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | nan | 0.8343 | 1.0755 | | resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8203 | 0.8303 | 0.8352 | | timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | 0.801 | 0.8286 | 0.9823 | | BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 | 1.0689 | | hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4303 | nan | 0.8205 | 1.0404 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | nan | 0.816 | 0.9432 | | hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.2988 | nan | 0.7841 | 0.8605 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.7903 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8772 | 0.7632 | 0.8778 | | soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7455 | 0.743 | 0.8332 | | timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3201 | 0.7741 | 0.7286 | 0.7339 | | LearningToPaint | 96 | 0.9442 | 0.716 | 0.3383 | 0.6272 | 0.7133 | 0.7462 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3307 | 0.8104 | 0.712 | 0.7779 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6971 | 0.6902 | 0.7049 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6639 | 0.6471 | 0.6497 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 1.0947 | 0.5646 | 0.9989 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.429 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4456 | 0.8227 | 0.4056 | 0.4212 | | 
hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.299 | 0.9882 | | hf_T5 | 8 | 0.9527 | 0.9445 | nan | nan | nan | 1.1507 | | hf_GPT2_large | 4 | 0.936 | 0.8768 | nan | nan | nan | 1.0941 | | dlrm | 2048 | nan | 0.7305 | nan | nan | nan | 0.7306 | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | mobilenet_v2_quantized_qat | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | resnet50_quantized_qat | 0 | nan | nan | nan | nan | nan | nan | | tacotron2 | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name                                    | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| YituTechConvBert                        | 1   | 1.0322 | 0.9286    | 0.0            | 0.0         | 3.2857   | 1.4837                 |
| MT5ForConditionalGeneration             | 8   | 1.0241 | 0.911     | 0.0            | 0.0         | 2.6289   | 1.9706                 |
| MobileBertForMaskedLM                   | 32  | 1.0238 | 0.8967    | 0.0            | 0.0         | 2.5667   | 1.567                  |
| CamemBert                               | 1   | 1.0522 | 0.9506    | 1.3222         | 0.0         | 2.3526   | 1.5538                 |
| GoogleFnet                              | 1   | 0.9976 | 0.8053    | 0.9757         | 1.1197      | 2.1236   | 1.1053                 |
| DistillGPT2                             | 1   | 1.0358 | 0.9382    | 1.0302         | 0.0         | 2.0234   | 1.8859                 |
| GPT2ForSequenceClassification           | 4   | 1.0005 | 0.971     | 0.0            | 0.0         | 1.6716   | 1.6623                 |
| M2M100ForConditionalGeneration          | 8   | 1.1038 | 1.0651    | 0.9719         | 0.0         | 1.5317   | 1.3122                 |
| T5ForConditionalGeneration              | 4   | 1.0005 | 0.9631    | 0.0            | 0.0         | 1.4354   | 1.427                  |
| MobileBertForQuestionAnswering          | 64  | 1.0235 | 0.8892    | 0.0            | 0.0         | 1.4293   | 1.2814                 |
| T5Small                                 | 1   | 1.0256 | 0.9437    | 0.0            | 0.0         | 1.4051   | 1.1741                 |
| ElectraForCausalLM                      | 32  | 1.0009 | 0.9321    | 0.0            | 0.0         | 1.3666   | 1.4021                 |
| ElectraForQuestionAnswering             | 64  | 1.0001 | 0.9859    | 0.0            | 0.0         | 1.3611   | 1.3419                 |
| PLBartForConditionalGeneration          | 16  | 1.016  | 0.8877    | 0.7834         | 0.0         | 1.311    | 1.2039                 |
| AlbertForQuestionAnswering              | 4   | 1.0003 | 1.002     | 0.0            | 0.0         | 1.2669   | 1.2601                 |
| AlbertForMaskedLM                       | 4   | 1.0002 | 0.9999    | 0.0            | 0.0         | 1.2626   | 1.256                  |
| XGLMForCausalLM                         | 8   | 1.0119 | 0.9414    | 0.0            | 0.0         | 1.2519   | 1.1754                 |
| LayoutLMForSequenceClassification       | 16  | 1.0002 | 0.9897    | 0.7379         | 0.0         | 1.2491   | 1.2388                 |
| OPTForCausalLM                          | 32  | 1.0028 | 0.918     | 0.6969         | 0.0         | 1.1786   | 1.2                    |
| LayoutLMForMaskedLM                     | 16  | 1.0003 | 0.9642    | 0.0            | 0.0         | 1.1709   | 1.1757                 |
| DistilBertForQuestionAnswering          | 64  | 1.0    | 0.9862    | 0.7137         | 0.0         | 1.1482   | 1.1336                 |
| RobertaForCausalLM                      | 64  | 1.0004 | 0.9547    | 0.7336         | 0.0         | 1.1231   | 1.1303                 |
| MegatronBertForQuestionAnswering        | 16  | 1.0402 | 0.9223    | 0.7571         | 0.0         | 1.1148   | 1.0728                 |
| Speech2Text2ForCausalLM                 | 128 | 0.9986 | 0.9285    | 0.6608         | 0.0         | 1.1048   | 1.144                  |
| BartForConditionalGeneration            | 2   | 1.0005 | 0.9872    | 0.0            | 0.0         | 1.1031   | 1.0956                 |
| BartForCausalLM                         | 4   | 1.0008 | 0.9685    | 0.7402         | 0.0         | 1.1      | 1.1091                 |
| MBartForConditionalGeneration           | 16  | 1.0401 | 0.9843    | 0.7543         | 0.0         | 1.098    | 1.0901                 |
| BigBird                                 | 1   | 0.9925 | 0.9408    | 1.0057         | 0.0         | 1.0969   | 1.0001                 |
| MegatronBertForCausalLM                 | 16  | 1.0338 | 0.9888    | 0.7351         | 0.0         | 1.0925   | 1.079                  |
| DebertaForMaskedLM                      | 4   | 0.932  | 0.8128    | 0.7366         | 0.0         | 1.0855   | 1.0664                 |
| RobertaForQuestionAnswering             | 128 | 1.0    | 0.9869    | 0.0            | 0.0         | 1.0847   | 1.0719                 |
| BertForQuestionAnswering                | 128 | 1.0001 | 0.9941    | 0.0            | 0.0         | 1.0845   | 1.0723                 |
| PegasusForConditionalGeneration         | 16  | 1.0116 | 0.9852    | 0.7616         | 0.0         | 1.0833   | 1.0789                 |
| BlenderbotSmallForConditionalGeneration | 64  | 1.0007 | 0.9394    | 0.0            | 0.0         | 1.0681   | 1.0753                 |
| DebertaForQuestionAnswering             | 8   | 0.9974 | 0.9935    | 0.6831         | 0.0         | 1.0623   | 1.2018                 |
| DistilBertForMaskedLM                   | 64  | 1.0006 | 0.9517    | 0.6893         | 0.0         | 1.0416   | 1.0592                 |
| BertForMaskedLM                         | 64  | 1.0002 | 0.962     | 0.7175         | 0.0         | 1.0372   | 1.0413                 |
| PLBartForCausalLM                       | 32  | 1.0048 | 0.9226    | 0.6882         | 0.0         | 1.0224   | 1.0484                 |
| BlenderbotSmallForCausalLM              | 64  | 1.0016 | 0.9115    | 0.653          | 0.0         | 1.0059   | 1.0432                 |
| TrOCRForCausalLM                        | 32  | 1.0012 | 0.958     | 0.0            | 0.0         | 1.0014   | 1.0132                 |
| MBartForCausalLM                        | 32  | 1.0008 | 0.958     | 0.7186         | 0.0         | 0.9988   | 1.0092                 |
| PegasusForCausalLM                      | 32  | 0.9997 | 0.9548    | 0.732          | 0.0         | 0.9911   | 1.0041                 |
| AllenaiLongformerBase                   | 0   | 0.0    | 0.0       | 0.0            | 0.0         | 0.0      | 0.0                    |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
| name                                    | bs | eager       | aot_eager   | aot_cudagraphs | aot_nvfuser | inductor    | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
| GoogleFnet                              | 1  | pass        | pass        | pass           | pass        | pass        | pass                   |
| GPT2ForSequenceClassification           | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| RobertaForCausalLM                      | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| RobertaForQuestionAnswering             | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| Speech2Text2ForCausalLM                 | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| TrOCRForCausalLM                        | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| AlbertForMaskedLM                       | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| AlbertForQuestionAnswering              | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| BartForConditionalGeneration            | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| DebertaForQuestionAnswering             | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| MT5ForConditionalGeneration             | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| PegasusForCausalLM                      | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| MobileBertForMaskedLM                   | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| MobileBertForQuestionAnswering          | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| T5ForConditionalGeneration              | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| T5Small                                 | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| XGLMForCausalLM                         | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| XLNetLMHeadModel                        | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| YituTechConvBert                        | 1  | pass        | pass        | fail_to_run    | fail_to_run | pass        | pass                   |
| DebertaForMaskedLM                      | 1  | pass        | pass        | fail_accuracy  | fail_to_run | pass        | pass                   |
| BartForCausalLM                         | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| PegasusForConditionalGeneration         | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| PLBartForCausalLM                       | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| DistillGPT2                             | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| BertForMaskedLM                         | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| BertForQuestionAnswering                | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| BigBird                                 | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| BlenderbotSmallForCausalLM              | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| BlenderbotSmallForConditionalGeneration | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| CamemBert                               | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| DistilBertForMaskedLM                   | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| OPTForCausalLM                          | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| DistilBertForQuestionAnswering          | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| ElectraForCausalLM                      | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| ElectraForQuestionAnswering             | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| LayoutLMForMaskedLM                     | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| LayoutLMForSequenceClassification       | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| M2M100ForConditionalGeneration          | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| MBartForCausalLM                        | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| MegatronBertForCausalLM                 | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| MegatronBertForQuestionAnswering        | 1  | pass        | pass        | pass           | fail_to_run | pass        | pass                   |
| PLBartForConditionalGeneration          | 1  | pass        | pass        | pass           | fail_to_run | fail_to_run | fail_to_run            |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | XGLMForCausalLM | 8 | 2.2985 | 12.6763 | nan | nan | 203.6891 | 200.2767 | | DebertaForQuestionAnswering | 8 | 4.5403 | 11.4247 | 46.4595 | nan | 171.5092 | 102.4203 | | DebertaForMaskedLM | 4 | 4.6123 | 11.5887 | 46.0078 | nan | 171.2419 | 104.2544 | | M2M100ForConditionalGeneration | 8 | 2.6023 | 12.9551 | 22.1161 | nan | 126.5533 | 131.4465 | | YituTechConvBert | 1 | 2.1207 | 10.3087 | nan | nan | 119.0593 | 116.0233 | | MobileBertForMaskedLM | 32 | 7.8978 | 28.8989 | nan | nan | 90.7721 | 87.9723 | | MT5ForConditionalGeneration | 8 | 3.2861 | 14.5828 | nan | nan | 89.6576 | 90.9357 | | MobileBertForQuestionAnswering | 64 | 7.9453 | 29.0084 | nan | nan | 75.2123 | 72.6774 | | MegatronBertForCausalLM | 16 | 3.1817 | 13.6768 | 20.1344 | nan | 62.2739 | 60.2614 | | MegatronBertForQuestionAnswering | 16 | 3.1327 | 13.4135 | 19.7731 | nan | 60.6824 | 58.3612 | | LayoutLMForSequenceClassification | 16 | 1.5798 | 6.8793 | 10.3152 | nan | 60.5796 | 57.4539 | | T5ForConditionalGeneration | 4 | 2.0135 | 9.3825 | nan | nan | 59.3736 | 58.6913 | | PegasusForConditionalGeneration | 16 | 2.6479 | 15.5826 | 25.1173 | nan | 58.188 | 54.1699 | | BartForConditionalGeneration | 2 | 2.8818 | 15.9096 | nan | nan | 
57.0617 | 55.1758 | | MBartForConditionalGeneration | 16 | 2.8784 | 15.9804 | 25.3712 | nan | 54.1345 | 52.6786 | | T5Small | 1 | 2.015 | 9.3876 | nan | nan | 54.0177 | 54.2117 | | PLBartForConditionalGeneration | 16 | 1.3858 | 8.4331 | 11.9259 | nan | 48.2602 | 46.3276 | | BlenderbotSmallForConditionalGeneration | 64 | 1.7395 | 10.4133 | nan | nan | 43.3567 | 42.0299 | | BigBird | 1 | 7.2389 | 13.8123 | 30.1154 | nan | 41.5957 | 26.8249 | | ElectraForCausalLM | 32 | 1.3496 | 6.5873 | nan | nan | 40.8107 | 39.8316 | | DistillGPT2 | 1 | 0.6566 | 3.2629 | 4.4575 | nan | 34.8336 | 33.828 | | LayoutLMForMaskedLM | 16 | 1.4746 | 6.8259 | nan | nan | 32.6309 | 31.9015 | | BertForMaskedLM | 64 | 1.3679 | 6.9199 | 9.7535 | nan | 32.5267 | 32.2817 | | ElectraForQuestionAnswering | 64 | 1.4289 | 6.5969 | nan | nan | 32.2964 | 31.472 | | GPT2ForSequenceClassification | 4 | 1.3167 | 6.4634 | nan | nan | 31.0139 | 30.7384 | | RobertaForCausalLM | 64 | 1.367 | 6.6562 | 9.9558 | nan | 28.5936 | 27.7608 | | BertForQuestionAnswering | 128 | 1.3766 | 6.5759 | nan | nan | 27.7677 | 27.211 | | PegasusForCausalLM | 32 | 1.035 | 6.0746 | 9.4917 | nan | 27.1192 | 25.3485 | | MBartForCausalLM | 32 | 1.0118 | 5.9265 | 8.9121 | nan | 25.3271 | 24.2305 | | TrOCRForCausalLM | 32 | 0.9907 | 5.9939 | nan | nan | 24.5949 | 23.713 | | BartForCausalLM | 4 | 1.0684 | 5.9475 | 9.1817 | nan | 24.4743 | 22.8046 | | RobertaForQuestionAnswering | 128 | 1.4283 | 6.7034 | nan | nan | 24.4342 | 23.8045 | | AlbertForMaskedLM | 4 | 1.1046 | 6.1904 | nan | nan | 23.6455 | 22.665 | | GoogleFnet | 1 | 0.802 | 3.5063 | 10.9843 | 9.8353 | 23.5824 | 16.0928 | | BlenderbotSmallForCausalLM | 64 | 0.6434 | 4.0765 | 6.1732 | nan | 23.252 | 22.4689 | | DistilBertForMaskedLM | 64 | 0.5302 | 3.1918 | 6.3091 | nan | 23.0614 | 22.7622 | | AlbertForQuestionAnswering | 4 | 1.1145 | 6.1583 | nan | nan | 22.4725 | 21.4997 | | DistilBertForQuestionAnswering | 64 | 0.5133 | 3.1543 | 6.4246 | nan | 22.1165 | 21.87 | | 
OPTForCausalLM | 32 | 1.0542 | 6.3196 | 14.0811 | nan | 21.8721 | 21.0082 | | CamemBert | 1 | 1.4438 | 6.5437 | 9.3165 | nan | 21.6547 | 21.1919 | | Speech2Text2ForCausalLM | 128 | 0.5788 | 3.1117 | 4.8042 | nan | 19.7945 | 17.9931 | | PLBartForCausalLM | 32 | 0.4783 | 3.1633 | 4.5835 | nan | 18.8242 | 18.1771 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | nan | 1.0318 | 1.0911 | | BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.9489 | 1.0035 | | RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.9489 | 1.0035 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | 0.9361 | 1.0025 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | nan | 0.9339 | 0.9827 | | DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3178 | nan | 0.8896 | 0.9987 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | 0.8698 | 0.9409 | | PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | nan | 0.8602 | 0.8971 | | T5Small | 1 | 1.0 | 0.9325 | nan | nan | 0.8564 | 1.0758 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9643 | 0.3704 | nan | 0.8446 | 0.9753 | | BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3405 | nan | 0.8438 | 0.9191 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | nan | 0.8428 | 0.9785 | | MegatronBertForCausalLM | 16 | 0.9995 | 0.8734 | 0.3426 | nan | 0.8412 | 0.9604 | | 
AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | nan | 0.841 | 1.3637 | | BartForConditionalGeneration | 2 | 1.0 | 0.9038 | nan | nan | 0.84 | 0.9815 | | BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3433 | nan | 0.8321 | 0.922 | | RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3429 | nan | 0.8309 | 0.9212 | | BigBird | 1 | 0.999 | 0.9542 | 0.4213 | nan | 0.822 | 1.0115 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.8215 | 1.1049 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | nan | 0.8201 | 1.3395 | | DistillGPT2 | 1 | 0.9984 | 0.7704 | 0.3572 | nan | 0.8184 | 0.93 | | CamemBert | 1 | 0.998 | 0.7977 | 0.3507 | nan | 0.8088 | 0.8656 | | MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3428 | nan | 0.8083 | 0.8986 | | TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | nan | nan | 0.8079 | 0.8984 | | XGLMForCausalLM | 8 | 0.9848 | 0.9267 | nan | nan | 0.8058 | 0.9504 | | YituTechConvBert | 1 | 0.9858 | 0.7923 | nan | nan | 0.8025 | 0.8667 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8721 | 0.3374 | nan | 0.798 | 0.9514 | | OPTForCausalLM | 32 | 0.9982 | 0.8655 | 0.3276 | nan | 0.7952 | 0.9067 | | PLBartForConditionalGeneration | 16 | 1.0 | 0.8964 | 0.3314 | nan | 0.7861 | 0.9514 | | ElectraForCausalLM | 32 | 0.9994 | 0.883 | nan | nan | 0.7793 | 0.8833 | | PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3557 | nan | 0.7739 | 0.8854 | | Speech2Text2ForCausalLM | 128 | 0.9552 | 0.842 | 0.3524 | nan | 0.7727 | 0.8857 | | DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3394 | nan | 0.7724 | 0.8899 | | GoogleFnet | 1 | 0.9983 | 0.9453 | 0.3715 | 1.0813 | 0.7687 | 0.9366 | | MT5ForConditionalGeneration | 8 | 1.0034 | 0.8861 | nan | nan | 0.7623 | 0.9396 | | M2M100ForConditionalGeneration | 8 | 1.0004 | 0.9685 | 0.4048 | nan | 0.755 | 0.9848 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | nan | 0.7528 | 0.9074 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3578 | nan | 0.7277 | 0.8452 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | nan | 
nan | 0.5256 | 0.7111 | | MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | nan | nan | 0.4536 | 0.5968 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3554 | nan | 0.4265 | 1.0346 | | DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | nan | 0.3264 | 1.1588 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | ghostnet_100 | 128 | 0.9992 | 0.9952 | 0.8323 | 1.2494 | 1.8151 | 1.7763 | | lcnet_050 | 128 | 0.9559 | 0.9497 | 0.7507 | 1.5008 | 1.6591 | 1.6342 | | coat_lite_mini | 128 | 0.9999 | 0.998 | 0.8425 | 1.0567 | 1.6343 | 1.607 | | tnt_s_patch16_224 | 128 | 0.9998 | 0.9991 | 0.0 | 1.6291 | 1.5537 | 1.5498 | | regnety_002 | 128 | 0.976 | 0.9853 | 0.8505 | 1.3458 | 1.5378 | 1.3541 | | dm_nfnet_f0 | 128 | 0.9993 | 1.0002 | 0.0 | 1.2113 | 1.4732 | 1.4221 | | xcit_large_24_p8_224 | 5 | 1.0027 | 0.9974 | 0.0 | 0.0 | 1.4616 | 1.4159 | | hrnet_w18 | 128 | 0.9999 | 0.9986 | 0.0 | 1.32 | 1.4201 | 1.3797 | | volo_d1_224 | 64 | 0.9999 | 0.9961 | 0.0 | 1.1322 | 1.3882 | 1.3669 | | dla102 | 128 | 0.9997 | 1.0009 | 0.0 | 1.2858 | 1.3843 | 1.3695 | | nfnet_l0 | 128 | 0.9997 | 0.7892 | 0.0 | 1.0551 | 1.3763 | 1.3278 | | res2net50_14w_8s | 128 | 0.9998 | 0.9998 | 0.0 | 1.2272 | 1.3584 | 1.3263 | | crossvit_9_240 | 128 | 0.9996 | 0.9994 | 0.0 | 1.0253 | 1.3373 | 1.3113 | | mobilenetv3_large_100 | 128 | 0.9655 | 0.9602 | 0.7644 | 1.169 | 1.3354 | 1.3423 | | mobilenetv2_100 | 128 | 0.9652 | 0.964 | 0.7064 | 1.0159 | 1.3344 | 1.3503 | | gluon_inception_v3 | 128 | 1.0 | 0.9987 | 0.0 | 1.1259 | 1.3282 | 1.3071 | | adv_inception_v3 | 128 | 0.9999 | 0.9993 | 0.0 | 1.1264 | 1.3276 | 1.31 | | inception_v3 | 128 | 1.0 | 0.9988 | 0.0 | 1.1258 | 1.3259 | 1.3087 | | res2next50 | 128 | 1.0 | 1.001 | 0.0 | 1.1668 | 1.3135 | 1.2754 | | resnest101e | 64 | 0.9998 | 1.0032 | 0.0 | 1.1972 | 1.3124 | 1.2723 | | fbnetv3_b | 128 | 0.9649 | 0.959 | 0.7493 | 1.135 | 1.2824 | 1.2923 | | gmixer_24_224 | 128 | 0.9998 | 0.8347 | 0.0 | 
0.98 | 1.2817 | 1.2725 | | jx_nest_base | 32 | 0.9999 | 0.9953 | 0.0 | 1.2171 | 1.2769 | 1.2512 | | botnet26t_256 | 128 | 0.9857 | 0.9855 | 0.7904 | 1.2275 | 1.2686 | 1.2808 | | mnasnet_100 | 128 | 0.9662 | 0.9638 | 0.7858 | 1.1594 | 1.268 | 1.2825 | | selecsls42b | 128 | 1.0 | 0.9993 | 0.8142 | 1.2101 | 1.2652 | 1.2528 | | sebotnet33ts_256 | 64 | 0.9755 | 0.8077 | 0.0 | 1.0536 | 1.2648 | 1.2709 | | eca_botnext26ts_256 | 128 | 0.9868 | 0.7726 | 0.0 | 1.0303 | 1.2634 | 1.2491 | | eca_halonext26ts | 128 | 0.9872 | 0.7789 | 0.0 | 1.03 | 1.2616 | 1.241 | | tf_efficientnet_b0 | 128 | 0.9768 | 0.784 | 0.0 | 0.9856 | 1.258 | 1.2647 | | convit_base | 64 | 0.9997 | 0.999 | 0.0 | 1.1951 | 1.2566 | 1.2326 | | fbnetc_100 | 128 | 0.9665 | 0.9633 | 0.7791 | 1.1894 | 1.2472 | 1.2663 | | ese_vovnet19b_dw | 128 | 0.9791 | 0.978 | 0.7438 | 1.1463 | 1.2445 | 1.2496 | | spnasnet_100 | 128 | 0.9616 | 0.9576 | 0.7728 | 1.1379 | 1.2351 | 1.2553 | | cspdarknet53 | 64 | 0.9578 | 0.9533 | 0.7372 | 1.1844 | 1.229 | 1.2348 | | pit_b_224 | 64 | 1.0 | 0.9997 | 0.0 | 1.0554 | 1.2276 | 1.2162 | | res2net101_26w_4s | 64 | 1.0003 | 0.998 | 0.7702 | 1.1754 | 1.2263 | 1.1907 | | gmlp_s16_224 | 128 | 0.9999 | 0.999 | 0.0 | 0.9988 | 1.2237 | 1.213 | | rexnet_100 | 128 | 0.9726 | 0.8172 | 0.0 | 0.9838 | 1.2143 | 1.2197 | | pnasnet5large | 16 | 0.9998 | 0.9985 | 0.0 | 1.084 | 1.2098 | 1.1938 | | tinynet_a | 128 | 0.9665 | 0.7757 | 0.6203 | 0.9715 | 1.1896 | 1.2007 | | dpn107 | 32 | 0.9582 | 0.9509 | 0.7794 | 1.029 | 1.1894 | 1.203 | | cait_m36_384 | 4 | 0.9998 | 1.026 | 0.0 | 1.01 | 1.1867 | 1.1621 | | mobilevit_s | 64 | 0.9796 | 0.7621 | 0.0 | 0.9503 | 1.173 | 1.17 | | tf_mixnet_l | 128 | 0.9857 | 0.8902 | 0.0 | 1.0181 | 1.1711 | 1.1706 | | repvgg_a2 | 128 | 0.9636 | 0.9628 | 0.8262 | 1.1216 | 1.1698 | 1.1673 | | poolformer_m36 | 64 | 0.9998 | 0.9997 | 0.0 | 0.0 | 1.1664 | 1.1478 | | mixnet_l | 128 | 0.9849 | 0.886 | 0.0 | 1.0186 | 1.1532 | 1.1532 | | twins_pcpvt_base | 64 | 0.9998 | 0.9995 | 0.7488 | 
1.0638 | 1.1525 | 1.1237 | | convnext_base | 64 | 1.0 | 0.9987 | 0.0 | 1.0437 | 1.1466 | 1.1195 | | swin_base_patch4_window7_224 | 64 | 1.0 | 0.9791 | 0.0 | 0.9888 | 1.1417 | 1.1351 | | beit_base_patch16_224 | 64 | 0.9999 | 0.9813 | 0.0 | 0.9496 | 1.1189 | 1.1087 | | swsl_resnext101_32x16d | 32 | 1.0 | 0.9995 | 0.0 | 1.1083 | 1.1092 | 1.0714 | | deit_base_distilled_patch16_224 | 64 | 1.0 | 0.9992 | 0.7653 | 1.0117 | 1.1004 | 1.0909 | | vit_base_patch16_224 | 64 | 0.9999 | 0.9974 | 0.7675 | 0.9727 | 1.0936 | 1.0813 | | gluon_xception65 | 32 | 0.9997 | 0.9971 | 0.0 | 1.0382 | 1.0873 | 1.0759 | | mixer_b16_224 | 128 | 1.0002 | 1.0003 | 0.0 | 0.9764 | 1.0833 | 1.0742 | | convmixer_768_32 | 32 | 0.9999 | 1.0 | 0.0 | 1.0614 | 1.0775 | 1.0747 | | gernet_l | 128 | 0.9742 | 0.9725 | 0.8233 | 1.098 | 1.0765 | 1.0717 | | visformer_small | 128 | 0.9999 | 1.0029 | 0.7984 | 1.0208 | 1.0507 | 1.0175 | | resmlp_12_224 | 128 | 1.0 | 1.0007 | 0.6957 | 0.0 | 0.958 | 0.9617 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | 
pass | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | jx_nest_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_accuracy | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | 
dla102 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilevit_s | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager 
| aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | twins_pcpvt_base | 64 | 2.1064 | 13.6124 | 22.5332 | 44.5522 | 407.4272 | 421.8365 | | coat_lite_mini | 128 | 1.0545 | 5.4638 | 8.5507 | 14.592 | 363.5371 | 364.8512 | | mobilevit_s | 64 | 1.6462 | 7.5357 | nan | 42.5465 | 235.2127 | 233.1142 | | eca_halonext26ts | 128 | 1.5109 | 5.8122 | nan | 55.3877 | 209.9798 | 206.4748 | | sebotnet33ts_256 | 64 | 1.6237 | 6.6615 | nan | 51.2499 | 190.7575 | 190.0027 | | swin_base_patch4_window7_224 | 64 | 2.5902 | 13.0683 | nan | 59.6144 | 176.5977 | 174.5962 | | eca_botnext26ts_256 | 128 | 1.3981 | 5.4191 | nan | 53.0178 | 176.4393 | 178.3558 | | xcit_large_24_p8_224 | 5 | 2.7334 | 18.1337 | nan | nan | 171.4747 | 166.5968 | | jx_nest_base | 32 | 1.6866 | 10.041 | nan | 60.8033 | 154.3111 | 154.2772 | | cait_m36_384 | 4 | 2.7866 | 19.1316 | nan | 45.3486 | 133.9738 | 126.9168 | | convnext_base | 64 | 1.2021 | 6.4828 | nan | 21.9482 | 131.8442 | 130.3441 | | botnet26t_256 | 128 | 1.4442 | 5.0458 | 10.5014 | 40.8301 | 107.3642 | 105.6294 | | hrnet_w18 | 128 | 5.9732 | 33.7365 | nan | 258.2899 | 107.2875 | 100.8767 | | crossvit_9_240 | 128 | 1.3991 | 8.4784 | nan | 27.2552 | 97.4491 | 95.8135 | | resnest101e | 64 | 3.1166 | 17.9608 | nan | 79.3142 | 91.5494 | 88.0441 | | pnasnet5large | 16 | 4.3749 | 24.205 | nan | 126.2209 | 87.8382 | 85.2509 | | volo_d1_224 | 64 | 1.2721 | 8.0443 | nan | 27.6318 | 84.7613 | 84.3336 | | gmlp_s16_224 | 128 | 0.9982 | 6.7098 | nan | 13.7666 | 72.2002 | 69.6827 | | visformer_small | 128 | 0.9284 | 4.6798 | 6.5872 | 25.4051 | 71.6266 | 70.2003 | | pit_b_224 | 64 | 0.9623 | 5.1661 | nan | 12.8012 | 67.0609 | 65.441 | | res2net101_26w_4s | 64 | 2.9596 | 18.2967 | 30.5674 | 82.3674 | 56.5342 | 53.1778 | | tnt_s_patch16_224 | 128 | 1.6619 | 11.0684 | nan | 23.6515 | 53.1301 | 49.8372 | | 
gmixer_24_224 | 128 | 1.0668 | 7.7269 | nan | 16.8853 | 52.4953 | 50.7642 | | res2net50_14w_8s | 128 | 2.6497 | 16.4188 | nan | 102.5509 | 51.7915 | 49.2823 | | convit_base | 64 | 1.0183 | 6.2959 | nan | 18.2815 | 51.3509 | 49.6135 | | gluon_xception65 | 32 | 1.8166 | 11.8212 | nan | 43.0492 | 49.0287 | 45.3635 | | poolformer_m36 | 64 | 1.9625 | 9.9587 | nan | nan | 47.2636 | 45.2596 | | swsl_resnext101_32x16d | 32 | 1.656 | 10.4396 | nan | 40.025 | 43.1311 | 38.2389 | | resmlp_12_224 | 128 | 0.6509 | 2.9707 | 5.6933 | nan | 42.5741 | 41.8756 | | dpn107 | 32 | 3.9328 | 16.425 | 48.0118 | 76.8514 | 41.0088 | 38.065 | | mixer_b16_224 | 128 | 0.7583 | 3.4742 | nan | 11.027 | 37.1717 | 35.7533 | | fbnetv3_b | 128 | 3.1021 | 11.6415 | 33.1049 | 76.603 | 36.7662 | 34.5131 | | convmixer_768_32 | 32 | 1.2336 | 6.8275 | nan | 14.0789 | 36.4106 | 32.8851 | | vit_base_patch16_224 | 64 | 0.8981 | 4.6713 | 6.7835 | 9.5716 | 35.9023 | 35.086 | | deit_base_distilled_patch16_224 | 64 | 0.9625 | 4.5007 | 7.5873 | 11.1046 | 35.7683 | 35.0427 | | gluon_inception_v3 | 128 | 1.5559 | 9.4799 | nan | 67.7436 | 35.151 | 32.7369 | | tf_mixnet_l | 128 | 5.6826 | 13.4479 | nan | 69.3986 | 34.9013 | 31.9321 | | inception_v3 | 128 | 1.5586 | 9.388 | nan | 67.7377 | 34.8894 | 32.4783 | | adv_inception_v3 | 128 | 1.5685 | 9.4095 | nan | 67.7037 | 34.327 | 32.4009 | | mixnet_l | 128 | 5.3416 | 13.2173 | nan | 69.2362 | 34.0143 | 31.1931 | | ghostnet_100 | 128 | 2.7574 | 10.2258 | 15.3268 | 59.4558 | 32.8531 | 30.6823 | | dla102 | 128 | 1.7449 | 10.6583 | nan | 64.0186 | 32.7614 | 30.6239 | | beit_base_patch16_224 | 64 | 1.1397 | 5.7288 | nan | 14.5224 | 32.2892 | 30.7223 | | dm_nfnet_f0 | 128 | 2.0187 | 8.0742 | nan | 30.4353 | 31.5084 | 29.8829 | | res2next50 | 128 | 1.6204 | 9.4079 | nan | 67.6723 | 29.7379 | 28.0747 | | rexnet_100 | 128 | 1.8332 | 7.8668 | nan | 103.179 | 27.1038 | 25.8224 | | tinynet_a | 128 | 2.0351 | 8.4806 | 21.0774 | 61.8917 | 26.207 | 24.6746 | | tf_efficientnet_b0 | 128 
| 1.723 | 7.1564 | nan | 61.3195 | 23.8449 | 22.4291 | | cspdarknet53 | 64 | 2.2287 | 7.9743 | 21.2872 | 49.5633 | 23.7336 | 22.5082 | | nfnet_l0 | 128 | 1.7113 | 7.931 | nan | 27.7671 | 23.085 | 21.8651 | | fbnetc_100 | 128 | 1.9814 | 7.1516 | 19.703 | 45.4371 | 22.6077 | 20.9507 | | spnasnet_100 | 128 | 1.9581 | 7.0371 | 18.6708 | 43.6267 | 22.0811 | 21.3115 | | mobilenetv3_large_100 | 128 | 1.5117 | 5.8597 | 13.8582 | 64.5508 | 20.2785 | 19.5219 | | mobilenetv2_100 | 128 | 1.5571 | 5.6349 | 14.2346 | 37.8495 | 18.9511 | 18.3495 | | regnety_002 | 128 | 1.5431 | 6.1735 | 14.19 | 46.9476 | 18.872 | 17.277 | | gernet_l | 128 | 1.9269 | 6.7529 | 17.0736 | 36.3963 | 18.5788 | 17.4833 | | mnasnet_100 | 128 | 1.569 | 5.69 | 14.5924 | 37.8397 | 18.5275 | 17.4759 | | repvgg_a2 | 128 | 1.9561 | 6.4737 | 16.3986 | 44.6516 | 18.288 | 17.3587 | | selecsls42b | 128 | 0.8176 | 4.1823 | 6.2503 | 39.0148 | 16.5258 | 15.4033 | | lcnet_050 | 128 | 0.9903 | 3.751 | 7.8725 | 31.4621 | 13.3796 | 12.394 | | ese_vovnet19b_dw | 128 | 0.9825 | 3.3443 | 7.1691 | 31.1424 | 13.0004 | 12.3977 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 0.9859 | 1.4283 | 1.494 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.7823 | 1.351 | 1.3692 | | nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.8084 | 1.2907 | 1.3392 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 0.8682 | 1.2619 | 1.2765 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.8401 | 1.1889 | 1.199 | | pnasnet5large | 
16 | 1.069 | 1.011 | nan | 1.2062 | 1.1876 | 1.3282 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | 0.7405 | 1.1489 | 1.1957 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | 0.7612 | 1.1378 | 1.2076 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | 0.7643 | 1.1375 | 1.2068 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.7635 | 1.1003 | 1.1104 | | convit_base | 64 | 0.9977 | 0.8838 | nan | 0.9506 | 1.0957 | 1.2656 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | 0.9562 | 1.0885 | 1.1416 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.069 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.9479 | 1.0218 | 1.0495 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.95 | 0.9994 | 1.0025 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9345 | 0.9853 | 1.0102 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | 0.9793 | 0.9836 | 0.9853 | | tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9711 | 1.0812 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.784 | 0.9696 | 0.977 | | mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 0.8571 | 0.9519 | 0.9937 | | dla102 | 128 | 0.9831 | 0.917 | nan | 0.9529 | 0.9496 | 0.9538 | | gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 0.9704 | 0.9385 | 0.944 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8649 | 0.9376 | 0.9419 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8982 | 0.9351 | 0.9376 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | 0.8539 | 0.928 | 0.9992 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.8606 | 0.9272 | 0.982 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9269 | 0.9548 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 0.8229 | 0.915 | 0.9873 | | volo_d1_224 | 64 | 0.996 | 0.9213 | nan | 0.7472 | 0.9124 | 0.9172 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | nan | nan | 0.912 | 1.0039 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9302 | 0.9095 | 0.9161 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 
0.8242 | 0.9095 | 0.9831 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.8941 | 0.9058 | 0.956 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.8618 | 0.9051 | 0.9312 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9047 | 0.9157 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9014 | 1.0067 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8745 | 0.9007 | 0.9126 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | 0.9475 | 0.9006 | 0.951 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8954 | 0.899 | 0.9192 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.8961 | 0.9077 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3132 | 0.8403 | 0.896 | 0.9842 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8932 | 0.9249 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7524 | 0.8921 | 0.923 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.7604 | 0.8902 | 0.9143 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8762 | 0.8835 | 0.8875 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8611 | 0.881 | 0.9327 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7599 | 0.8617 | 0.8993 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | 0.745 | 0.8605 | 0.8702 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | 0.83 | 0.8585 | 0.9871 | | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | 0.7112 | 0.8575 | 0.9714 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.8416 | 0.8498 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | 0.6831 | 0.841 | 0.9711 | | crossvit_9_240 | 128 | 0.9884 | 0.8657 | nan | 0.7297 | 0.8274 | 0.9755 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | nan | 0.8169 | 0.8253 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.7873 | 
0.7954 | 0.9838 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.8234 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 0.6417 | 0.792 | 0.9866 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6573 | 0.7684 | 0.8011 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/o0f2Sc6.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/mU42hGW.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/QFVkuRt.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.
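
As a rough illustration of the two per-model ratios mentioned above, here is a minimal sketch. It assumes that speedup is eager latency divided by backend latency, and that the memory ratio is eager peak memory divided by backend peak memory (so values above 1.0 mean the backend is faster or uses less memory). The function names and the exact definitions are assumptions for illustration, not taken from the benchmark scripts.

```python
def speedup(eager_latency_s: float, backend_latency_s: float) -> float:
    """Speedup normalized against native PyTorch (assumed: eager / backend)."""
    return eager_latency_s / backend_latency_s


def memory_compression_ratio(eager_peak_bytes: float, backend_peak_bytes: float) -> float:
    """Peak memory ratio (assumed: eager peak / backend peak).

    Under this assumption, a value below 1.0 means the backend's peak
    memory is higher than eager's.
    """
    return eager_peak_bytes / backend_peak_bytes


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the tables in this thread.
    print(speedup(2.0, 1.0))                     # backend is twice as fast
    print(memory_compression_ratio(100, 125))    # backend uses 25% more memory
```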

Models that fail the accuracy checks are excluded before we compute speedup, compilation latency and memory footprint reduction.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 94%, 50/53 | 98%, 41/42  | 100%, 61/61 |
|       aot_eager        | 94%, 50/53 | 98%, 41/42  | 95%, 58/61  |
|     aot_cudagraphs     | 74%, 39/53 | 60%, 25/42  | 79%, 48/61  |
|      aot_nvfuser       | 60%, 32/53 |  0%, 0/42   | 80%, 49/61  |
|        inductor        | 85%, 45/53 | 93%, 39/42  | 93%, 57/61  |
| inductor_no_cudagraphs | 87%, 46/53 | 93%, 39/42  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
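As a rough sketch of how a cell such as `94%, 50/53` can be derived from per-model accuracy statuses: the `passrate` helper below is hypothetical, and treating only the exact status `pass` as a pass is an assumption, since the summary does not say how statuses like `pass_due_to_skip` are counted.

```python
from typing import Iterable


def passrate(statuses: Iterable[str]) -> str:
    """Format a pass rate as 'NN%, passed/total'."""
    statuses = list(statuses)
    total = len(statuses)
    # Assumption: only the exact status "pass" counts as passing.
    passed = sum(1 for s in statuses if s == "pass")
    pct = round(100 * passed / total) if total else 0
    return f"{pct}%, {passed}/{total}"
```

With 50 passing models out of 53 this yields `94%, 50/53`, matching the format used in the table.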

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.21x    |    1.05x    |    1.00x    |
|      aot_nvfuser       |   1.17x    |    0.0x     |    1.19x    |
|        inductor        |   1.84x    |    1.76x    |    1.41x    |
| inductor_no_cudagraphs |   1.38x    |    1.54x    |    1.37x    |
+------------------------+------------+-------------+-------------+
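For readers unfamiliar with the aggregate, a geometric mean over per-model speedups can be sketched as below. The function name is hypothetical, and dropping `0.0` entries (which the per-model tables appear to use for runs that failed) is an assumption about how failures are handled, not a statement about the actual dashboard script.

```python
import math
from typing import Iterable


def geomean_speedup(speedups: Iterable[float]) -> float:
    """Geometric mean of the positive speedups; 0.0 if none remain."""
    # Assumption: a speedup of 0.0 marks a model that did not run or was
    # excluded, so it is skipped instead of zeroing the whole product.
    vals = [s for s in speedups if s > 0.0]
    if not vals:
        return 0.0
    # Average in log space to avoid overflow on long lists.
    return math.exp(sum(math.log(v) for v in vals) / len(vals))
```

Unlike an arithmetic mean, this treats a 2x speedup and a 0.5x slowdown as cancelling out, which is why it is the usual choice for averaging ratios.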

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.95    |    2.59     |    2.20     |
|       aot_eager        |    8.21    |    13.14    |    11.44    |
|     aot_cudagraphs     |    8.47    |    16.12    |    21.40    |
|      aot_nvfuser       |   27.30    |     0.0     |    72.96    |
|        inductor        |   59.25    |    62.06    |    90.07    |
| inductor_no_cudagraphs |   60.72    |    56.93    |    87.93    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.42x    |    0.38x    |    0.32x    |
|      aot_nvfuser       |   0.83x    |    0.0x     |    0.85x    |
|        inductor        |   0.83x    |    0.91x    |    0.95x    |
| inductor_no_cudagraphs |   0.93x    |    1.08x    |    1.01x    |
+------------------------+------------+-------------+-------------+
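One plausible reading of "compression ratio (higher is better)" is native peak memory divided by backend peak memory. The sketch below assumes that definition; the function name and the byte counts in the usage note are illustrative only. In a real run the peaks could come from something like `torch.cuda.max_memory_allocated()`.

```python
def compression_ratio(native_peak_bytes: float, backend_peak_bytes: float) -> float:
    """Peak-memory compression ratio relative to native PyTorch.

    Assumed definition: native peak / backend peak, so a value above 1.0
    means the backend used less peak memory than native PyTorch.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return native_peak_bytes / backend_peak_bytes
```

Under this definition, a backend that peaks at 10 GB where native PyTorch peaks at 8 GB has a ratio of 0.8, i.e. it uses more memory.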

torchbench suite with amp precision

Performance speedup

~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| densenet121 | 4 | 1.0007 | 0.9055 | 2.4052 | 1.3965 | 5.7672 | 1.3388 |
| functorch_dp_cifar10 | 64 | 0.9988 | 0.9074 | 2.3977 | 1.1956 | 4.9229 | 1.3976 |
| timm_efficientdet | 1 | 0.9862 | 0.8034 | 0.0 | 0.0 | 4.6793 | 1.5596 |
| resnext50_32x4d | 8 | 1.0002 | 0.9499 | 1.8567 | 1.3157 | 3.5688 | 1.278 |
| BERT_pytorch | 16 | 1.0136 | 0.8341 | 0.0 | 0.0 | 3.2877 | 2.4015 |
| timm_vision_transformer | 8 | 1.0066 | 0.8492 | 1.7666 | 1.3598 | 3.2177 | 1.5522 |
| mobilenet_v3_large | 32 | 1.0091 | 1.0011 | 1.6663 | 1.4112 | 3.0377 | 1.4251 |
| drq | 1 | 1.0133 | 0.7933 | 1.9813 | 1.0842 | 2.9886 | 1.1682 |
| resnet18 | 16 | 1.0048 | 0.9886 | 1.6041 | 1.3566 | 2.7561 | 1.2534 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9988 | 0.9084 | 1.8646 | 1.2136 | 2.6804 | 1.4067 |
| dcgan | 32 | 0.9832 | 0.9069 | 1.6896 | 0.7378 | 2.5986 | 1.0705 |
| mnasnet1_0 | 32 | 0.9994 | 1.0127 | 1.3205 | 1.4002 | 2.5883 | 1.3644 |
| hf_T5_large | 2 | 1.0188 | 0.8486 | 0.0 | 0.0 | 2.4774 | 2.1438 |
| squeezenet1_1 | 32 | 0.998 | 0.9503 | 1.5586 | 1.1802 | 2.4541 | 1.3115 |
| hf_Albert | 8 | 1.0008 | 0.9523 | 0.7727 | 0.0 | 2.383 | 2.3241 |
| timm_efficientnet | 32 | 0.9607 | 0.8108 | 1.0943 | 1.1787 | 2.1213 | 1.3011 |
| lennard_jones | 1000 | 0.9747 | 0.7323 | 1.3002 | 1.0711 | 2.1181 | 1.0605 |
| pytorch_struct | 200 | 0.9866 | 0.7356 | 1.0124 | 1.2019 | 2.0781 | 1.2897 |
| hf_Bert | 4 | 1.0353 | 0.8574 | 0.9467 | 0.0 | 2.0076 | 1.867 |
| timm_resnest | 32 | 1.0032 | 1.0176 | 0.8422 | 1.3619 | 1.9125 | 1.6921 |
| hf_GPT2 | 4 | 1.0212 | 0.9836 | 0.0 | 0.0 | 1.8747 | 1.8434 |
| LearningToPaint | 96 | 0.9982 | 0.997 | 1.2268 | 1.3602 | 1.8664 | 1.3087 |
| hf_T5 | 8 | 0.9991 | 0.9446 | 0.0 | 0.0 | 1.8435 | 1.8464 |
| hf_Bart | 4 | 1.013 | 0.8257 | 0.9947 | 0.0 | 1.8283 | 1.7424 |
| resnet50 | 32 | 1.0022 | 1.0033 | 1.1198 | 1.3653 | 1.7601 | 1.3577 |
| speech_transformer | 32 | 1.005 | 0.8352 | 0.0 | 0.0 | 1.7184 | 1.7328 |
| shufflenet_v2_x1_0 | 128 | 1.0013 | 1.011 | 0.9852 | 1.3461 | 1.7064 | 1.4223 |
| soft_actor_critic | 256 | 0.9955 | 0.7373 | 1.3303 | 1.0479 | 1.6878 | 1.0522 |
| mobilenet_v2 | 96 | 0.9998 | 0.9885 | 0.7599 | 0.9415 | 1.5595 | 1.5162 |
| attention_is_all_you_need_pytorch | 256 | 1.0083 | 0.9031 | 0.0 | 0.0 | 1.5195 | 1.4737 |
| timm_nfnet | 128 | 0.999 | 0.9988 | 0.0 | 1.175 | 1.4917 | 1.427 |
| hf_DistilBert | 8 | 1.0027 | 0.9717 | 0.7312 | 0.0 | 1.4716 | 1.4478 |
| fastNLP_Bert | 6 | 0.9984 | 0.8833 | 0.7673 | 0.0 | 1.4694 | 1.4267 |
| pytorch_unet | 1 | 0.9998 | 0.9923 | 0.863 | 1.1549 | 1.3424 | 1.3159 |
| pytorch_stargan | 16 | 0.9953 | 1.02 | 0.9831 | 1.1314 | 1.3238 | 1.2624 |
| timm_regnet | 32 | 0.9805 | 0.9278 | 0.9078 | 1.1948 | 1.3204 | 1.2422 |
| timm_vovnet | 32 | 0.9219 | 0.8846 | 0.8671 | 1.138 | 1.2965 | 1.1443 |
| Super_SloMo | 6 | 0.9996 | 0.9963 | 0.8863 | 0.0 | 1.2894 | 1.2559 |
| vgg16 | 64 | 0.9997 | 0.9975 | 0.8574 | 0.9962 | 1.2726 | 1.2646 |
| Background_Matting | 4 | 1.0001 | 1.0185 | 0.8949 | 1.1153 | 1.2236 | 1.2084 |
| alexnet | 128 | 0.9989 | 0.9974 | 0.8148 | 1.0036 | 1.2138 | 1.2089 |
| hf_Reformer | 4 | 0.9945 | 1.0 | 0.9452 | 0.0 | 1.1578 | 1.1463 |
| hf_BigBird | 2 | 0.9895 | 0.9118 | 1.0551 | 0.0 | 1.1533 | 1.02 |
| timm_vision_transformer_large | 8 | 0.9999 | 0.9899 | 0.0 | 0.993 | 1.1506 | 1.1324 |
| yolov3 | 16 | 0.9997 | 0.9912 | 0.8028 | 0.9293 | 1.1029 | 1.0793 |
| tts_angular | 64 | 0.9732 | 0.9438 | 0.9915 | 1.0006 | 1.0239 | 1.0298 |
| demucs | 4 | 1.0004 | 0.9999 | 1.0019 | 1.0003 | 1.0006 | 0.9994 |
| nvidia_deeprecommender | 256 | 0.9989 | 0.9963 | 0.6967 | 0.9789 | 0.9891 | 1.0305 |
| hf_GPT2_large | 4 | 1.0004 | 0.9924 | 0.0 | 0.0 | 0.0 | 1.7538 |
| dlrm | 2048 | 1.0705 | 1.1661 | 0.0 | 0.0 | 0.0 | 0.0 |
| hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| tacotron2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass |
| yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| dlrm | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| tacotron2 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | 0.0000 |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| tts_angular | 2 | pass | pass | pass | pass | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
| timm_efficientdet | 1 | 20.2568 | 44.9257 | nan | nan | 506.1619 | 482.7267 |
| yolov3 | 16 | 3.0753 | 10.8969 | 14.8722 | 40.6957 | 441.9377 | 442.2167 |
| hf_T5_large | 2 | 13.6319 | 49.8254 | nan | nan | 229.3423 | 227.025 |
| speech_transformer | 32 | 1.9784 | 11.9084 | nan | nan | 160.8221 | 163.7258 |
| timm_vision_transformer | 8 | 0.9712 | 6.1408 | 8.186 | 14.1079 | 153.2063 | 150.3677 |
| attention_is_all_you_need_pytorch | 256 | 1.3133 | 9.8117 | nan | nan | 152.8667 | 149.1336 |
| timm_vision_transformer_large | 8 | 2.8381 | 20.539 | nan | 39.248 | 143.6777 | 145.7532 |
| timm_resnest | 32 | 0.6155 | 3.5441 | 4.9432 | 43.0325 | 133.5364 | 132.7415 |
| pytorch_stargan | 16 | 0.4005 | 2.8635 | 3.8317 | 7.1985 | 108.2007 | 110.9196 |
| BERT_pytorch | 16 | 1.751 | 10.3084 | nan | nan | 104.5928 | 105.0846 |
| pytorch_struct | 200 | 0.2747 | 1.1468 | 1.7965 | 5.4271 | 81.1639 | 97.6714 |
| fastNLP_Bert | 6 | 1.8325 | 9.4369 | 14.2174 | nan | 73.1299 | 71.7013 |
| hf_GPT2 | 4 | 1.5632 | 8.3618 | nan | nan | 68.2801 | 67.1098 |
| hf_Bart | 4 | 1.8333 | 11.8565 | 17.1832 | nan | 57.9756 | 55.1977 |
| hf_T5 | 8 | 2.2224 | 11.503 | nan | nan | 54.0425 | 52.434 |
| densenet121 | 4 | 2.3207 | 17.2755 | 25.975 | 126.4447 | 53.7683 | 52.2498 |
| hf_BigBird | 2 | 8.256 | 17.1759 | 37.3561 | nan | 49.9161 | 32.4436 |
| hf_Albert | 8 | 1.372 | 8.5864 | 12.6169 | nan | 47.6027 | 47.0628 |
| mobilenet_v3_large | 32 | 0.9924 | 6.6522 | 9.1224 | 72.4189 | 47.2931 | 47.1211 |
| hf_Bert | 4 | 1.6875 | 9.1713 | 12.6826 | nan | 47.1679 | 44.934 |
| timm_regnet | 32 | 2.4131 | 10.885 | 25.7165 | 61.7176 | 40.6208 | 38.8687 |
| timm_efficientnet | 32 | 1.9005 | 8.9344 | 19.4183 | 69.1817 | 38.2166 | 37.4503 |
| hf_Reformer | 4 | 2.5095 | 5.4098 | 10.0855 | nan | 36.274 | 31.5124 |
| hf_DistilBert | 8 | 0.6417 | 4.3991 | 8.6419 | nan | 36.0029 | 34.4521 |
| resnext50_32x4d | 8 | 1.017 | 6.5326 | 8.8348 | 36.8789 | 33.0991 | 32.6447 |
| timm_nfnet | 128 | 2.0405 | 9.7987 | nan | 38.7761 | 33.0862 | 32.2698 |
| resnet50 | 32 | 0.9534 | 6.4667 | 9.1686 | 40.9829 | 32.2705 | 30.9612 |
| mnasnet1_0 | 32 | 0.9236 | 6.2664 | 8.6174 | 43.8239 | 32.0749 | 31.4707 |
| timm_vovnet | 32 | 1.5435 | 5.7532 | 12.3402 | 31.1761 | 31.495 | 31.0663 |
| functorch_dp_cifar10 | 64 | 0.3955 | 2.6387 | 3.6111 | 6.4988 | 27.269 | 27.0683 |
| shufflenet_v2_x1_0 | 128 | 1.0466 | 6.9601 | 9.9451 | 37.4913 | 22.0414 | 21.3245 |
| resnet18 | 16 | 0.4597 | 2.5088 | 3.4368 | 23.4051 | 21.6736 | 22.9361 |
| Background_Matting | 4 | 1.0296 | 6.1631 | 9.1283 | 42.2202 | 20.7841 | 19.4964 |
| Super_SloMo | 6 | 1.0912 | 6.3965 | 8.6337 | nan | 20.3786 | 20.1479 |
| mobilenet_v2 | 96 | 0.8917 | 6.0674 | 8.7936 | 42.0052 | 20.1534 | 19.6406 |
| pytorch_unet | 1 | 0.472 | 2.8631 | 3.9843 | 26.121 | 9.8541 | 9.393 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.4377 | 2.9424 | 4.0751 | 4.9125 | 9.6837 | 9.512 |
| LearningToPaint | 96 | 0.478 | 2.5734 | 3.7702 | 30.3846 | 8.7607 | 8.2774 |
| squeezenet1_1 | 32 | 0.2678 | 1.4287 | 2.0574 | 6.6429 | 5.2996 | 5.0315 |
| nvidia_deeprecommender | 256 | 0.2144 | 0.6516 | 1.0085 | 2.9776 | 4.7662 | 4.6211 |
| vgg16 | 64 | 0.2004 | 0.9706 | 1.455 | 3.5647 | 4.4288 | 4.1163 |
| drq | 1 | 0.1632 | 0.6636 | 1.0549 | 4.402 | 4.2396 | 3.7289 |
| soft_actor_critic | 256 | 0.2134 | 0.4437 | 0.6851 | 2.0419 | 3.7184 | 3.0465 |
| alexnet | 128 | 0.1743 | 0.6065 | 0.9186 | 3.2161 | 3.4021 | 3.3315 |
| dcgan | 32 | 0.177 | 0.5579 | 0.8046 | 4.2462 | 3.0432 | 2.7794 |
| lennard_jones | 1000 | 0.1614 | 0.4458 | 0.632 | 1.4727 | 2.2724 | 2.1345 |
| tts_angular | 64 | 0.2307 | 0.3099 | 0.4268 | 1.0585 | 1.958 | 1.7436 |
| demucs | 4 | 0.3522 | 0.359 | 0.3559 | 0.3792 | 0.267 | 0.2785 |
| hf_GPT2_large | 4 | 5.5989 | 27.5163 | nan | nan | nan | 157.5729 |
| dlrm | 2048 | 0.4799 | 1.0515 | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2719 | 0.7887 | 1.2042 | 1.2318 |
| hf_Albert | 8 | 0.9814 | 0.9359 | 0.3273 | nan | 1.1576 | 1.4693 |
| Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3842 | nan | 1.0536 | 1.1475 |
| timm_nfnet | 128 | 0.9693 | 0.8982 | nan | 0.9445 | 1.0337 | 1.1245 |
| timm_efficientdet | 1 | 1.028 | 0.8404 | nan | nan | 1.0226 | 1.0403 |
| mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9117 | 1.0074 | 1.0232 |
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0002 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 |
| attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | nan | nan | 0.9829 | 1.1269 |
| BERT_pytorch | 16 | 1.0 | 0.8822 | nan | nan | 0.9728 | 1.1011 |
| hf_GPT2 | 4 | 0.9706 | 0.8625 | nan | nan | 0.9648 | 1.1252 |
| Background_Matting | 4 | 1.0138 | 0.9624 | 0.3723 | 0.9813 | 0.9316 | 0.9364 |
| hf_T5 | 8 | 0.9678 | 0.9371 | nan | nan | 0.9309 | 1.2521 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.999 | 0.8609 | 0.4238 | 0.8441 | 0.9289 | 0.982 |
| timm_regnet | 32 | 0.9953 | 0.8446 | 0.3493 | 0.85 | 0.9249 | 0.9292 |
| speech_transformer | 32 | 1.0017 | 0.9174 | nan | nan | 0.9066 | 0.9109 |
| hf_Bert | 4 | 0.9844 | 0.8677 | 0.3806 | nan | 0.9017 | 0.9414 |
| yolov3 | 16 | 0.9908 | 0.8381 | 0.3537 | 0.8244 | 0.8991 | 0.9038 |
| timm_vision_transformer_large | 8 | 0.9974 | 0.8357 | nan | 0.8494 | 0.879 | 0.9542 |
| timm_resnest | 32 | 0.9868 | 0.8711 | 0.3483 | 0.8623 | 0.8759 | 0.9953 |
| densenet121 | 4 | 0.9857 | 0.8678 | 0.3673 | 0.8376 | 0.8753 | 0.9535 |
| pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3571 | 0.8496 | 0.8678 | 0.8715 |
| fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8661 | 1.0348 |
| resnet50 | 32 | 0.9907 | 0.8629 | 0.3559 | 0.7995 | 0.8659 | 0.885 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3456 | 0.7589 | 0.8611 | 0.8951 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3573 | 0.8503 | 0.856 | 0.8927 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 |
| hf_DistilBert | 8 | 0.9505 | 0.8806 | 0.3229 | nan | 0.8387 | 0.9058 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.4994 | 0.7073 | 0.8283 | 0.8738 |
| hf_Bart | 4 | 0.9102 | 0.8321 | 0.3491 | nan | 0.8137 | 0.9762 |
| hf_BigBird | 2 | 0.9837 | 0.9784 | 0.4543 | nan | 0.8098 | 1.096 |
| alexnet | 128 | 0.951 | 0.7753 | 0.4793 | 0.7753 | 0.7974 | 0.9099 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3444 | 0.866 | 0.7918 | 0.8145 |
| pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4253 | 0.8882 | 0.7783 | 0.8847 |
| resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3882 | 0.8176 | 0.7644 | 0.7753 |
| mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3409 | 0.8207 | 0.7541 | 0.7741 |
| drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8308 | 0.752 | 0.9256 |
| timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3408 | 0.7742 | 0.7513 | 0.761 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.7172 | 0.7491 | 0.7534 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3827 | 0.6722 | 0.7295 | 0.8017 |
| soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4736 | 0.9149 | 0.7295 | 1.0367 |
| timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3917 | 0.8871 | 0.7151 | 0.7249 |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.3941 | 0.7276 | 0.6102 | 0.6257 |
| lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 1.0967 | 0.564 | 0.9991 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | 0.8452 | 0.4478 | 0.4806 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5082 | 0.4235 | 0.4307 |
| hf_Reformer | 4 | 0.3764 | 0.9847 | 0.3481 | nan | 0.3629 | 0.9878 |
| hf_GPT2_large | 4 | 0.9582 | 0.8645 | nan | nan | nan | 1.1351 |
| dlrm | 2048 | 0.7301 | 0.7306 | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
| tacotron2 | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~

huggingface suite with amp precision

Performance speedup

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0261 | 0.8388 | 0.0 | 0.0 | 4.8985 | 1.7002 |
| MobileBertForMaskedLM | 32 | 1.0182 | 0.8212 | 0.0 | 0.0 | 4.4522 | 1.8054 |
| MobileBertForQuestionAnswering | 64 | 1.0173 | 0.828 | 0.0 | 0.0 | 3.7208 | 1.8261 |
| MT5ForConditionalGeneration | 8 | 1.0207 | 0.862 | 0.0 | 0.0 | 3.6457 | 2.5163 |
| CamemBert | 1 | 1.0365 | 0.8361 | 1.763 | 0.0 | 3.6419 | 1.8243 |
| DistillGPT2 | 1 | 1.0307 | 0.8606 | 1.2652 | 0.0 | 2.7645 | 2.0213 |
| M2M100ForConditionalGeneration | 8 | 1.0377 | 0.8116 | 1.2351 | 0.0 | 2.4115 | 1.7558 |
| GPT2ForSequenceClassification | 4 | 1.0012 | 0.9681 | 0.0 | 0.0 | 2.148 | 2.1145 |
| PLBartForConditionalGeneration | 16 | 1.011 | 0.8331 | 1.0404 | 0.0 | 2.0049 | 1.774 |
| MegatronBertForQuestionAnswering | 16 | 1.0299 | 0.8553 | 1.2392 | 0.0 | 1.9843 | 1.8075 |
| ElectraForQuestionAnswering | 64 | 1.0001 | 0.9802 | 0.7682 | 0.0 | 1.9558 | 1.9073 |
| XGLMForCausalLM | 8 | 1.0108 | 0.8287 | 0.0 | 0.0 | 1.8404 | 1.6227 |
| MegatronBertForCausalLM | 16 | 1.0332 | 0.8552 | 0.989 | 0.0 | 1.8374 | 1.7637 |
| ElectraForCausalLM | 32 | 1.0002 | 0.9403 | 0.7075 | 0.0 | 1.8001 | 1.8004 |
| LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9808 | 0.7748 | 0.0 | 1.7382 | 1.6968 |
| MBartForConditionalGeneration | 16 | 1.0136 | 0.8352 | 0.0 | 0.0 | 1.7367 | 1.6287 |
| PegasusForConditionalGeneration | 16 | 1.0104 | 0.8297 | 0.9354 | 0.0 | 1.7092 | 1.5782 |
| T5Small | 1 | 1.0251 | 0.8758 | 0.0 | 0.0 | 1.6655 | 1.4565 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.8859 | 0.0 | 0.0 | 1.6534 | 1.6504 |
| AlbertForMaskedLM | 4 | 1.0001 | 0.8856 | 0.0 | 0.0 | 1.6385 | 1.6467 |
| Speech2Text2ForCausalLM | 128 | 1.0033 | 0.9338 | 0.7421 | 0.0 | 1.6375 | 1.6423 |
| OPTForCausalLM | 32 | 1.0142 | 0.9311 | 0.7691 | 0.0 | 1.6088 | 1.5958 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9392 | 0.0 | 0.0 | 1.6069 | 1.5888 |
| LayoutLMForMaskedLM | 16 | 1.0003 | 0.9714 | 0.753 | 0.0 | 1.586 | 1.5642 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.007 | 0.9236 | 0.0 | 0.0 | 1.4742 | 1.4408 |
| BartForConditionalGeneration | 2 | 1.0052 | 0.9074 | 0.0 | 0.0 | 1.4709 | 1.4385 |
| BartForCausalLM | 4 | 1.0013 | 0.9616 | 0.7533 | 0.0 | 1.4604 | 1.4593 |
| DistilBertForQuestionAnswering | 64 | 1.0003 | 0.9524 | 0.7402 | 0.0 | 1.4461 | 1.4039 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9836 | 0.7753 | 0.0 | 1.4344 | 1.3939 |
| BertForQuestionAnswering | 128 | 1.0 | 0.9757 | 0.7782 | 0.0 | 1.421 | 1.3988 |
| RobertaForCausalLM | 64 | 1.0004 | 0.9601 | 0.7491 | 0.0 | 1.4094 | 1.4059 |
| PLBartForCausalLM | 32 | 1.0073 | 0.941 | 0.7748 | 0.0 | 1.3293 | 1.3277 |
| BertForMaskedLM | 64 | 1.0003 | 0.9579 | 0.7352 | 0.0 | 1.3213 | 1.3142 |
| BlenderbotSmallForCausalLM | 64 | 1.0016 | 0.9177 | 0.6941 | 0.0 | 1.3089 | 1.3234 |
| DebertaForMaskedLM | 4 | 0.9308 | 0.7323 | 0.804 | 0.0 | 1.2934 | 1.1876 |
| DistilBertForMaskedLM | 64 | 1.0004 | 0.9399 | 0.6937 | 0.0 | 1.2735 | 1.2788 |
| MBartForCausalLM | 32 | 1.0044 | 0.9531 | 0.7503 | 0.0 | 1.2208 | 1.2239 |
| TrOCRForCausalLM | 32 | 1.0018 | 0.954 | 0.0 | 0.0 | 1.2156 | 1.2173 |
| PegasusForCausalLM | 32 | 1.0024 | 0.9517 | 0.7498 | 0.0 | 1.1992 | 1.2018 |
| BigBird | 1 | 0.9917 | 0.9182 | 1.0457 | 0.0 | 1.1507 | 1.0261 |
| DebertaForQuestionAnswering | 8 | 0.9958 | 0.7841 | 0.7225 | 0.0 | 1.1399 | 1.1711 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-------------+-------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| XGLMForCausalLM | 8 | 2.8838 | 17.8117 | nan | nan | 194.858 | 192.779 |
| DebertaForMaskedLM | 4 | 5.1245 | 13.1366 | 48.1582 | nan | 180.9537 | 111.1218 |
| DebertaForQuestionAnswering | 8 | 4.919 | 13.4612 | 49.4172 | nan | 170.8762 | 106.3058 |
| YituTechConvBert | 1 | 2.5334 | 14.306 | nan | nan | 140.1718 | 138.9927 |
| M2M100ForConditionalGeneration | 8 | 3.5198 | 23.236 | 35.7484 | nan | 139.9786 | 144.4965 |
| MobileBertForMaskedLM | 32 | 9.1132 | 41.3418 | nan | nan | 118.6315 | 113.5782 |
| MobileBertForQuestionAnswering | 64 | 9.0474 | 41.8023 | nan | nan | 101.2692 | 98.7701 |
| MT5ForConditionalGeneration | 8 | 3.487 | 17.0395 | nan | nan | 101.2338 | 92.3566 |
| MegatronBertForCausalLM | 16 | 3.6761 | 18.9528 | 27.4653 | nan | 75.4738 | 72.4095 |
| PegasusForConditionalGeneration | 16 | 3.4229 | 22.2052 | 34.924 | nan | 73.0347 | 69.7138 |
| MegatronBertForQuestionAnswering | 16 | 3.7886 | 18.9105 | 27.2742 | nan | 72.2236 | 70.2981 |
| BartForConditionalGeneration | 2 | 3.5687 | 22.6388 | nan | nan | 68.9738 | 67.2779 |
| MBartForConditionalGeneration | 16 | 3.6286 | 23.2015 | nan | nan | 66.7567 | 65.2675 |
| T5ForConditionalGeneration | 4 | 2.1607 | 11.4585 | nan | nan | 63.4196 | 64.7334 |
| LayoutLMForSequenceClassification | 16 | 1.8721 | 9.8564 | 13.681 | nan | 62.7043 | 61.6224 |
| T5Small | 1 | 2.1807 | 11.4387 | nan | nan | 60.7468 | 59.8897 |
| PLBartForConditionalGeneration | 16 | 1.867 | 11.6157 | 17.1196 | nan | 53.589 | 52.3927 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.2026 | 15.2011 | nan | nan | 51.6907 | 50.4226 |
| BigBird | 1 | 8.104 | 17.4303 | 37.7102 | nan | 49.115 | 32.2434 |
| ElectraForCausalLM | 32 | 1.7431 | 9.4797 | 13.4844 | nan | 48.0664 | 46.9807 |
| LayoutLMForMaskedLM | 16 | 1.854 | 9.957 | 14.0422 | nan | 40.4533 | 38.52 |
| BertForMaskedLM | 64 | 1.6434 | 9.1451 | 13.0884 | nan | 40.4402 | 39.1127 |
| ElectraForQuestionAnswering | 64 | 1.7135 | 9.4156 | 12.908 | nan | 37.6691 | 37.1638 |
| RobertaForCausalLM | 64 | 1.6218 | 9.5038 | 13.2553 | nan | 35.1188 | 35.2797 |
| GPT2ForSequenceClassification | 4 | 1.6102 | 8.4573 | nan | nan | 34.7604 | 33.9281 |
| PegasusForCausalLM | 32 | 1.2933 | 8.4453 | 12.7766 | nan | 33.7181 | 32.0536 |
| BertForQuestionAnswering | 128 | 1.6308 | 9.5382 | 12.9968 | nan | 32.5755 | 31.8769 |
| MBartForCausalLM | 32 | 1.2688 | 8.6057 | 12.3507 | nan | 30.8586 | 30.114 |
| TrOCRForCausalLM | 32 | 1.2698 | 8.5239 | nan | nan | 30.6852 | 29.5408 |
| DistillGPT2 | 1 | 0.777 | 4.2461 | 6.035 | nan | 30.1705 | 27.0712 |
| BartForCausalLM | 4 | 1.2858 | 8.7823 | 12.4256 | nan | 30.0352 | 28.6798 |
| AlbertForMaskedLM | 4 | 1.4898 | 8.9688 | nan | nan | 29.804 | 28.8283 |
| RobertaForQuestionAnswering | 128 | 1.5995 | 9.5258 | 13.6496 | nan | 29.1998 | 28.6459 |
| AlbertForQuestionAnswering | 4 | 1.4738 | 8.9295 | nan | nan | 28.742 | 27.1453 |
| DistilBertForMaskedLM | 64 | 0.5761 | 4.5213 | 8.775 | nan | 28.6222 | 27.9459 |
| BlenderbotSmallForCausalLM | 64 | 0.8712 | 5.8154 | 8.3073 | nan | 27.9038 | 27.3704 |
| DistilBertForQuestionAnswering | 64 | 0.6752 | 4.5682 | 8.8076 | nan | 27.8616 | 26.8576 |
| OPTForCausalLM | 32 | 1.329 | 8.7064 | 19.8936 | nan | 26.9542 | 27.1551 |
| CamemBert | 1 | 1.7806 | 9.6387 | 13.0603 | nan | 26.2214 | 25.541 |
| Speech2Text2ForCausalLM | 128 | 0.6961 | 4.3732 | 7.0361 | nan | 22.7181 | 21.5376 |
| PLBartForCausalLM | 32 | 0.7107 | 4.4018 | 6.2174 | nan | 22.5965 | 21.7864 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.5536 |
| AlbertForMaskedLM | 4 | 0.9998 | 0.7431 | nan | nan | 1.1078 | 1.5319 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3619 | nan | 1.0943 | 1.1562 |
| GPT2ForSequenceClassification | 4 | 0.9675 | 0.9164 | nan | nan | 1.0779 | 1.1637 |
| PegasusForCausalLM | 32 | 0.9749 | 0.8906 | 0.4175 | nan | 1.0189 | 1.0913 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0005 | 1.0676 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0005 | 1.0676 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9594 | nan | nan | 0.995 | 1.2292 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 0.9943 | 1.0278 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 0.9938 | 1.0704 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9133 | nan | nan | 0.9913 | 1.1976 |
| T5Small | 1 | 1.0 | 0.9124 | nan | nan | 0.9874 | 1.15 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3549 | nan | 0.9871 | 1.0263 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | 0.3782 | nan | 0.9868 | 1.0636 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3628 | nan | 0.9811 | 1.0366 |
| RobertaForCausalLM | 64 | 0.9991 | 0.8994 | 0.3626 | nan | 0.9801 | 1.0358 |
| OPTForCausalLM | 32 | 0.9996 | 0.8679 | 0.3481 | nan | 0.9718 | 1.0617 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | nan | nan | 0.9642 | 1.0376 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | nan | nan | 0.9593 | 1.1105 |
| DistilBertForMaskedLM | 64 | 0.9999 | 0.8599 | 0.3477 | nan | 0.948 | 1.0272 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8196 | 0.3532 | nan | 0.946 | 1.0737 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8695 | nan | nan | 0.939 | 1.0986 |
| ElectraForCausalLM | 32 | 0.9996 | 0.848 | 0.3553 | nan | 0.9319 | 1.0177 |
| BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | 0.3597 | nan | 0.9269 | 1.0441 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3722 | nan | 0.9214 | 1.0168 |
| MegatronBertForCausalLM | 16 | 0.9998 | 0.8499 | 0.3975 | nan | 0.921 | 1.0277 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | nan | nan | 0.919 | 0.919 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9628 | 0.4377 | nan | 0.9159 | 1.0984 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3465 | nan | 0.9129 | 1.0128 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0093 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.9007 | 0.3949 | nan | 0.8775 | 1.0294 |
| CamemBert | 1 | 0.9989 | 0.7872 | 0.4083 | nan | 0.8654 | 0.9312 |
| YituTechConvBert | 1 | 0.9718 | 0.7819 | nan | nan | 0.8618 | 0.9318 |
| BigBird | 1 | 1.0008 | 0.9547 | 0.4478 | nan | 0.8348 | 1.1036 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9234 | nan | nan | 0.8333 | 1.0324 |
| DistillGPT2 | 1 | 0.9963 | 0.7527 | 0.3883 | nan | 0.8288 | 1.0239 |
| M2M100ForConditionalGeneration | 8 | 0.9967 | 0.9427 | 0.4275 | nan | 0.7774 | 1.0309 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | nan | nan | 0.6997 | 0.9454 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | nan | nan | 0.6085 | 0.8221 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9824 | 0.3623 | nan | 0.4498 | 1.1123 |
| DebertaForQuestionAnswering | 8 | 0.9754 | 1.0737 | 0.3252 | nan | 0.3361 | 1.1932 |
| AllenaiLongformerBase | 0 | nan | nan |
nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | xcit_large_24_p8_224 | 5 | 1.0024 | 0.0 | 0.0 | 0.0 | 2.1824 | 1.8642 | | tnt_s_patch16_224 | 128 | 1.0001 | 0.9971 | 0.0 | 1.948 | 2.1266 | 2.0956 | | regnety_002 | 128 | 0.9778 | 0.9331 | 1.1329 | 1.3739 | 2.1185 | 1.4615 | | lcnet_050 | 128 | 0.9687 | 0.945 | 0.8523 | 1.5942 | 2.0966 | 1.6246 | | ghostnet_100 | 128 | 1.0047 | 0.9954 | 0.9024 | 1.5531 | 2.0772 | 1.7315 | | twins_pcpvt_base | 64 | 1.007 | 0.9131 | 0.9227 | 1.3661 | 1.7575 | 1.6661 | | res2net101_26w_4s | 64 | 1.0023 | 0.9884 | 0.9714 | 1.4016 | 1.606 | 1.3414 | | volo_d1_224 | 64 | 0.9998 | 0.9948 | 0.0 | 1.1408 | 1.6008 | 1.5636 | | hrnet_w18 | 128 | 1.0034 | 1.0224 | 0.8621 | 1.4764 | 1.5939 | 1.4836 | | dla102 | 128 | 1.0001 | 0.9957 | 0.8371 | 1.4147 | 1.5778 | 1.5432 | | gmixer_24_224 | 128 | 0.9999 | 0.88 | 0.0 | 0.9998 | 1.5532 | 1.5487 | | nfnet_l0 | 128 | 0.9999 | 0.8113 | 0.7132 | 1.0386 | 1.5398 | 1.4649 | | resnest101e | 64 | 0.9998 | 0.9918 | 0.813 | 1.25 | 1.5192 | 1.5067 | | gmlp_s16_224 | 128 | 0.9999 | 0.9964 | 0.0 | 1.0513 | 1.5183 | 1.524 | | gluon_inception_v3 | 128 | 1.0 | 0.9963 | 0.8531 | 1.1944 | 1.5053 | 1.4695 | | adv_inception_v3 | 128 | 1.0 | 0.9965 | 0.8532 | 1.1949 | 1.5044 | 1.4699 | | inception_v3 | 128 | 0.9999 | 0.9963 | 0.8532 | 1.1947 | 1.503 | 1.4668 | | dm_nfnet_f0 | 128 | 0.9988 | 1.0 | 0.0 | 1.1763 | 1.4955 | 1.4275 | | swin_base_patch4_window7_224 | 64 | 0.9996 | 0.9572 | 0.0 | 1.04 | 1.4854 | 1.4803 | | res2net50_14w_8s | 128 | 0.9999 | 0.9936 | 0.8086 | 1.2808 | 1.4708 | 1.4281 | | cait_m36_384 | 4 | 1.0003 | 1.0099 | 0.0 | 1.0348 | 1.4624 | 1.4148 | | mobilenetv3_large_100 | 
128 | 0.9727 | 0.9456 | 0.7823 | 1.3423 | 1.4581 | 1.4361 | | crossvit_9_240 | 128 | 1.0 | 0.9951 | 0.8375 | 1.0597 | 1.4522 | 1.4158 | | selecsls42b | 128 | 0.9998 | 0.9958 | 0.8397 | 1.358 | 1.4435 | 1.411 | | coat_lite_mini | 128 | 1.0001 | 0.9895 | 0.8423 | 1.2195 | 1.4234 | 1.4001 | | fbnetv3_b | 128 | 0.9542 | 0.9408 | 0.7728 | 1.2662 | 1.4175 | 1.4082 | | res2next50 | 128 | 0.9996 | 0.9957 | 0.8309 | 1.2102 | 1.4136 | 1.35 | | resmlp_12_224 | 128 | 1.0001 | 0.9978 | 0.7824 | 0.0 | 1.4002 | 1.35 | | mobilenetv2_100 | 128 | 0.9513 | 0.9412 | 0.7198 | 0.8654 | 1.4001 | 1.4317 | | jx_nest_base | 32 | 0.9997 | 0.9927 | 0.0 | 1.2258 | 1.4 | 1.3681 | | mnasnet_100 | 128 | 0.9545 | 0.9444 | 0.7881 | 1.359 | 1.3888 | 1.4578 | | mobilevit_s | 64 | 0.9734 | 0.8144 | 0.6515 | 1.1125 | 1.3767 | 1.3634 | | ese_vovnet19b_dw | 128 | 0.9691 | 0.9649 | 0.7682 | 1.2449 | 1.3751 | 1.3768 | | spnasnet_100 | 128 | 0.9456 | 0.9381 | 0.7758 | 1.3141 | 1.3666 | 1.392 | | pit_b_224 | 64 | 0.9999 | 0.9957 | 0.822 | 1.0626 | 1.3587 | 1.3526 | | fbnetc_100 | 128 | 0.9519 | 0.9449 | 0.792 | 1.3759 | 1.3517 | 1.3733 | | tf_efficientnet_b0 | 128 | 0.965 | 0.808 | 0.6661 | 1.0954 | 1.3471 | 1.3552 | | convit_base | 64 | 1.0001 | 0.9974 | 0.0 | 0.0 | 1.3465 | 1.3636 | | cspdarknet53 | 64 | 0.9428 | 0.9336 | 0.7555 | 0.9018 | 1.3293 | 1.3464 | | poolformer_m36 | 64 | 0.9997 | 0.998 | 0.8072 | 0.0 | 1.3267 | 1.2959 | | botnet26t_256 | 128 | 0.9796 | 0.974 | 0.8095 | 1.3452 | 1.3261 | 1.331 | | pnasnet5large | 16 | 1.0054 | 1.0281 | 0.853 | 1.1408 | 1.3191 | 1.2954 | | eca_botnext26ts_256 | 128 | 0.9809 | 0.8117 | 0.6713 | 1.1566 | 1.2931 | 1.2825 | | beit_base_patch16_224 | 64 | 1.0 | 0.9784 | 0.0 | 1.0451 | 1.2869 | 1.2659 | | mixer_b16_224 | 128 | 1.0002 | 0.9976 | 0.805 | 0.9603 | 1.285 | 1.2738 | | rexnet_100 | 128 | 0.9647 | 0.8505 | 0.6903 | 1.038 | 1.2843 | 1.2728 | | deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9918 | 0.7974 | 1.0603 | 1.2826 | 1.2616 | | tinynet_a | 128 | 0.9692 
| 0.8005 | 0.6569 | 1.0896 | 1.2591 | 1.2658 | | visformer_small | 128 | 0.9995 | 1.0024 | 0.8402 | 1.0846 | 1.2371 | 1.1814 | | eca_halonext26ts | 128 | 0.9809 | 0.8168 | 0.6792 | 1.1486 | 1.2156 | 0.0 | | sebotnet33ts_256 | 64 | 0.9661 | 0.8367 | 0.6798 | 1.1159 | 1.2009 | 1.2101 | | vit_base_patch16_224 | 64 | 1.0001 | 0.9941 | 0.8348 | 0.9937 | 1.1955 | 1.183 | | tf_mixnet_l | 128 | 0.9806 | 0.9091 | 0.7898 | 1.0562 | 1.1937 | 1.191 | | mixnet_l | 128 | 0.9795 | 0.9053 | 0.7943 | 1.0634 | 1.183 | 1.1779 | | dpn107 | 32 | 0.9425 | 0.9346 | 0.7546 | 0.9966 | 1.1638 | 1.1762 | | gluon_xception65 | 32 | 1.001 | 0.9834 | 0.7547 | 1.0649 | 1.1602 | 1.1251 | | repvgg_a2 | 128 | 0.9434 | 0.9339 | 0.798 | 1.1317 | 1.1399 | 1.1566 | | swsl_resnext101_32x16d | 32 | 1.0002 | 0.9812 | 0.81 | 1.0749 | 1.1328 | 1.0569 | | gernet_l | 128 | 0.9466 | 0.9383 | 0.7683 | 1.1433 | 1.0676 | 1.0767 | | convmixer_768_32 | 32 | 0.9999 | 0.9982 | 0.9231 | 1.0533 | 1.0557 | 1.0507 | | convnext_base | 64 | 0.9994 | 0.9949 | 0.0 | 1.2018 | 0.6659 | 0.6626 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | jx_nest_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 
| pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | resnest101e | 2 | pass | pass | pass | fail_accuracy | pass | pass | | gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass | | gmlp_s16_224 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_accuracy | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | 
cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilevit_s | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | pass | fail_to_run | fail_accuracy | | gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+---------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ 
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | twins_pcpvt_base | 64 | 2.8295 | 19.9737 | 31.9569 | 73.4381 | 437.2727 | 453.1626 | | coat_lite_mini | 128 | 1.2392 | 7.1588 | 10.6151 | 32.947 | 418.2742 | 410.7454 | | mobilevit_s | 64 | 1.8987 | 9.8143 | 19.6339 | 64.0554 | 338.6094 | 330.3378 | | sebotnet33ts_256 | 64 | 1.7391 | 7.9521 | 17.5542 | 68.8366 | 326.7606 | 326.3663 | | eca_halonext26ts | 128 | 1.5305 | 6.6245 | 13.4204 | 66.3073 | 303.6995 | nan | | eca_botnext26ts_256 | 128 | 1.472 | 6.254 | 13.0608 | 62.6602 | 264.9251 | 255.2382 | | xcit_large_24_p8_224 | 5 | 3.3668 | nan | nan | nan | 199.5789 | 197.7456 | | botnet26t_256 | 128 | 1.4261 | 5.6805 | 12.7039 | 50.5016 | 192.9984 | 192.1813 | | swin_base_patch4_window7_224 | 64 | 3.0112 | 16.9517 | nan | 72.5605 | 185.2649 | 180.2708 | | jx_nest_base | 32 | 1.8521 | 12.2298 | nan | 52.0403 | 173.585 | 173.3859 | | convnext_base | 64 | 1.6052 | 9.4704 | nan | 36.1856 | 160.721 | 155.0981 | | cait_m36_384 | 4 | 3.5529 | 25.7848 | nan | 62.8823 | 149.5356 | 148.1344 | | hrnet_w18 | 128 | 6.912 | 42.085 | 73.7728 | 458.635 | 130.7424 | 123.9546 | | resnest101e | 64 | 3.5049 | 22.7989 | 35.9505 | 108.9619 | 128.6373 | 119.2999 | | crossvit_9_240 | 128 | 1.7143 | 11.4764 | 17.0618 | 37.2511 | 125.1712 | 122.8947 | | volo_d1_224 | 64 | 1.3509 | 10.3183 | nan | 39.4754 | 106.0502 | 104.6853 | | pnasnet5large | 16 | 5.0431 | 30.6895 | 52.5155 | 192.2498 | 104.7165 | 100.8943 | | visformer_small | 128 | 1.012 | 5.2996 | 7.8662 | 30.9644 | 95.0057 | 93.4595 | | pit_b_224 | 64 | 1.1784 | 7.1844 | 10.8127 | 25.351 | 89.8326 | 89.2225 | | gmlp_s16_224 | 128 | 1.3138 | 9.9217 | nan | 22.6324 | 78.2899 | 
76.1123 | | res2net101_26w_4s | 64 | 3.3603 | 23.0953 | 36.8235 | 120.3506 | 68.8668 | 64.0399 | | tnt_s_patch16_224 | 128 | 1.8667 | 14.5415 | nan | 38.6474 | 65.3885 | 60.6268 | | res2net50_14w_8s | 128 | 2.9853 | 20.3292 | 31.8737 | 139.1154 | 62.7041 | 58.2307 | | gmixer_24_224 | 128 | 1.5031 | 11.0912 | nan | 28.4307 | 60.5123 | 58.8985 | | convit_base | 64 | 1.2706 | 8.2016 | nan | nan | 59.2515 | 57.867 | | gluon_xception65 | 32 | 2.3465 | 15.3302 | 23.0085 | 64.4876 | 54.7008 | 51.1978 | | poolformer_m36 | 64 | 1.9528 | 11.6525 | 17.9511 | nan | 51.8802 | 48.9452 | | swsl_resnext101_32x16d | 32 | 1.9419 | 13.0524 | 19.494 | 53.4265 | 48.2901 | 45.2918 | | dpn107 | 32 | 4.17 | 18.515 | 53.4895 | 102.1895 | 48.0665 | 45.8321 | | fbnetv3_b | 128 | 3.5692 | 14.6862 | 37.7741 | 101.1168 | 44.3318 | 41.8336 | | vit_base_patch16_224 | 64 | 1.0368 | 6.6764 | 8.9831 | 14.3711 | 44.3093 | 42.9548 | | deit_base_distilled_patch16_224 | 64 | 1.068 | 6.3507 | 9.2566 | 15.1017 | 44.0566 | 43.5425 | | resmlp_12_224 | 128 | 0.7512 | 4.3433 | 7.979 | nan | 42.344 | 45.1063 | | mixer_b16_224 | 128 | 0.9015 | 5.0322 | 8.4324 | 16.6798 | 41.9174 | 40.6346 | | tf_mixnet_l | 128 | 5.8789 | 15.9108 | 33.3269 | 87.9298 | 41.8465 | 38.5803 | | inception_v3 | 128 | 1.7317 | 11.9334 | 17.8615 | 99.7468 | 40.8376 | 39.0428 | | gluon_inception_v3 | 128 | 1.7562 | 12.1734 | 17.6723 | 99.8994 | 40.6634 | 38.1773 | | adv_inception_v3 | 128 | 1.7536 | 11.9695 | 17.8052 | 99.6723 | 40.4868 | 38.1699 | | ghostnet_100 | 128 | 3.1544 | 12.7088 | 18.0779 | 91.6621 | 39.6488 | 36.863 | | beit_base_patch16_224 | 64 | 1.3538 | 7.5436 | nan | 18.3732 | 39.4215 | 38.2418 | | mixnet_l | 128 | 5.5244 | 15.4903 | 32.9421 | 87.3631 | 39.3911 | 37.456 | | dla102 | 128 | 1.9687 | 13.3249 | 20.2285 | 87.1116 | 39.2475 | 37.7842 | | convmixer_768_32 | 32 | 1.5356 | 8.8642 | 13.0816 | 18.3079 | 38.6408 | 36.5099 | | res2next50 | 128 | 1.8331 | 11.5784 | 17.2442 | 86.7869 | 36.482 | 33.143 | | dm_nfnet_f0 | 
128 | 2.1941 | 9.5482 | nan | 39.4523 | 34.5198 | 32.6655 | | rexnet_100 | 128 | 2.0888 | 9.6586 | 21.6599 | 117.7439 | 32.163 | 31.0996 | | tinynet_a | 128 | 2.2105 | 10.5963 | 24.7563 | 79.6911 | 31.7316 | 30.1822 | | cspdarknet53 | 64 | 2.3986 | 9.7328 | 23.8357 | 41.1186 | 28.5675 | 26.8411 | | tf_efficientnet_b0 | 128 | 1.9705 | 8.9001 | 20.3522 | 78.4259 | 27.4314 | 26.2041 | | nfnet_l0 | 128 | 1.9057 | 9.4882 | 13.6435 | 35.5309 | 26.9764 | 25.2746 | | fbnetc_100 | 128 | 2.1718 | 8.7703 | 21.6369 | 60.2288 | 26.2368 | 25.3805 | | spnasnet_100 | 128 | 2.1772 | 8.5425 | 21.2303 | 57.7374 | 26.0878 | 24.6668 | | mobilenetv3_large_100 | 128 | 1.7449 | 7.4376 | 16.7432 | 82.6857 | 24.4636 | 23.2897 | | mobilenetv2_100 | 128 | 1.7028 | 7.0597 | 16.6933 | 41.1163 | 22.69 | 21.276 | | mnasnet_100 | 128 | 1.7191 | 7.0945 | 16.3612 | 51.0751 | 22.6162 | 20.8213 | | gernet_l | 128 | 2.0842 | 8.2421 | 19.7152 | 44.6438 | 22.1882 | 21.2112 | | regnety_002 | 128 | 1.7361 | 7.658 | 17.1152 | 56.7209 | 22.0035 | 20.9952 | | repvgg_a2 | 128 | 2.0742 | 7.9069 | 18.9615 | 62.6982 | 21.7179 | 20.452 | | selecsls42b | 128 | 0.9022 | 5.3253 | 7.8206 | 50.9076 | 19.4225 | 18.3737 | | lcnet_050 | 128 | 1.1115 | 4.4334 | 8.8827 | 38.6244 | 15.6594 | 14.8776 | | ese_vovnet19b_dw | 128 | 1.1125 | 4.219 | 8.3426 | 39.3669 | 15.3676 | 14.1028 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | tinynet_a | 128 | 0.9889 | 0.7884 | 0.2764 | 0.7887 | 1.3707 | 1.4015 | | gmixer_24_224 | 128 | 0.9926 | 0.9699 | nan | 
0.9029 | 1.3139 | 1.3772 | | gmlp_s16_224 | 128 | 0.9937 | 0.9715 | nan | 0.9188 | 1.2841 | 1.2998 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2666 | 0.8392 | 1.173 | 1.1918 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3633 | 1.1722 | 1.1608 | 1.2789 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | 0.7848 | 1.1578 | 1.2186 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.285 | 0.8648 | 1.1475 | 1.1687 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2669 | 0.776 | 1.1068 | 1.2102 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.267 | 0.7762 | 1.1053 | nan | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1022 | 1.1162 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | 0.9418 | 1.0703 | 1.1492 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | 0.9685 | 1.0556 | 1.0626 | | convit_base | 64 | 0.9966 | 0.8516 | nan | nan | 1.0529 | 1.1534 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | nan | 0.9443 | 1.0336 | 1.124 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2681 | 0.8142 | 1.0332 | 1.0762 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | nan | 0.8587 | 1.0138 | 1.0718 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9129 | 1.0048 | 1.021 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | 0.9298 | 1.0004 | 1.0447 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | 0.9714 | 0.9746 | 0.9788 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 0.8179 | 0.9746 | 1.2067 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | 0.802 | 0.9699 | 1.0818 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | 0.79 | 0.9645 | 0.9776 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.9026 | 0.9489 | 0.9832 | | dla102 | 128 | 0.9694 | 0.912 | 0.3363 | 0.9381 | 0.9431 | 0.9502 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3469 | 0.8884 | 0.9382 | 1.0521 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9319 | 0.9931 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2875 | 0.8365 | 0.9314 | 1.0486 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | 0.9442 | 0.929 | 0.9775 | | 
swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | 0.8487 | 0.9113 | 0.9354 | | mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 0.7555 | 0.9089 | 0.9818 | | dpn107 | 32 | 0.997 | 0.9097 | 0.3531 | 0.8814 | 0.9069 | 0.9596 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3348 | 0.8581 | 0.8969 | 0.938 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | 0.8524 | 0.8964 | 0.9224 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.8641 | 0.8948 | 0.916 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8772 | 0.8927 | 0.9188 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | 0.8854 | 0.8924 | 0.8971 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3594 | 0.8801 | 0.8916 | 0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 0.8794 | 0.8911 | 0.8966 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.9146 | 0.8905 | 0.9028 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8538 | 0.8845 | 0.8998 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8538 | 0.8845 | 0.8998 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8538 | 0.8845 | 0.8998 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.8299 | 0.876 | 0.9007 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8285 | 0.8697 | 0.8972 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2716 | 0.7737 | 0.8653 | 0.9722 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8158 | 0.8621 | 0.8897 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8602 | 0.8784 | | cspdarknet53 | 64 | 0.9915 | 0.8405 | 0.3241 | 0.7908 | 0.8512 | 0.8583 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7708 | 0.8503 | 0.898 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3348 | 0.8252 | 0.8503 | 0.8698 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3306 | 0.7352 | 0.8387 | 0.8542 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | 0.7559 | 0.8309 | 0.8769 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7515 | 0.8245 | 0.8627 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 
0.8842 | 0.8174 | 1.0986 | | convnext_base | 64 | 1.003 | 0.9263 | nan | 0.7349 | 0.8166 | 0.9866 | | resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | nan | 0.8092 | 0.8236 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 0.6593 | 0.8006 | 1.035 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.6789 | 0.7905 | 0.8278 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | 0.8451 | 0.7566 | 0.9252 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.7354 | 0.7449 | 0.8293 | | jx_nest_base | 32 | 0.9983 | 0.8927 | nan | 0.86 | 0.6708 | 0.8619 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/a7M7Sx8.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/XQX6Okc.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/jqS2Nq1.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs, and each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint compression ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring performance, compilation latency, and memory footprint reduction, we exclude the models that fail the accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 54/55 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 95%, 52/55 | 100%, 43/43 | 98%, 60/61  |
|     aot_cudagraphs     | 73%, 40/55 | 47%, 20/43  | 39%, 24/61  |
|      aot_nvfuser       | 58%, 32/55 |  2%, 1/43   | 89%, 54/61  |
|        inductor        | 87%, 48/55 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 91%, 50/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
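
For readers who want to reproduce the "percent, passed/total" cells from the per-model accuracy tables, here is a minimal sketch. It is not the dashboard's actual script: the function name `pass_rate` is hypothetical, and it assumes that every status beginning with `pass` (including `pass_due_to_skip`) counts as a pass and that percentages are rounded to the nearest integer.

```python
def pass_rate(statuses):
    """Summarize a list of per-model accuracy statuses.

    Returns (passed, total, label), where label looks like "95%, 58/61".
    """
    total = len(statuses)
    # Assumption: any status starting with "pass" is a pass;
    # "fail_to_run" and "fail_accuracy" are failures.
    passed = sum(1 for status in statuses if status.startswith("pass"))
    percent = round(100 * passed / total) if total else 0
    return passed, total, f"{percent}%, {passed}/{total}"


# Hypothetical status column: 58 passing models and 3 failing ones.
example = ["pass"] * 58 + ["fail_to_run", "fail_accuracy", "fail_accuracy"]
print(pass_rate(example)[2])
```

With 58 passing entries out of 61, the label has the same shape as the inductor cell for timm_models in the table above.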

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.09x    |    1.02x    |    1.00x    |
|      aot_nvfuser       |   1.13x    |    1.12x    |    1.11x    |
|        inductor        |   1.48x    |    1.28x    |    1.25x    |
| inductor_no_cudagraphs |   1.22x    |    1.21x    |    1.24x    |
+------------------------+------------+-------------+-------------+
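
The geometric mean over per-model speedups can be sketched as follows. This is an illustration under stated assumptions rather than the dashboard's own code: the helper `geomean_speedup` and the model names are hypothetical, models whose accuracy status is not `pass` are dropped (as described in the summary), and a speedup of `0.0` is treated as "did not run" and dropped as well.

```python
import math


def geomean_speedup(speedups, statuses):
    """Geometric mean of speedups over models that passed accuracy.

    speedups: dict of model name -> speedup relative to native PyTorch
    statuses: dict of model name -> accuracy status string
    """
    kept = [
        value
        for name, value in speedups.items()
        # Assumption: only "pass" models count, and 0.0 marks a failed run.
        if statuses.get(name) == "pass" and value > 0
    ]
    if not kept:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow
    # from multiplying many values together.
    return math.exp(sum(math.log(value) for value in kept) / len(kept))


# Hypothetical models: model_c fails accuracy and is excluded.
speedups = {"model_a": 2.0, "model_b": 0.5, "model_c": 4.0}
statuses = {"model_a": "pass", "model_b": "pass", "model_c": "fail_accuracy"}
result = geomean_speedup(speedups, statuses)
```

A geometric mean is the usual choice for ratios such as speedups because a 2x gain on one model and a 0.5x slowdown on another cancel out to 1x, which an arithmetic mean would not show.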

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.08    |    2.22     |    1.88     |
|       aot_eager        |    6.92    |    9.05     |    8.70     |
|     aot_cudagraphs     |    8.23    |    18.64    |    15.25    |
|      aot_nvfuser       |   20.32    |    9.60     |    50.01    |
|        inductor        |   62.17    |    52.98    |    73.89    |
| inductor_no_cudagraphs |   64.61    |    49.17    |    72.74    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.32x    |
|      aot_nvfuser       |   0.83x    |    1.08x    |    0.84x    |
|        inductor        |   0.82x    |    0.72x    |    0.97x    |
| inductor_no_cudagraphs |   0.94x    |    0.96x    |    1.02x    |
+------------------------+------------+-------------+-------------+
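
The post does not spell out how the compression ratio is computed. A minimal sketch, assuming it is the native PyTorch peak memory divided by the backend's peak memory (which matches "higher is better"), could look like this; the function name and the byte counts are hypothetical.

```python
def compression_ratio(baseline_peak_bytes, backend_peak_bytes):
    """Peak-memory compression ratio of a backend versus the baseline.

    Assumption: ratio = baseline peak / backend peak, so a value above 1.0
    means the backend used less peak memory than native PyTorch.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return baseline_peak_bytes / backend_peak_bytes


# Hypothetical measurements in bytes.
GIB = 1024 ** 3
smaller = compression_ratio(10 * GIB, 8 * GIB)   # backend uses less memory
larger = compression_ratio(10 * GIB, 25 * GIB)   # backend uses more memory
```

Under this reading, the roughly 0.3x to 0.4x values for aot_cudagraphs would mean that backend's peak memory is about two and a half to three times the baseline.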

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | densenet121 | 4 | 1.0028 | 0.9993 | 2.3219 | 1.443 | 5.4438 | 1.3058 | | timm_efficientdet | 1 | 0.9824 | 0.8845 | 0.0 | 0.0 | 4.2758 | 1.526 | | functorch_dp_cifar10 | 64 | 1.0024 | 0.9777 | 2.1532 | 1.1969 | 3.6923 | 1.2407 | | timm_vision_transformer | 8 | 1.0068 | 0.9447 | 1.5339 | 1.3578 | 2.5716 | 1.4121 | | drq | 1 | 1.0315 | 0.8503 | 1.3708 | 1.0638 | 2.4195 | 1.0737 | | resnext50_32x4d | 8 | 1.0007 | 1.079 | 1.2092 | 1.3669 | 2.0959 | 1.2162 | | mobilenet_v3_large | 32 | 1.0078 | 1.1087 | 1.0365 | 1.3781 | 1.9864 | 1.3795 | | BERT_pytorch | 16 | 1.0104 | 0.8854 | 0.0 | 0.0 | 1.9168 | 1.9012 | | resnet18 | 16 | 1.006 | 1.1021 | 1.168 | 1.3958 | 1.8428 | 1.2045 | | pytorch_struct | 200 | 0.9977 | 0.7381 | 0.8734 | 0.8906 | 1.827 | 1.1633 | | lennard_jones | 1000 | 0.976 | 0.8293 | 1.0524 | 1.0142 | 1.818 | 0.9452 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9968 | 0.9377 | 1.2471 | 1.1785 | 1.7636 | 1.3013 | | squeezenet1_1 | 32 | 0.9979 | 0.9923 | 1.0527 | 1.1557 | 1.7406 | 1.2709 | | hf_Albert | 8 | 1.0015 | 0.9976 | 0.752 | 0.0 | 1.6466 | 1.6414 | | dcgan | 32 | 0.9829 | 1.0102 | 1.2585 | 1.1788 | 1.6306 | 1.0725 | | hf_T5_large | 2 | 1.0248 | 0.9068 | 0.0 | 0.0 | 1.5833 | 1.5731 | | speech_transformer | 32 | 1.0038 | 0.9068 | 0.0 | 0.0 | 1.5684 | 1.544 | | shufflenet_v2_x1_0 | 128 | 1.0005 | 1.0532 | 0.8062 | 1.1931 | 1.53 | 1.3689 | | timm_resnest | 32 | 0.9996 | 1.0027 | 0.8044 | 1.1815 | 1.5191 | 1.4517 | | timm_nfnet | 128 | 0.9993 | 0.9999 | 0.0 | 1.2122 | 1.4726 | 1.4222 | | mnasnet1_0 | 32 | 0.9993 | 1.0945 | 0.8568 | 1.2932 | 1.4577 | 1.2734 | | 
mobilenet_v2_quantized_qat | 96 | 1.0016 | 0.978 | 0.0 | 0.0 | 1.4527 | 1.4479 |
| mobilenet_v2 | 96 | 0.9998 | 1.0003 | 0.7313 | 1.0443 | 1.4287 | 1.4088 |
| hf_GPT2 | 4 | 1.0046 | 0.9827 | 0.738 | 0.0 | 1.4239 | 1.4306 |
| soft_actor_critic | 256 | 0.9921 | 0.7715 | 1.1241 | 0.9985 | 1.4185 | 0.9565 |
| resnet50_quantized_qat | 32 | 1.0019 | 0.9619 | 0.0 | 0.0 | 1.401 | 1.3947 |
| fastNLP_Bert | 6 | 0.9997 | 0.9761 | 0.7528 | 0.0 | 1.3686 | 1.3445 |
| timm_efficientnet | 32 | 0.9551 | 0.8076 | 0.7031 | 1.0629 | 1.3353 | 1.2011 |
| LearningToPaint | 96 | 1.0048 | 1.0586 | 0.8687 | 1.2057 | 1.2627 | 1.2074 |
| pytorch_unet | 1 | 1.0001 | 0.9982 | 0.8464 | 1.0765 | 1.2042 | 1.1861 |
| resnet50 | 32 | 0.9994 | 0.9937 | 0.7608 | 1.1612 | 1.204 | 1.1695 |
| Super_SloMo | 6 | 1.0003 | 0.9974 | 0.8669 | 0.0 | 1.18 | 1.1645 |
| hf_Bart | 4 | 1.0127 | 0.9757 | 0.0 | 0.0 | 1.1721 | 1.1653 |
| vgg16 | 64 | 1.0 | 0.999 | 0.859 | 0.9973 | 1.1707 | 1.1652 |
| alexnet | 128 | 0.9991 | 0.998 | 0.8031 | 1.0004 | 1.163 | 1.1651 |
| hf_Bert | 4 | 1.0214 | 0.944 | 0.7306 | 0.0 | 1.1575 | 1.1396 |
| hf_DistilBert | 8 | 0.9999 | 0.9569 | 0.6872 | 0.0 | 1.1481 | 1.1546 |
| timm_regnet | 32 | 0.9653 | 0.9617 | 0.7795 | 1.096 | 1.1283 | 1.0941 |
| pytorch_stargan | 16 | 0.9997 | 0.983 | 0.866 | 0.9896 | 1.1189 | 1.0913 |
| Background_Matting | 4 | 1.0006 | 1.0218 | 0.866 | 1.0816 | 1.1153 | 1.1069 |
| hf_Reformer | 4 | 0.9961 | 0.0 | 0.9267 | 0.0 | 1.1095 | 1.1343 |
| hf_BigBird | 2 | 0.9915 | 0.939 | 0.9612 | 0.0 | 1.0921 | 1.0042 |
| yolov3 | 16 | 1.0 | 0.9954 | 0.7893 | 1.1839 | 1.0795 | 1.0647 |
| attention_is_all_you_need_pytorch | 256 | 0.9999 | 0.9726 | 0.0 | 0.0 | 1.047 | 1.033 |
| timm_vision_transformer_large | 8 | 0.9982 | 0.9912 | 0.0 | 0.9805 | 1.044 | 1.0331 |
| tts_angular | 64 | 0.9937 | 0.964 | 0.9933 | 1.0231 | 1.0136 | 1.0218 |
| timm_vovnet | 32 | 0.9102 | 0.9045 | 0.7132 | 0.9774 | 1.0069 | 1.0176 |
| dlrm | 2048 | 1.0064 | 1.0734 | 0.0 | 0.0 | 1.0006 | 0.0 |
| demucs | 4 | 0.9997 | 0.9998 | 0.999 | 0.9999 | 1.0 | 1.0007 |
| nvidia_deeprecommender | 256 | 0.9994 | 0.9628 | 0.585 | 0.942 | 0.904 | 0.9643 |
| hf_GPT2_large | 4 | 1.0004 | 0.9805 | 0.0 | 0.0 | 0.0 | 1.3706 |
| hf_T5 | 8 | 1.0002 | 0.9932 | 0.0 | 0.0 | 0.0 | 1.5515 |
| tacotron2 | 64 | 0.981 | 0.8581 | 0.0 | 0.0 | 0.0 | 0.9362 |
| hf_Longformer | 2 | 0.9701 | 0.9013 | 0.8196 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass |
| yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | 0.0000 |
| resnet50_quantized_qat | 2 | pass | pass | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy |
| mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
| timm_efficientdet | 1 | 19.5344 | 38.4011 | nan | nan | 484.0577 | 488.767 |
| yolov3 | 16 | 2.7711 | 8.6894 | 11.9084 | 43.4046 | 419.4861 | 419.8955 |
| hf_T5_large | 2 | 13.2998 | 41.15 | nan | nan | 205.3317 | 202.2279 |
| timm_vision_transformer | 8 | 0.7808 | 4.1474 | 5.8215 | 9.3655 | 153.43 | 160.5928 |
| speech_transformer | 32 | 1.5424 | 8.2938 | nan | nan | 152.3735 | 147.9389 |
| timm_resnest | 32 | 0.5383 | 2.6812 | 3.7424 | 35.1306 | 150.1654 | 145.0659 |
| attention_is_all_you_need_pytorch | 256 | 1.0734 | 7.1292 | nan | nan | 137.7387 | 139.7203 |
| timm_vision_transformer_large | 8 | 2.223 | 13.8751 | nan | 24.351 | 126.2802 | 123.9619 |
| pytorch_stargan | 16 | 0.3789 | 2.3643 | 3.1326 | 3.9188 | 107.0355 | 104.0851 |
| pytorch_struct | 200 | 0.2366 | 0.7827 | 1.3456 | 4.0715 | 99.505 | 98.1575 |
| BERT_pytorch | 16 | 1.4194 | 7.614 | nan | nan | 92.0393 | 92.0811 |
| fastNLP_Bert | 6 | 1.4306 | 6.6169 | 10.0451 | nan | 65.652 | 63.418 |
| hf_GPT2 | 4 | 1.2488 | 6.1179 | 8.8738 | nan | 63.5447 | 63.521 |
| hf_Bart | 4 | 1.3924 | 8.089 | nan | nan | 49.9676 | 49.9717 |
| densenet121 | 4 | 1.9897 | 13.3477 | 20.1678 | 88.3763 | 45.0957 | 43.7205 |
| mobilenet_v3_large | 32 | 0.8275 | 4.8204 | 6.7604 | 53.5764 | 44.9158 | 46.9735 |
| hf_Albert | 8 | 1.0066 | 5.8746 | 8.5532 | nan | 41.987 | 41.132 |
| hf_BigBird | 2 | 7.3861 | 13.5387 | 29.953 | nan | 41.2734 | 26.6352 |
| resnet50_quantized_qat | 32 | 1.061 | 9.0448 | nan | nan | 39.8902 | 40.3176 |
| hf_Bert | 4 | 1.312 | 6.2693 | 8.8293 | nan | 39.8395 | 38.7377 |
| timm_regnet | 32 | 2.173 | 8.4238 | 20.7651 | 47.6157 | 37.2439 | 35.16 |
| hf_Reformer | 4 | 2.3483 | nan | 9.1124 | nan | 36.065 | 30.7238 |
| timm_efficientnet | 32 | 1.6787 | 6.665 | 16.1146 | 52.4346 | 34.2419 | 34.4653 |
| mnasnet1_0 | 32 | 0.7461 | 4.4921 | 6.4014 | 30.714 | 31.0909 | 30.7546 |
| resnet50 | 32 | 0.7937 | 4.9477 | 6.925 | 32.2699 | 31.0875 | 29.832 |
| hf_DistilBert | 8 | 0.4278 | 3.0834 | 6.0696 | nan | 30.4362 | 29.5285 |
| resnext50_32x4d | 8 | 0.8239 | 4.9203 | 6.8365 | 28.5464 | 30.2931 | 30.0266 |
| timm_vovnet | 32 | 1.4222 | 4.5909 | 10.441 | 23.5649 | 30.0127 | 29.7463 |
| timm_nfnet | 128 | 1.8844 | 7.7171 | nan | 29.8502 | 29.8712 | 28.8763 |
| mobilenet_v2_quantized_qat | 96 | 1.1759 | 8.8754 | nan | nan | 27.0997 | 27.2946 |
| functorch_dp_cifar10 | 64 | 0.3232 | 1.9699 | 2.8309 | 5.5366 | 26.1947 | 24.9937 |
| resnet18 | 16 | 0.3858 | 1.8912 | 2.6752 | 17.5591 | 23.2902 | 20.4971 |
| shufflenet_v2_x1_0 | 128 | 0.8656 | 5.4261 | 7.6883 | 26.8524 | 18.5748 | 17.9867 |
| Super_SloMo | 6 | 0.9695 | 5.0542 | 6.7627 | nan | 17.3419 | 16.4668 |
| Background_Matting | 4 | 0.6979 | 4.5367 | 6.7144 | 29.2894 | 16.7635 | 16.0163 |
| mobilenet_v2 | 96 | 0.7343 | 4.4782 | 6.6781 | 37.1045 | 16.669 | 16.3002 |
| pytorch_unet | 1 | 0.4223 | 2.1063 | 2.9975 | 19.6418 | 8.2272 | 7.7305 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3535 | 2.202 | 3.0539 | 3.8439 | 8.1719 | 8.0926 |
| LearningToPaint | 96 | 0.4124 | 1.9651 | 2.8324 | 23.8303 | 7.2019 | 6.8944 |
| squeezenet1_1 | 32 | 0.2563 | 0.9557 | 1.3863 | 4.5328 | 4.0598 | 3.8616 |
| nvidia_deeprecommender | 256 | 0.1895 | 0.4298 | 0.6854 | 2.4393 | 4.0142 | 3.7143 |
| drq | 1 | 0.1402 | 0.4424 | 0.8198 | 3.4662 | 3.7694 | 3.1945 |
| vgg16 | 64 | 0.1869 | 0.6441 | 1.0464 | 2.4609 | 3.6811 | 3.2422 |
| dlrm | 2048 | 0.4444 | 0.8198 | nan | nan | 3.4517 | nan |
| soft_actor_critic | 256 | 0.2031 | 0.3372 | 0.4948 | 1.5206 | 3.0611 | 2.6231 |
| alexnet | 128 | 0.1421 | 0.4161 | 0.6606 | 2.3558 | 2.9654 | 2.6911 |
| dcgan | 32 | 0.1641 | 0.4494 | 0.6683 | 3.7309 | 2.678 | 2.4053 |
| lennard_jones | 1000 | 0.1381 | 0.289 | 0.4429 | 1.0648 | 1.9631 | 1.736 |
| tts_angular | 64 | 0.2061 | 0.2786 | 0.3976 | 1.0162 | 1.8605 | 1.6749 |
| demucs | 4 | 0.2929 | 0.2934 | 0.2977 | 0.2969 | 0.2011 | 0.1967 |
| hf_GPT2_large | 4 | 4.9818 | 19.3363 | nan | nan | nan | 143.1625 |
| tacotron2 | 64 | 16.7009 | 28.6252 | nan | nan | nan | 106.378 |
| hf_T5 | 8 | 2.1787 | 9.4406 | nan | nan | nan | 44.804 |
| hf_Longformer | 2 | 5.7342 | 13.862 | 78.3703 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | nan | 1.4314 | 1.4314 |
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | nan | 1.4036 | 1.4036 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2637 | 0.7837 | 1.3107 | 1.3377 |
| Super_SloMo | 6 | 1.0024 | 0.9527 | 0.363 | nan | 1.1858 | 1.1912 |
| timm_efficientdet | 1 | 1.0111 | 0.823 | nan | nan | 1.1165 | 1.1428 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.7638 | 1.1005 | 1.1105 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3374 | 0.9742 | 1.0823 | 1.1267 |
| timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.9478 | 1.0219 | 1.0495 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.35 | 0.8662 | 0.9791 | 1.0072 |
| hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3701 | nan | 0.9703 | 1.1094 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9284 | 0.9323 |
| Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | 0.9749 | 0.9212 | 0.9238 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8814 | 0.9151 | 0.919 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | 1.0085 | 0.9023 | 0.9928 |
| timm_resnest | 32 | 0.9935 | 0.8793 | 0.3235 | 0.8021 | 0.8982 | 0.9697 |
| speech_transformer | 32 | 0.9982 | 0.9159 | nan | nan | 0.896 | 0.8996 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9173 | 0.3919 | 0.9169 | 0.8848 | 0.9654 |
| hf_Albert | 8 | 0.9333 | 0.9333 | 0.2846 | nan | 0.8836 | 1.2215 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8681 | 0.8829 | 0.8964 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8737 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | 0.801 | 0.8616 | 1.0285 |
| pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8496 | 0.859 | 0.8608 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.797 | 0.8564 | 0.8913 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3435 | 0.8551 | 0.8562 | 0.9307 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.3331 | 0.8263 | 0.8531 | 0.8659 |
| hf_Bart | 4 | 0.9617 | 0.8598 | nan | nan | 0.8503 | 1.1284 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3385 | nan | 0.8354 | 1.0952 |
| resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3596 | 0.8203 | 0.8303 | 0.8352 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 | 1.0689 |
| hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4301 | nan | 0.8211 | 1.0393 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.7903 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8772 | 0.7632 | 0.8778 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 | 0.9991 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3305 | 0.8104 | 0.7478 | 0.8187 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7455 | 0.743 | 0.8332 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3201 | 0.7741 | 0.7286 | 0.7339 |
| LearningToPaint | 96 | 0.9442 | 0.6896 | 0.3385 | 0.6503 | 0.7133 | 0.7462 |
| hf_Bert | 4 | 0.9683 | 0.9011 | 0.3525 | nan | 0.7048 | 0.985 |
| dlrm | 2048 | 0.7302 | 0.7305 | nan | nan | 0.7035 | nan |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3593 | 0.6971 | 0.6902 | 0.7049 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3212 | nan | 0.6596 | 0.9466 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6639 | 0.6471 | 0.6497 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 1.0947 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | nan | 0.4682 | 0.6183 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.429 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4456 | 0.8227 | 0.4056 | 0.4212 |
| hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.299 | 0.9882 |
| hf_T5 | 8 | 0.9527 | 0.9415 | nan | nan | nan | 1.1507 |
| tacotron2 | 64 | 0.9906 | 1.093 | nan | nan | nan | 1.1496 |
| hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 |
| hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2945 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0285 | 0.9414 | 0.0 | 0.0 | 3.7345 | 1.5254 |
| CamemBert | 1 | 1.0493 | 0.9732 | 1.3251 | 0.0 | 2.3889 | 1.5405 |
| MT5ForConditionalGeneration | 8 | 1.0272 | 0.9263 | 0.0 | 0.0 | 2.2531 | 1.9848 |
| DistillGPT2 | 1 | 1.0322 | 0.9458 | 1.0657 | 0.0 | 2.099 | 1.9009 |
| MobileBertForMaskedLM | 32 | 1.023 | 0.9232 | 0.0 | 0.0 | 1.9829 | 1.574 |
| GoogleFnet | 1 | 0.9985 | 0.8173 | 0.9815 | 1.1247 | 1.9188 | 1.1214 |
| GPT2ForSequenceClassification | 4 | 1.0002 | 0.9779 | 0.0 | 0.0 | 1.6662 | 1.6568 |
| T5ForConditionalGeneration | 4 | 1.0029 | 0.9667 | 0.0 | 0.0 | 1.4388 | 1.4275 |
| M2M100ForConditionalGeneration | 8 | 1.0412 | 0.8942 | 1.0013 | 0.0 | 1.4178 | 1.4085 |
| MobileBertForQuestionAnswering | 64 | 1.024 | 0.9187 | 0.0 | 0.0 | 1.4036 | 1.2789 |
| ElectraForCausalLM | 32 | 1.0004 | 0.9312 | 0.0 | 0.0 | 1.3702 | 1.4028 |
| ElectraForQuestionAnswering | 64 | 1.0005 | 0.9844 | 0.0 | 0.0 | 1.3541 | 1.3368 |
| AlbertForQuestionAnswering | 4 | 1.0002 | 1.0018 | 0.0 | 0.0 | 1.2567 | 1.2522 |
| AlbertForMaskedLM | 4 | 0.9993 | 0.9996 | 0.0 | 0.0 | 1.25 | 1.2519 |
| LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9892 | 0.7379 | 0.0 | 1.2473 | 1.2318 |
| T5Small | 1 | 1.0191 | 0.9543 | 0.0 | 0.0 | 1.2442 | 1.2308 |
| PLBartForConditionalGeneration | 16 | 1.0124 | 0.9613 | 0.0 | 0.0 | 1.1874 | 1.188 |
| OPTForCausalLM | 32 | 1.0037 | 0.932 | 0.0 | 0.0 | 1.1825 | 1.1983 |
| XGLMForCausalLM | 8 | 1.0128 | 0.9394 | 0.0 | 0.0 | 1.1706 | 1.1753 |
| LayoutLMForMaskedLM | 16 | 1.0002 | 0.971 | 0.0 | 0.0 | 1.1633 | 1.1716 |
| DistilBertForQuestionAnswering | 64 | 0.9997 | 0.985 | 0.7131 | 0.0 | 1.1444 | 1.1262 |
| RobertaForCausalLM | 64 | 1.0004 | 0.9637 | 0.7465 | 0.0 | 1.1133 | 1.1212 |
| Speech2Text2ForCausalLM | 128 | 0.9989 | 0.9259 | 0.6593 | 0.0 | 1.11 | 1.1484 |
| BigBird | 1 | 0.9894 | 0.937 | 0.991 | 0.0 | 1.1023 | 1.0034 |
| BartForCausalLM | 4 | 1.0007 | 0.9668 | 0.0 | 0.0 | 1.0962 | 1.1067 |
| BartForConditionalGeneration | 2 | 1.0009 | 0.9887 | 0.0 | 0.0 | 1.0962 | 1.0896 |
| MegatronBertForQuestionAnswering | 16 | 1.038 | 1.0104 | 0.7572 | 0.0 | 1.0947 | 1.0716 |
| MBartForConditionalGeneration | 16 | 1.0102 | 0.9766 | 0.0 | 0.0 | 1.0887 | 1.0775 |
| DebertaForMaskedLM | 4 | 0.9321 | 0.8111 | 0.7317 | 0.0 | 1.0885 | 1.0732 |
| MegatronBertForCausalLM | 16 | 1.0332 | 1.0027 | 0.7578 | 0.0 | 1.087 | 1.0785 |
| PegasusForConditionalGeneration | 16 | 1.0101 | 0.9819 | 0.7569 | 0.0 | 1.0857 | 1.0825 |
| BertForQuestionAnswering | 128 | 0.9997 | 0.9882 | 0.0 | 0.0 | 1.0722 | 1.0661 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9942 | 0.0 | 0.0 | 1.0696 | 1.0709 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0005 | 0.9265 | 0.0 | 0.0 | 1.0628 | 1.0696 |
| DebertaForQuestionAnswering | 8 | 0.9976 | 0.9917 | 0.6821 | 0.0 | 1.0623 | 1.2025 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.9519 | 0.7122 | 0.0 | 1.0362 | 1.0546 |
| BertForMaskedLM | 64 | 1.0003 | 0.9524 | 0.7302 | 0.0 | 1.0338 | 1.0381 |
| PLBartForCausalLM | 32 | 1.0055 | 0.9348 | 0.7321 | 0.0 | 1.0224 | 1.0494 |
| BlenderbotSmallForCausalLM | 64 | 1.0022 | 0.9105 | 0.6827 | 0.0 | 1.0131 | 1.0345 |
| TrOCRForCausalLM | 32 | 1.0017 | 0.9556 | 0.0 | 0.0 | 0.9981 | 1.0096 |
| MBartForCausalLM | 32 | 1.0013 | 0.9555 | 0.0 | 0.0 | 0.9967 | 1.0069 |
| PegasusForCausalLM | 32 | 0.9998 | 0.953 | 0.7325 | 0.0 | 0.9888 | 1.0008 |
| AllenaiLongformerBase | 1 | 0.953 | 0.7915 | 0.7884 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| GoogleFnet | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| XGLMForCausalLM | 8 | 2.2364 | 12.2125 | nan | nan | 203.4086 | 201.0863 |
| DebertaForMaskedLM | 4 | 4.684 | 11.0814 | 44.7781 | nan | 163.7151 | 106.9608 |
| DebertaForQuestionAnswering | 8 | 4.5483 | 11.6349 | 43.993 | nan | 152.0741 | 118.2059 |
| M2M100ForConditionalGeneration | 8 | 2.7543 | 15.4794 | 23.643 | nan | 128.0751 | 124.2115 |
| YituTechConvBert | 1 | 2.0946 | 9.5284 | nan | nan | 115.4649 | 119.3641 |
| MT5ForConditionalGeneration | 8 | 3.4744 | 13.6659 | nan | nan | 90.4534 | 91.1223 |
| MobileBertForMaskedLM | 32 | 7.7855 | 27.1609 | nan | nan | 88.9601 | 85.7795 |
| MobileBertForQuestionAnswering | 64 | 7.9327 | 27.5186 | nan | nan | 74.7874 | 71.876 |
| MegatronBertForCausalLM | 16 | 3.0219 | 12.5327 | 19.6699 | nan | 61.5191 | 59.8845 |
| MegatronBertForQuestionAnswering | 16 | 3.0691 | 13.2977 | 19.1034 | nan | 60.2609 | 58.2808 |
| LayoutLMForSequenceClassification | 16 | 1.6734 | 6.6917 | 10.1343 | nan | 59.7267 | 60.187 |
| T5ForConditionalGeneration | 4 | 2.1399 | 8.8895 | nan | nan | 58.3394 | 57.0848 |
| PegasusForConditionalGeneration | 16 | 2.6227 | 14.7158 | 24.2283 | nan | 58.1897 | 54.3056 |
| BartForConditionalGeneration | 2 | 2.8248 | 15.0065 | nan | nan | 57.0652 | 54.7753 |
| T5Small | 1 | 2.1902 | 8.9903 | nan | nan | 55.4364 | 53.2137 |
| MBartForConditionalGeneration | 16 | 2.7868 | 15.512 | nan | nan | 54.3119 | 53.1455 |
| PLBartForConditionalGeneration | 16 | 1.3887 | 8.298 | nan | nan | 47.5246 | 46.3964 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.7139 | 10.0168 | nan | nan | 43.6075 | 41.5748 |
| BigBird | 1 | 7.296 | 13.5333 | 29.6711 | nan | 40.7238 | 26.8699 |
| ElectraForCausalLM | 32 | 1.2891 | 6.2441 | nan | nan | 40.6712 | 39.969 |
| DistillGPT2 | 1 | 0.6422 | 3.1221 | 4.4918 | nan | 33.8479 | 32.6814 |
| LayoutLMForMaskedLM | 16 | 1.6131 | 6.6316 | nan | nan | 32.8126 | 32.5964 |
| BertForMaskedLM | 64 | 1.2973 | 6.3901 | 9.4361 | nan | 32.777 | 31.6779 |
| ElectraForQuestionAnswering | 64 | 1.3222 | 6.4111 | nan | nan | 32.5117 | 31.4854 |
| GPT2ForSequenceClassification | 4 | 1.2751 | 6.1953 | nan | nan | 32.0765 | 31.1399 |
| RobertaForCausalLM | 64 | 1.3104 | 6.1902 | 9.2915 | nan | 28.0396 | 27.4422 |
| BertForQuestionAnswering | 128 | 1.3166 | 6.2802 | nan | nan | 27.7294 | 27.1936 |
| PegasusForCausalLM | 32 | 1.0161 | 5.707 | 8.775 | nan | 27.1087 | 25.1376 |
| MBartForCausalLM | 32 | 0.9522 | 5.5767 | nan | nan | 25.4243 | 24.6154 |
| RobertaForQuestionAnswering | 128 | 1.3205 | 6.387 | nan | nan | 24.5494 | 23.8515 |
| TrOCRForCausalLM | 32 | 0.9241 | 5.5701 | nan | nan | 24.4333 | 24.1797 |
| BartForCausalLM | 4 | 1.0079 | 5.6176 | nan | nan | 24.3593 | 23.6588 |
| AlbertForMaskedLM | 4 | 1.1157 | 5.8703 | nan | nan | 23.8611 | 23.0601 |
| GoogleFnet | 1 | 0.7904 | 3.3495 | 10.4595 | 9.6049 | 23.8114 | 16.1369 |
| BlenderbotSmallForCausalLM | 64 | 0.6439 | 3.7467 | 5.6889 | nan | 23.625 | 22.6972 |
| DistilBertForMaskedLM | 64 | 0.4729 | 2.9552 | 5.8879 | nan | 23.0127 | 22.634 |
| AlbertForQuestionAnswering | 4 | 1.1461 | 5.9483 | nan | nan | 22.7287 | 21.5179 |
| OPTForCausalLM | 32 | 1.0353 | 5.881 | nan | nan | 21.8562 | 20.7457 |
| DistilBertForQuestionAnswering | 64 | 0.4816 | 3.0171 | 5.9235 | nan | 21.8186 | 22.1039 |
| CamemBert | 1 | 1.38 | 6.1479 | 8.5874 | nan | 21.7413 | 21.2151 |
| Speech2Text2ForCausalLM | 128 | 0.577 | 2.9045 | 4.6098 | nan | 19.6271 | 18.24 |
| PLBartForCausalLM | 32 | 0.4938 | 2.9552 | 4.3734 | nan | 18.8954 | 18.2071 |
| AllenaiLongformerBase | 1 | 5.9078 | 14.4262 | 80.0409 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | nan | 1.0596 | 1.1223 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | nan | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9155 | nan | nan | 0.8564 | 1.0758 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | nan | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | nan | 0.842 | 1.3737 |
| BigBird | 1 | 0.999 | 0.9542 | 0.4215 | nan | 0.8224 | 1.0108 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.8215 | 1.1049 |
| DistillGPT2 | 1 | 0.9984 | 0.8218 | 0.3795 | nan | 0.8173 | 0.9383 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | nan | nan | 0.8157 | 0.9642 |
| YituTechConvBert | 1 | 0.9858 | 0.8198 | nan | nan | 0.808 | 0.8738 |
| BartForConditionalGeneration | 2 | 1.0 | 0.893 | nan | nan | 0.7817 | 0.9515 |
| PegasusForCausalLM | 32 | 0.9593 | 0.9232 | 0.3909 | nan | 0.7774 | 0.9692 |
| M2M100ForConditionalGeneration | 8 | 1.007 | 0.9507 | 0.3799 | nan | 0.7712 | 1.016 |
| GoogleFnet | 1 | 0.9983 | 0.9453 | 0.3715 | 1.0813 | 0.7698 | 0.9373 |
| MT5ForConditionalGeneration | 8 | 1.0034 | 0.8861 | nan | nan | 0.7623 | 0.9396 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | nan | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3614 | nan | 0.7492 | 0.9186 |
| PLBartForConditionalGeneration | 16 | 1.0 | 0.8743 | nan | nan | 0.7397 | 0.9638 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | nan | 0.7381 | 0.9055 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | nan | nan | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | nan | 0.7189 | 1.0246 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | nan | 0.7161 | 0.9248 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | nan | nan | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | nan | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3178 | nan | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | nan | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | nan | nan | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | nan | nan | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | nan | 0.6775 | 0.8801 |
| OPTForCausalLM | 32 | 0.9982 | 0.8655 | nan | nan | 0.6761 | 0.8847 |
| ElectraForCausalLM | 32 | 0.9994 | 0.883 | nan | nan | 0.6731 | 0.905 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | nan | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | nan | 0.6385 | 0.8993 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3641 | nan | 0.6375 | 0.8975 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | nan | nan | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | nan | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | nan | 0.4267 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | nan | 0.3264 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9981 | 0.9515 | 0.3209 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
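
For anyone who wants to summarize these tables rather than read them row by row, here is a minimal sketch that parses one grid table and computes a per-backend geometric mean. The helper names `parse_grid_table` and `geomean_speedup` are hypothetical and not part of the benchmark tooling. It assumes that a speedup of `0.0` (or `nan`) means the backend did not run for that model, which matches the `fail_to_run` entries in the accuracy tables, and the choice of a geometric mean is likewise an assumption. The sample is a trimmed excerpt of the huggingface speedup table with some columns dropped.

```python
import math


def parse_grid_table(text):
    """Parse an ASCII grid table into a list of dicts keyed by column name.

    Rows start with '|'; border lines start with '+' and are skipped.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(records, backend):
    """Return (geometric mean, count) over models where the backend ran.

    Assumption: 0.0, nan, and non-numeric cells mark a failed run and are skipped.
    """
    values = []
    for rec in records:
        try:
            value = float(rec[backend])
        except (KeyError, ValueError):
            continue
        if value > 0 and not math.isnan(value):
            values.append(value)
    if not values:
        return float("nan"), 0
    return math.exp(sum(math.log(v) for v in values) / len(values)), len(values)


# Trimmed excerpt of the huggingface speedup table (columns dropped for brevity).
sample = """
+-----------------------+----+--------+-----------+----------+
| name | bs | eager | aot_eager | inductor |
+-----------------------+----+--------+-----------+----------+
| YituTechConvBert | 1 | 1.0285 | 0.9414 | 3.7345 |
| CamemBert | 1 | 1.0493 | 0.9732 | 2.3889 |
| AllenaiLongformerBase | 1 | 0.953 | 0.7915 | 0.0 |
+-----------------------+----+--------+-----------+----------+
"""

records = parse_grid_table(sample)
gm, n = geomean_speedup(records, "inductor")
print(f"inductor: geomean {gm:.4f} over {n} models")
```

Reporting the count next to the mean matters here: a backend that fails on many models can otherwise look better than one that runs everything, because only its successful runs are averaged.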

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9992 | 0.9956 | 0.8421 | 1.2485 | 1.8144 | 1.7733 |
| lcnet_050 | 128 | 0.9568 | 0.9489 | 0.7675 | 1.4962 | 1.6425 | 1.6316 |
| coat_lite_mini | 128 | 1.0 | 1.0 | 0.8447 | 1.0566 | 1.6056 | 1.5895 |
| regnety_002 | 128 | 0.9778 | 0.9844 | 0.8615 | 1.3561 | 1.4813 | 1.3447 |
| dm_nfnet_f0 | 128 | 1.0 | 1.0003 | 0.0 | 1.2124 | 1.4725 | 1.422 |
| xcit_large_24_p8_224 | 5 | 1.003 | 1.0032 | 0.0 | 0.0 | 1.4529 | 1.4094 |
| hrnet_w18 | 128 | 0.9999 | 0.9985 | 0.0 | 1.3201 | 1.418 | 1.3775 |
| volo_d1_224 | 64 | 0.9999 | 0.9959 | 0.0 | 1.1295 | 1.3859 | 1.3634 |
| dla102 | 128 | 1.0002 | 1.0008 | 0.0 | 1.2853 | 1.3821 | 1.3693 |
| nfnet_l0 | 128 | 0.9997 | 0.7891 | 0.0 | 1.0518 | 1.3733 | 1.3288 |
| res2net50_14w_8s | 128 | 0.9999 | 1.0 | 0.0 | 1.2307 | 1.3564 | 1.3208 |
| mobilenetv2_100 | 128 | 0.9662 | 0.9648 | 0.7065 | 1.0145 | 1.3373 | 1.3526 |
| mobilenetv3_large_100 | 128 | 0.9664 | 0.9632 | 0.7654 | 1.1624 | 1.3356 | 1.3413 |
| crossvit_9_240 | 128 | 0.9999 | 0.9988 | 0.0 | 1.0243 | 1.3305 | 1.3051 |
| adv_inception_v3 | 128 | 1.0 | 0.999 | 0.0 | 1.1253 | 1.328 | 1.3083 |
| gluon_inception_v3 | 128 | 1.0 | 0.9988 | 0.0 | 1.1224 | 1.3249 | 1.3075 |
| inception_v3 | 128 | 1.0 | 0.999 | 0.0 | 1.1257 | 1.3244 | 1.3076 |
| res2next50 | 128 | 1.0 | 1.0009 | 0.0 | 1.166 | 1.3121 | 1.2748 |
| resnest101e | 64 | 1.0001 | 1.0035 | 0.0 | 1.1963 | 1.3115 | 1.2714 |
| gmixer_24_224 | 128 | 0.9999 | 0.8348 | 0.0 | 0.98 | 1.2974 | 1.2696 |
| fbnetv3_b | 128 | 0.9642 | 0.9614 | 0.7623 | 1.1326 | 1.283 | 1.2951 |
| botnet26t_256 | 128 | 0.9851 | 0.9857 | 0.7892 | 1.2271 | 1.2742 | 1.2801 |
| jx_nest_base | 32 | 0.9998 | 0.9926 | 0.0 | 1.217 | 1.2725 | 1.2481 |
| sebotnet33ts_256 | 64 | 0.9753 | 0.8072 | 0.0 | 1.0528 | 1.2706 | 1.2762 |
| eca_botnext26ts_256 | 128 | 0.9867 | 0.7721 | 0.0 | 1.0301 | 1.2706 | 1.2477 |
| selecsls42b | 128 | 0.9998 | 0.9991 | 0.8157 | 1.2083 | 1.2671 | 1.2514 |
| tf_efficientnet_b0 | 128 | 0.9776 | 0.7843 | 0.0 | 0.9848 | 1.2613 | 1.2686 |
| mnasnet_100 | 128 | 0.9663 | 0.9639 | 0.7855 | 1.1575 | 1.2598 | 1.2787 |
| eca_halonext26ts | 128 | 0.9877 | 0.7787 | 0.0 | 1.0289 | 1.2502 | 1.2494 |
| fbnetc_100 | 128 | 0.967 | 0.9622 | 0.7908 | 1.1879 | 1.2497 | 1.2635 |
| ese_vovnet19b_dw | 128 | 0.9795 | 0.9777 | 0.7445 | 1.1452 | 1.2404 | 1.2461 |
| spnasnet_100 | 128 | 0.9605 | 0.9573 | 0.7734 | 1.1366 | 1.2375 | 1.2543 |
| cspdarknet53 | 64 | 0.9581 | 0.9526 | 0.7322 | 1.1835 | 1.2287 | 1.2391 |
| res2net101_26w_4s | 64 | 0.9997 | 0.9972 | 0.7705 | 1.1739 | 1.2283 | 1.1885 |
| convit_base | 64 | 0.9998 | 0.9992 | 0.0 | 1.195 | 1.2216 | 1.2164 |
| pit_b_224 | 64 | 1.0001 | 0.9996 | 0.0 | 1.055 | 1.221 | 1.211 |
| gmlp_s16_224 | 128 | 1.0 | 0.9994 | 0.0 | 0.9989 | 1.2164 | 1.2053 |
| rexnet_100 | 128 | 0.9723 | 0.8169 | 0.0 | 0.9835 | 1.2142 | 1.2193 |
| pnasnet5large | 16 | 0.9998 | 0.9985 | 0.0 | 1.0838 | 1.2112 | 1.1932 |
| tinynet_a | 128 | 0.9659 | 0.7757 | 0.6205 | 0.9713 | 1.1925 | 1.1949 |
| cait_m36_384 | 4 | 0.9998 | 0.0 | 0.0 | 0.0 | 1.1826 | 1.158 |
| tf_mixnet_l | 128 | 0.9853 | 0.8897 | 0.0 | 1.0177 | 1.173 | 1.1697 |
| dpn107 | 32 | 0.958 | 0.9367 | 0.7817 | 1.0288 | 1.1726 | 1.202 |
| mobilevit_s | 64 | 0.9792 | 0.762 | 0.0 | 0.9468 | 1.1702 | 1.1666 |
| repvgg_a2 | 128 | 0.9641 | 0.9623 | 0.8288 | 1.1224 | 1.1692 | 1.1652 |
| poolformer_m36 | 64 | 0.9998 | 0.9993 | 0.0 | 0.0 | 1.1661 | 1.1475 |
| mixnet_l | 128 | 0.9849 | 0.8858 | 0.0 | 1.0185 | 1.1534 | 1.1505 |
| twins_pcpvt_base | 64 | 1.0001 | 0.9974 | 0.75 | 1.0624 | 1.148 | 1.1172 |
| swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9785 | 0.0 |
0.9932 | 1.1469 | 1.1322 | | convnext_base | 64 | 0.9999 | 0.9988 | 0.0 | 1.0441 | 1.1157 | 1.1262 | | beit_base_patch16_224 | 64 | 0.9998 | 0.9801 | 0.0 | 0.9504 | 1.1141 | 1.1053 | | swsl_resnext101_32x16d | 32 | 1.0001 | 0.9988 | 0.0 | 1.1071 | 1.1068 | 1.0712 | | deit_base_distilled_patch16_224 | 64 | 1.0 | 0.9995 | 0.7673 | 1.0156 | 1.0955 | 1.0834 | | gluon_xception65 | 32 | 0.9998 | 0.9975 | 0.0 | 1.0403 | 1.0871 | 1.0759 | | vit_base_patch16_224 | 64 | 1.0002 | 0.999 | 0.7662 | 0.9763 | 1.0855 | 1.0734 | | mixer_b16_224 | 128 | 1.0006 | 1.0001 | 0.0 | 0.9771 | 1.0808 | 1.0736 | | convmixer_768_32 | 32 | 0.9999 | 1.0002 | 0.0 | 1.0615 | 1.0783 | 1.0744 | | gernet_l | 128 | 0.9744 | 0.9723 | 0.8239 | 1.0992 | 1.075 | 1.0704 | | visformer_small | 128 | 1.0001 | 1.0022 | 0.797 | 1.0217 | 1.0495 | 1.0162 | | resmlp_12_224 | 128 | 0.9999 | 1.001 | 0.6956 | 0.0 | 0.9499 | 0.9719 | | tnt_s_patch16_224 | 128 | 1.0 | 0.9992 | 0.0 | 1.6263 | 0.0 | 1.5436 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | 
pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | jx_nest_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | 
pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilevit_s | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | 
aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | twins_pcpvt_base | 64 | 2.064 | 13.0072 | 21.5012 | 42.855 | 431.1592 | 426.4103 | | coat_lite_mini | 128 | 1.0194 | 5.4653 | 7.961 | 14.7686 | 362.4216 | 372.6703 | | mobilevit_s | 64 | 1.5683 | 7.1641 | nan | 42.4621 | 233.8428 | 237.9062 | | eca_halonext26ts | 128 | 1.4144 | 5.4751 | nan | 55.2357 | 204.8437 | 207.0974 | | sebotnet33ts_256 | 64 | 1.7651 | 6.6709 | nan | 51.039 | 185.8238 | 191.2608 | | eca_botnext26ts_256 | 128 | 1.3797 | 5.2911 | nan | 52.9221 | 179.8768 | 176.7545 | | swin_base_patch4_window7_224 | 64 | 2.5123 | 12.7354 | nan | 58.0591 | 177.0112 | 174.7488 | | xcit_large_24_p8_224 | 5 | 2.603 | 17.1709 | nan | nan | 172.3324 | 164.8544 | | jx_nest_base | 32 | 1.6708 | 9.2321 | nan | 57.8786 | 155.4547 | 156.5451 | | convnext_base | 64 | 1.2341 | 5.9929 | nan | 20.8438 | 133.0295 | 129.8216 | | cait_m36_384 | 4 | 2.6486 | nan | nan | nan | 132.7509 | 130.12 | | hrnet_w18 | 128 | 5.6217 | 31.9848 | nan | 251.7181 | 106.8258 | 100.7524 | | botnet26t_256 | 128 | 1.3057 | 4.4635 | 10.0598 | 40.2751 | 106.2411 | 103.5341 | | crossvit_9_240 | 128 | 1.3396 | 7.9862 | nan | 27.0701 | 97.9064 | 96.8689 | | resnest101e | 64 | 2.998 | 16.9945 | nan | 78.2291 | 93.9541 | 89.7619 | | pnasnet5large | 16 | 4.1626 | 22.9703 | nan | 123.7628 | 87.4338 | 84.1545 | | volo_d1_224 | 64 | 1.1595 | 7.6273 | nan | 28.0879 | 85.2424 | 83.6849 | | gmlp_s16_224 | 128 | 0.9511 | 6.2939 | nan | 13.365 | 71.7498 | 69.4367 | | visformer_small | 128 | 0.9009 | 4.189 | 6.2793 | 24.3038 | 71.1462 | 69.6831 | | pit_b_224 | 64 | 0.9339 | 4.8631 | nan | 12.5251 | 66.2774 | 65.1378 | | res2net101_26w_4s | 64 | 2.9852 | 17.3432 | 28.4155 | 80.897 | 55.6027 | 52.0513 | | gmixer_24_224 | 128 | 1.0133 | 7.3092 | nan | 16.5474 | 51.9895 | 50.5586 | | convit_base | 64 | 0.9843 | 5.9421 | nan | 18.0525 | 
50.9922 | 49.952 | | res2net50_14w_8s | 128 | 2.5693 | 15.6494 | nan | 98.8662 | 50.8157 | 49.7271 | | gluon_xception65 | 32 | 1.6885 | 11.1965 | nan | 41.7582 | 49.2318 | 45.5937 | | poolformer_m36 | 64 | 1.8121 | 9.7062 | nan | nan | 47.0371 | 44.6651 | | resmlp_12_224 | 128 | 0.6088 | 2.794 | 5.5064 | nan | 42.3381 | 38.0426 | | swsl_resnext101_32x16d | 32 | 1.6289 | 10.0288 | nan | 39.6141 | 41.9677 | 41.3616 | | dpn107 | 32 | 3.7727 | 14.7274 | 45.6394 | 76.1359 | 40.3245 | 37.6555 | | mixer_b16_224 | 128 | 0.6548 | 3.2155 | nan | 10.7856 | 37.0102 | 35.4768 | | deit_base_distilled_patch16_224 | 64 | 0.8289 | 4.303 | 6.6094 | 10.4203 | 36.0592 | 34.6956 | | convmixer_768_32 | 32 | 1.0862 | 6.4498 | nan | 13.7196 | 35.8067 | 33.0945 | | fbnetv3_b | 128 | 3.0734 | 11.1026 | 29.9803 | 76.0043 | 35.7771 | 33.8855 | | vit_base_patch16_224 | 64 | 0.8583 | 4.1826 | 6.5315 | 9.6845 | 35.7583 | 35.0589 | | gluon_inception_v3 | 128 | 1.4815 | 8.9849 | nan | 66.9443 | 35.0345 | 32.4497 | | inception_v3 | 128 | 1.4787 | 9.0238 | nan | 67.1459 | 34.8548 | 32.5473 | | adv_inception_v3 | 128 | 1.4876 | 8.9769 | nan | 66.9311 | 34.3905 | 32.5332 | | tf_mixnet_l | 128 | 5.7484 | 13.3541 | nan | 68.7911 | 33.8729 | 32.1963 | | ghostnet_100 | 128 | 2.6432 | 9.6507 | 13.7666 | 58.927 | 32.695 | 30.8681 | | beit_base_patch16_224 | 64 | 1.0871 | 5.6134 | nan | 13.7621 | 32.6318 | 30.8008 | | mixnet_l | 128 | 5.3204 | 12.7271 | nan | 67.9763 | 32.5983 | 31.893 | | dm_nfnet_f0 | 128 | 2.0094 | 7.6042 | nan | 29.9754 | 32.3805 | 29.3454 | | dla102 | 128 | 1.6603 | 10.0975 | nan | 63.1714 | 32.1124 | 30.2312 | | res2next50 | 128 | 1.4989 | 8.7791 | nan | 66.7002 | 29.6202 | 27.9053 | | rexnet_100 | 128 | 1.8062 | 7.4568 | nan | 102.1027 | 26.5523 | 25.3591 | | tinynet_a | 128 | 1.9614 | 8.2078 | 20.2872 | 61.7507 | 25.7941 | 24.6542 | | cspdarknet53 | 64 | 2.2264 | 7.7188 | 20.8213 | 48.0307 | 23.2515 | 22.0433 | | nfnet_l0 | 128 | 1.7245 | 7.5828 | nan | 27.3095 | 23.1165 | 21.8966 | 
| tf_efficientnet_b0 | 128 | 1.7202 | 6.9673 | nan | 61.9316 | 22.7574 | 21.5149 | | fbnetc_100 | 128 | 1.9567 | 6.9499 | 18.078 | 45.3002 | 21.9517 | 20.7368 | | spnasnet_100 | 128 | 1.9161 | 6.665 | 17.4815 | 43.4797 | 21.4795 | 20.4556 | | mobilenetv3_large_100 | 128 | 1.5899 | 5.5688 | 13.4352 | 64.4429 | 19.9372 | 19.5642 | | mnasnet_100 | 128 | 1.6356 | 5.5127 | 14.0767 | 37.4665 | 18.8558 | 18.0133 | | mobilenetv2_100 | 128 | 1.6442 | 5.4933 | 13.7945 | 37.5793 | 18.5669 | 17.7858 | | gernet_l | 128 | 1.8816 | 6.4469 | 16.2236 | 35.9904 | 18.4345 | 17.2115 | | repvgg_a2 | 128 | 1.8567 | 6.1905 | 15.7371 | 43.751 | 17.9569 | 16.9557 | | regnety_002 | 128 | 1.4855 | 5.8417 | 13.8786 | 46.2472 | 17.8219 | 17.3541 | | selecsls42b | 128 | 0.7717 | 4.0352 | 5.8995 | 39.8612 | 16.4046 | 15.3492 | | lcnet_050 | 128 | 0.9705 | 3.4278 | 7.1291 | 31.167 | 13.6937 | 12.51 | | ese_vovnet19b_dw | 128 | 0.9768 | 3.251 | 6.9304 | 30.8107 | 12.7375 | 11.8284 | | tnt_s_patch16_224 | 128 | 1.4723 | 10.2065 | nan | 22.8828 | nan | 50.0197 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 0.9859 | 1.5612 | 1.6333 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.7823 | 1.351 | 1.3692 | | nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.8084 | 1.2908 | 1.3392 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 0.8682 | 1.2619 | 1.2765 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.8401 | 1.1889 | 1.199 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.2062 | 
1.1876 | 1.3282 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | 0.7405 | 1.1793 | 1.2286 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | 0.7612 | 1.1378 | 1.2076 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | 0.7643 | 1.1375 | 1.2068 | | cait_m36_384 | 4 | 0.9994 | nan | nan | nan | 1.1185 | 1.1745 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.7635 | 1.1003 | 1.1104 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.069 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.9479 | 1.0218 | 1.0495 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.8606 | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.95 | 0.9994 | 1.0025 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 0.8229 | 0.997 | 1.0835 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 0.8242 | 0.9925 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | 0.8403 | 0.9888 | 1.0866 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9345 | 0.9853 | 1.0102 | | mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 0.8571 | 0.985 | 1.0538 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | 0.9793 | 0.9836 | 0.9853 | | volo_d1_224 | 64 | 0.996 | 0.9213 | nan | 0.7472 | 0.9799 | 0.9971 | | gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 0.9704 | 0.9766 | 0.9827 | | tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9711 | 1.0812 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.784 | 0.9696 | 0.977 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | nan | nan | 0.9611 | 1.0549 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.7604 | 0.9576 | 0.9855 | | dla102 | 128 | 0.9831 | 0.917 | nan | 0.9529 | 0.9496 | 0.9538 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8649 | 0.9376 | 0.9419 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8982 | 0.9351 | 0.9376 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9269 | 0.9548 | | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | 0.7112 | 0.9187 | 1.0509 | | 
ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9302 | 0.9095 | 0.9161 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | 0.83 | 0.9068 | 1.0518 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.8941 | 0.9058 | 0.956 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.8618 | 0.9051 | 0.9312 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9047 | 0.9157 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9014 | 1.0067 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8745 | 0.9007 | 0.9126 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | 0.9475 | 0.9006 | 0.951 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8954 | 0.899 | 0.9192 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.8961 | 0.9077 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | 0.9249 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7524 | 0.8921 | 0.923 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8762 | 0.8835 | 0.8875 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8611 | 0.881 | 0.9327 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7599 | 0.8617 | 0.8993 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | 0.745 | 0.8605 | 0.8702 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 0.6417 | 0.8417 | 1.0633 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.8416 | 0.8498 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | 0.6831 | 0.841 | 0.9711 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.7873 | 0.8404 | 1.0528 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | nan | 0.8169 | 0.8253 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.8234 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6573 | 0.7684 | 0.8011 | | convit_base | 64 | 
0.9977 | 0.8838 | nan | 0.9506 | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8657 | nan | 0.7297 | 0.6496 | 0.8704 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | 0.8539 | nan | 0.8623 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/YAmdJK1.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/pXVOi3s.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/nN0yzdO.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint compression ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.
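
As a rough illustration of what "normalizing against native PyTorch" means, the sketch below divides the baseline latency by the compiled latency, so values above 1.0 mean the backend is faster. The function name and the use of a median over repeated timings are assumptions made for this example; the benchmark harness may aggregate timings differently.

```python
from statistics import median


def speedup(baseline_latencies, compiled_latencies):
    """Hypothetical helper: speedup of a backend relative to native PyTorch.

    Both arguments are lists of per-run latencies in seconds. Using the
    median of each list is an assumption, not the harness's confirmed method.
    """
    return median(baseline_latencies) / median(compiled_latencies)


# Made-up timings: the compiled model takes half as long as the baseline.
print(speedup([0.20, 0.22, 0.20], [0.10, 0.10, 0.11]))
```

With this convention, a speedup below 1.0 means the backend is slower than native PyTorch.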

To measure performance, compilation latency, and memory footprint reduction, we exclude the models that fail the accuracy checks.
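
The filtering step can be sketched as follows. The helper name is hypothetical, and treating only the exact status `pass` as passing is an assumption (the tables also contain `pass_due_to_skip`, and how the dashboard counts it is not stated). The sample values are the inductor entries for four models from the timm_models float32 tables earlier in this thread.

```python
def filter_passing(metric_by_model, accuracy_by_model):
    """Hypothetical helper: keep metrics only for models that passed accuracy.

    Assumption: only the exact status "pass" counts as passing.
    """
    return {
        name: value
        for name, value in metric_by_model.items()
        if accuracy_by_model.get(name) == "pass"
    }


# inductor speedups and accuracy statuses copied from the timm float32 tables.
inductor_speedup = {
    "ghostnet_100": 1.8144,
    "lcnet_050": 1.6425,
    "fbnetv3_b": 1.283,
    "resnest101e": 1.3115,
}
inductor_accuracy = {
    "ghostnet_100": "pass",
    "lcnet_050": "pass",
    "fbnetv3_b": "fail_accuracy",
    "resnest101e": "fail_accuracy",
}

kept = filter_passing(inductor_speedup, inductor_accuracy)
print(sorted(kept))
```

Only the filtered dictionary would then feed into the aggregate speedup, latency, and memory numbers.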

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 52/53 | 100%, 42/42 | 100%, 61/61 |
|       aot_eager        | 98%, 52/53 | 100%, 42/42 | 97%, 59/61  |
|     aot_cudagraphs     | 75%, 40/53 | 55%, 23/42  | 80%, 49/61  |
|      aot_nvfuser       | 60%, 32/53 |  0%, 0/42   | 87%, 53/61  |
|        inductor        | 87%, 46/53 | 93%, 39/42  | 93%, 57/61  |
| inductor_no_cudagraphs | 89%, 47/53 | 93%, 39/42  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
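
Each pass-rate cell is a passed/total count together with a rounded percentage. A minimal sketch of that formatting, using a hypothetical helper name and assuming the percentage is rounded to the nearest integer:

```python
def format_pass_rate(passed, total):
    """Hypothetical helper: render a pass-rate cell such as '98%, 52/53'."""
    percent = 100 * passed / total
    return f"{percent:.0f}%, {passed}/{total}"


# Counts taken from the table above.
print(format_pass_rate(52, 53))  # eager on torchbench
print(format_pass_rate(23, 42))  # aot_cudagraphs on huggingface
```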

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.01x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.19x    |    1.05x    |    1.00x    |
|      aot_nvfuser       |   1.16x    |    0.0x     |    1.18x    |
|        inductor        |   1.82x    |    1.79x    |    1.42x    |
| inductor_no_cudagraphs |   1.36x    |    1.54x    |    1.37x    |
+------------------------+------------+-------------+-------------+
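
For reference, a geometric mean over per-model speedups can be computed as the exponential of the mean log speedup. The sketch below is an illustration, not the dashboard's actual code: dropping non-positive entries (models that failed and were recorded as 0.0) and returning 0.0 when nothing remains are both assumptions.

```python
import math


def geomean_speedup(speedups):
    """Hypothetical helper: geometric mean of the positive speedups.

    Assumptions: entries <= 0 mark failed runs and are skipped; an empty
    result is reported as 0.0.
    """
    valid = [s for s in speedups if s > 0]
    if not valid:
        return 0.0
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Made-up values: the 0.0 entry is ignored, leaving sqrt(2 * 8) = 4.
print(round(geomean_speedup([2.0, 8.0, 0.0]), 4))
```

The geometric mean is the usual choice for averaging ratios, because a 2x speedup and a 0.5x slowdown cancel out to 1.0x.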

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.29    |    2.65     |    2.11     |
|       aot_eager        |    8.47    |    12.63    |    11.01    |
|     aot_cudagraphs     |   10.99    |    21.63    |    20.31    |
|      aot_nvfuser       |   26.97    |     0.0     |    68.40    |
|        inductor        |   57.44    |    62.79    |    89.06    |
| inductor_no_cudagraphs |   60.44    |    57.49    |    87.16    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.42x    |    0.38x    |    0.32x    |
|      aot_nvfuser       |   0.83x    |    0.0x     |    0.84x    |
|        inductor        |   0.83x    |    0.91x    |    0.95x    |
| inductor_no_cudagraphs |   0.92x    |    1.08x    |    1.01x    |
+------------------------+------------+-------------+-------------+
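
Since higher is better, the ratio is presumably the native PyTorch peak memory divided by the backend's peak memory. That direction is an assumption inferred from the column heading, and the helper below is a hypothetical sketch rather than the harness implementation.

```python
def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Hypothetical helper: peak memory compression ratio (higher is better).

    Assumption: ratio = baseline peak memory / backend peak memory, so a
    value below 1.0 means the backend uses more memory than native PyTorch.
    """
    if compiled_peak_bytes <= 0:
        raise ValueError("compiled_peak_bytes must be positive")
    return eager_peak_bytes / compiled_peak_bytes


# Made-up numbers: the backend peaks at 10 GB where the baseline peaks at 8 GB.
print(compression_ratio(8, 10))
```

Under this reading, the 0.32x to 0.42x values for aot_cudagraphs would mean roughly 2.4x to 3x the baseline peak memory.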

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | densenet121 | 4 | 1.0015 | 0.903 | 2.5542 | 1.3974 | 5.8305 | 1.3362 | | functorch_dp_cifar10 | 64 | 1.0051 | 0.9122 | 2.4037 | 1.1922 | 4.9773 | 1.3812 | | timm_efficientdet | 1 | 0.9859 | 0.8024 | 0.0 | 0.0 | 4.6441 | 1.5539 | | resnext50_32x4d | 8 | 1.0013 | 0.9432 | 1.8333 | 1.3273 | 3.8723 | 1.2702 | | timm_vision_transformer | 8 | 1.008 | 0.8501 | 1.742 | 1.3624 | 3.2471 | 1.5463 | | BERT_pytorch | 16 | 1.0097 | 0.8329 | 0.0 | 0.0 | 3.1928 | 2.3713 | | mobilenet_v3_large | 32 | 1.0032 | 1.0002 | 1.4877 | 1.3516 | 3.0229 | 1.4325 | | drq | 1 | 1.0115 | 0.791 | 1.7557 | 1.0892 | 3.0111 | 1.1618 | | resnet18 | 16 | 0.9998 | 0.9836 | 1.6375 | 1.329 | 2.7965 | 1.2751 | | mnasnet1_0 | 32 | 0.9991 | 1.0096 | 1.2564 | 1.3313 | 2.5975 | 1.367 | | dcgan | 32 | 0.9794 | 0.9124 | 1.6954 | 0.7727 | 2.5528 | 1.0745 | | hf_T5_large | 2 | 1.0234 | 0.8593 | 0.0 | 0.0 | 2.4405 | 2.1154 | | squeezenet1_1 | 32 | 0.9965 | 0.9462 | 1.4589 | 1.1787 | 2.4194 | 1.3043 | | hf_Albert | 8 | 1.0007 | 0.9558 | 0.7735 | 0.0 | 2.3758 | 2.3241 | | timm_efficientnet | 32 | 0.9581 | 0.8098 | 1.1659 | 1.1758 | 2.2672 | 1.3034 | | pytorch_struct | 200 | 0.9935 | 0.7348 | 1.0231 | 0.9919 | 2.1166 | 1.2655 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9991 | 0.9109 | 1.7221 | 1.2064 | 2.1135 | 1.3964 | | lennard_jones | 1000 | 0.9772 | 0.7441 | 1.2718 | 1.0356 | 2.0608 | 1.0576 | | hf_Bart | 4 | 1.0103 | 0.8449 | 0.0 | 0.0 | 2.0391 | 1.6735 | | hf_Bert | 4 | 1.0349 | 0.8602 | 0.9399 | 0.0 | 1.9983 | 1.8402 | | resnet50 | 32 | 1.0022 | 1.0033 | 1.0301 | 1.3595 | 1.9266 | 1.3564 | | timm_resnest | 32 | 1.0048 | 1.0197 
| 0.8351 | 1.312 | 1.9241 | 1.6769 | | hf_GPT2 | 4 | 1.0171 | 0.9849 | 0.0 | 0.0 | 1.8692 | 1.8085 | | LearningToPaint | 96 | 1.0017 | 0.995 | 1.1601 | 1.3515 | 1.8513 | 1.311 | | hf_T5 | 8 | 1.0011 | 0.9463 | 0.0 | 0.0 | 1.8365 | 1.8371 | | soft_actor_critic | 256 | 1.0145 | 0.732 | 1.3397 | 1.0578 | 1.7429 | 1.037 | | speech_transformer | 32 | 1.003 | 0.8409 | 0.0 | 0.0 | 1.7106 | 1.6704 | | shufflenet_v2_x1_0 | 128 | 1.003 | 1.0117 | 0.9602 | 1.3374 | 1.7071 | 1.4237 | | mobilenet_v2 | 96 | 0.9998 | 1.013 | 0.7636 | 0.9257 | 1.5601 | 1.4988 | | attention_is_all_you_need_pytorch | 256 | 1.0093 | 0.9266 | 0.0 | 0.0 | 1.5239 | 1.4695 | | timm_nfnet | 128 | 0.9988 | 0.9993 | 0.0 | 1.1742 | 1.4979 | 1.4301 | | fastNLP_Bert | 6 | 0.9993 | 0.8745 | 0.7662 | 0.0 | 1.4721 | 1.4183 | | hf_DistilBert | 8 | 1.0013 | 0.9725 | 0.7339 | 0.0 | 1.4625 | 1.4374 | | pytorch_unet | 1 | 0.9996 | 0.9928 | 0.8627 | 1.1557 | 1.3444 | 1.3151 | | pytorch_stargan | 16 | 0.9962 | 1.042 | 0.9686 | 1.0909 | 1.3115 | 1.2631 | | timm_vovnet | 32 | 0.9181 | 0.8811 | 0.8605 | 1.1327 | 1.2991 | 1.1483 | | timm_regnet | 32 | 0.9787 | 0.9334 | 0.8852 | 1.1754 | 1.2976 | 1.2383 | | Super_SloMo | 6 | 0.9996 | 0.9958 | 0.8864 | 0.0 | 1.2911 | 1.257 | | vgg16 | 64 | 0.9996 | 0.9976 | 0.8575 | 0.9945 | 1.2719 | 1.2625 | | Background_Matting | 4 | 0.9999 | 1.0184 | 0.8934 | 1.1155 | 1.2258 | 1.2096 | | alexnet | 128 | 0.9995 | 0.9973 | 0.8153 | 1.003 | 1.2121 | 1.2079 | | timm_vision_transformer_large | 8 | 1.0 | 0.9904 | 0.0 | 0.9936 | 1.1621 | 1.1383 | | hf_Reformer | 4 | 0.9959 | 1.0001 | 0.9451 | 0.0 | 1.158 | 1.153 | | hf_BigBird | 2 | 0.9955 | 0.9189 | 1.0476 | 0.0 | 1.154 | 1.0283 | | yolov3 | 16 | 0.9998 | 0.991 | 0.8032 | 0.8698 | 1.1035 | 1.0801 | | tts_angular | 64 | 0.9869 | 0.9355 | 0.9834 | 0.9909 | 1.0043 | 1.0149 | | demucs | 4 | 1.0004 | 1.0003 | 1.0009 | 0.9991 | 1.0019 | 0.9989 | | nvidia_deeprecommender | 256 | 0.9993 | 0.9964 | 0.6973 | 0.9788 | 0.9901 | 1.0298 | | dlrm | 2048 | 
1.1324 | 1.1034 | 0.0 | 0.0 | 0.9351 | 0.0 | | hf_GPT2_large | 4 | 1.0004 | 0.991 | 0.0 | 0.0 | 0.0 | 1.7437 | | tacotron2 | 64 | 0.9831 | 0.7609 | 1.0061 | 0.0 | 0.0 | 0.883 | | hf_Longformer | 2 | 0.9641 | 0.8775 | 0.8954 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | 
pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_GPT2 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | 
resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_accuracy |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | 0.0000 |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| tts_angular | 2 | pass | pass | pass | pass | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
| timm_efficientdet | 1 | 19.9114 | 44.0837 | nan | nan | 492.8052 | 511.3603 |
| yolov3 | 16 | 2.9509 | 10.458 | 14.4773 | 41.3766 | 439.0747 | 436.3863 |
| hf_T5_large | 2 | 13.8119 | 47.1709 | nan | nan | 230.7633 | 222.5011 |
| speech_transformer | 32 | 1.9497 | 11.2497 | nan | nan | 162.3094 | 159.0526 |
| timm_vision_transformer | 8 | 0.9749 | 5.8412 | 7.9448 | 14.0461 | 150.3482 | 144.8052 |
| attention_is_all_you_need_pytorch | 256 | 1.2653 | 9.3012 | nan | nan | 146.818 | 150.856 |
| timm_vision_transformer_large | 8 | 2.757 | 19.5575 | nan | 37.911 | 143.2689 | 143.5467 |
| timm_resnest | 32 | 0.6011 | 3.4208 | 4.6698 | 42.0817 | 137.1813 | 129.8015 |
| pytorch_stargan | 16 | 0.411 | 2.7576 | 3.6579 | 6.8457 | 105.3448 | 111.7027 |
| BERT_pytorch | 16 | 1.6793 | 9.8107 | nan | nan | 104.3494 | 102.8372 |
| pytorch_struct | 200 | 0.2721 | 1.097 | 1.745 | 5.276 | 80.2653 | 81.8498 |
| fastNLP_Bert | 6 | 1.7408 | 9.3776 | 13.5213 | nan | 72.3646 | 69.3339 |
| hf_GPT2 | 4 | 1.4611 | 7.9839 | nan | nan | 67.4414 | 66.1509 |
| hf_Bart | 4 | 1.7362 | 11.152 | nan | nan | 55.024 | 55.477 |
| densenet121 | 4 | 2.2111 | 16.3732 | 24.8739 | 127.2016 | 53.6837 | 50.6568 |
| hf_T5 | 8 | 2.3589 | 11.0701 | nan | nan | 53.2822 | 50.6323 |
| hf_BigBird | 2 | 8.2266 | 16.8322 | 37.1242 | nan | 49.2829 | 31.4395 |
| mobilenet_v3_large | 32 | 0.9736 | 6.349 | 8.9568 | 72.4257 | 47.9591 | 47.5266 |
| hf_Albert | 8 | 1.3677 | 8.3808 | 12.4608 | nan | 47.7013 | 47.7293 |
| hf_Bert | 4 | 1.5692 | 8.6915 | 12.1088 | nan | 45.7389 | 45.1049 |
| timm_regnet | 32 | 2.3016 | 10.6663 | 24.2721 | 59.3949 | 39.5138 | 39.0703 |
| timm_efficientnet | 32 | 1.8065 | 8.466 | 18.8325 | 69.1885 | 36.8646 | 36.2609 |
| hf_Reformer | 4 | 2.4895 | 5.097 | 9.8331 | nan | 36.4816 | 30.8703 |
| hf_DistilBert | 8 | 0.6209 | 4.1757 | 8.2817 | nan | 34.7598 | 34.8714 |
| timm_nfnet | 128 | 2.0752 | 8.883 | nan | 37.9942 | 33.2343 | 31.3883 |
| mnasnet1_0 | 32 | 0.8436 | 5.8292 | 8.0116 | 43.3539 | 31.7482 | 31.3722 |
| resnext50_32x4d | 8 | 0.9736 | 6.1999 | 8.4015 | 36.2031 | 31.1282 | 32.0186 |
| timm_vovnet | 32 | 1.4893 | 5.5062 | 11.824 | 30.9869 | 31.0935 | 30.1586 |
| resnet50 | 32 | 0.8799 | 6.2202 | 8.514 | 40.7493 | 29.964 | 30.6567 |
| functorch_dp_cifar10 | 64 | 0.3826 | 2.4582 | 3.3723 | 6.311 | 27.1014 | 26.7867 |
| resnet18 | 16 | 0.4581 | 2.3811 | 3.4255 | 23.0229 | 21.9997 | 21.4065 |
| shufflenet_v2_x1_0 | 128 | 0.9845 | 6.6871 | 9.2419 | 37.2394 | 21.5079 | 20.6157 |
| Background_Matting | 4 | 0.9358 | 6.1012 | 8.6269 | 41.4341 | 20.2369 | 19.4603 |
| Super_SloMo | 6 | 1.0597 | 6.1892 | 8.1656 | nan | 20.0702 | 19.6506 |
| mobilenet_v2 | 96 | 0.8432 | 5.7818 | 8.4424 | 40.7764 | 19.8779 | 19.7928 |
| pytorch_unet | 1 | 0.4658 | 2.7325 | 3.7737 | 26.2469 | 9.6139 | 9.1328 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.4277 | 2.8222 | 3.7265 | 4.774 | 9.5293 | 9.2305 |
| LearningToPaint | 96 | 0.4592 | 2.4914 | 3.5797 | 29.9899 | 8.3718 | 8.0385 |
| squeezenet1_1 | 32 | 0.2661 | 1.3923 | 1.9142 | 6.4786 | 5.1781 | 4.8207 |
| nvidia_deeprecommender | 256 | 0.1953 | 0.6331 | 0.9235 | 2.9178 | 4.8321 | 4.3927 |
| vgg16 | 64 | 0.1828 | 0.9539 | 1.4132 | 3.6292 | 4.363 | 4.0061 |
| drq | 1 | 0.1592 | 0.6488 | 0.9825 | 4.3666 | 4.1608 | 3.6575 |
| dlrm | 2048 | 0.4597 | 1.0084 | nan | nan | 3.847 | nan |
| soft_actor_critic | 256 | 0.2128 | 0.4323 | 0.6095 | 2.0178 | 3.6294 | 2.9491 |
| alexnet | 128 | 0.164 | 0.5874 | 0.8993 | 3.1328 | 3.4081 | 3.2135 |
| dcgan | 32 | 0.174 | 0.5412 | 0.8098 | 4.2474 | 2.8988 | 2.6858 |
| lennard_jones | 1000 | 0.1533 | 0.4318 | 0.6178 | 1.4628 | 2.2326 | 1.9707 |
| tts_angular | 64 | 0.2265 | 0.2958 | 0.4262 | 1.0385 | 1.8991 | 1.6791 |
| demucs | 4 | 0.3415 | 0.3463 | 0.3455 | 0.3616 | 0.2601 | 0.2504 |
| hf_GPT2_large | 4 | 5.4141 | 25.1179 | nan | nan | nan | 153.4359 |
| tacotron2 | 64 | 17.0671 | 32.0887 | 54.1322 | nan | nan | 108.7741 |
| hf_Longformer | 2 | 6.1566 | 16.1726 | 84.5126 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2719 | 0.7887 | 1.2042 | 1.2318 |
| hf_Albert | 8 | 0.9814 | 0.936 | 0.3268 | nan | 1.1576 | 1.4693 |
| Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3842 | nan | 1.0536 | 1.1475 |
| timm_nfnet | 128 | 0.9693 | 0.8982 | nan | 0.9445 | 1.0337 | 1.1245 |
| timm_efficientdet | 1 | 1.028 | 0.8404 | nan | nan | 1.0226 | 1.0403 |
| mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9117 | 1.0074 | 1.0232 |
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0002 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 |
| attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | nan | nan | 0.9829 | 1.1269 |
| BERT_pytorch | 16 | 1.0 | 0.8825 | nan | nan | 0.9728 | 1.1006 |
| hf_GPT2 | 4 | 0.9706 | 0.8847 | nan | nan | 0.9648 | 1.1252 |
| Background_Matting | 4 | 1.0138 | 0.9624 | 0.3723 | 0.9813 | 0.9316 | 0.9364 |
| hf_T5 | 8 | 0.9678 | 0.9331 | nan | nan | 0.9309 | 1.2521 |
| timm_regnet | 32 | 0.9953 | 0.8446 | 0.3494 | 0.85 | 0.9249 | 0.9292 |
| speech_transformer | 32 | 1.0017 | 0.9174 | nan | nan | 0.9066 | 0.9109 |
| yolov3 | 16 | 0.9908 | 0.8381 | 0.3537 | 0.8244 | 0.8991 | 0.9038 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8609 | 0.4238 | 0.8441 | 0.8861 | 0.982 |
| timm_vision_transformer_large | 8 | 0.9974 | 0.8357 | nan | 0.8494 | 0.879 | 0.9542 |
| timm_resnest | 32 | 0.9868 | 0.8711 | 0.3481 | 0.8623 | 0.8759 | 0.9953 |
| densenet121 | 4 | 0.9857 | 0.8678 | 0.3667 | 0.8376 | 0.8753 | 0.9535 |
| hf_Bert | 4 | 0.9844 | 0.8753 | 0.3903 | nan | 0.8736 | 0.9414 |
| pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3571 | 0.8496 | 0.8678 | 0.8715 |
| fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8661 | 1.0348 |
| resnet50 | 32 | 0.9907 | 0.8629 | 0.3562 | 0.7995 | 0.8659 | 0.885 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3458 | 0.7589 | 0.8611 | 0.8951 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3573 | 0.8503 | 0.856 | 0.8927 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 |
| hf_DistilBert | 8 | 0.9505 | 0.8806 | 0.3414 | nan | 0.8387 | 0.9058 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.7073 | 0.8283 | 0.8738 |
| hf_Bart | 4 | 0.9102 | 0.8125 | nan | nan | 0.8137 | 0.9762 |
| hf_BigBird | 2 | 0.9837 | 0.9784 | 0.4544 | nan | 0.8098 | 1.096 |
| alexnet | 128 | 0.951 | 0.7753 | 0.4793 | 0.7753 | 0.7974 | 0.9099 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3446 | 0.866 | 0.7918 | 0.8145 |
| pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4253 | 0.8882 | 0.7783 | 0.8847 |
| resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3882 | 0.8176 | 0.7644 | 0.7753 |
| mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3408 | 0.8207 | 0.7541 | 0.7741 |
| drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8308 | 0.752 | 0.9256 |
| timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3405 | 0.7742 | 0.7513 | 0.761 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.7172 | 0.7491 | 0.7534 |
| soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4736 | 0.9149 | 0.7295 | 1.0367 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3826 | 0.6722 | 0.7295 | 0.8017 |
| timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3916 | 0.8871 | 0.7151 | 0.7249 |
| dlrm | 2048 | 0.7301 | 0.7306 | nan | nan | 0.704 | nan |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.3947 | 0.7276 | 0.6102 | 0.6257 |
| lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 1.0967 | 0.564 | 0.9991 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | 0.8452 | 0.4478 | 0.4806 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5082 | 0.4235 | 0.4307 |
| hf_Reformer | 4 | 0.3764 | 0.9847 | 0.3481 | nan | 0.3629 | 0.9878 |
| hf_GPT2_large | 4 | 0.9582 | 0.8718 | nan | nan | nan | 1.1351 |
| tacotron2 | 64 | 0.9866 | 0.4047 | 0.3143 | nan | nan | 0.4114 |
| hf_Longformer | 2 | 0.9734 | 0.967 | 0.3492 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~
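
As a reading aid for the Peak Memory Compression Ratio tables, here is a minimal sketch of how such a ratio could be computed. It assumes the ratio is eager peak memory divided by the backend's peak memory, so values above 1 mean the backend used less memory than eager; that definition is an assumption about the benchmark harness, and `compression_ratio` is a hypothetical helper, not a function taken from the benchmark scripts.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Eager peak memory divided by backend peak memory (assumed definition).

    Returns NaN when the backend measurement is missing or non-positive,
    mirroring the `nan` cells in the tables.
    """
    if backend_peak_bytes is None or not backend_peak_bytes > 0:
        return float("nan")
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical numbers: eager peaks at 8 GiB, the compiled run at 10 GiB.
ratio = compression_ratio(8 * 2**30, 10 * 2**30)
print(f"{ratio:.4f}")  # 0.8000: the backend used more memory than eager
```

Under this reading, the roughly 0.3 to 0.4 values in the `aot_cudagraphs` column would correspond to about 2.5x to 3x the eager peak memory.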

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0235 | 0.8558 | 0.0 | 0.0 | 5.483 | 1.6487 |
| MobileBertForMaskedLM | 32 | 1.0252 | 0.8276 | 0.0 | 0.0 | 5.2736 | 1.7228 |
| CamemBert | 1 | 1.0387 | 0.8591 | 1.7215 | 0.0 | 3.653 | 1.8276 |
| MT5ForConditionalGeneration | 8 | 1.0188 | 0.8671 | 0.0 | 0.0 | 3.6158 | 2.5334 |
| MobileBertForQuestionAnswering | 64 | 1.0186 | 0.8299 | 0.0 | 0.0 | 3.6056 | 1.7828 |
| DistillGPT2 | 1 | 1.0273 | 0.8953 | 1.2248 | 0.0 | 3.1234 | 2.0305 |
| M2M100ForConditionalGeneration | 8 | 1.0422 | 0.866 | 1.2047 | 0.0 | 2.6847 | 1.8301 |
| PLBartForConditionalGeneration | 16 | 1.0137 | 0.8405 | 0.0 | 0.0 | 2.3305 | 1.7503 |
| MegatronBertForQuestionAnswering | 16 | 1.0292 | 0.8507 | 1.0548 | 0.0 | 2.2397 | 1.804 |
| GPT2ForSequenceClassification | 4 | 1.0015 | 0.9769 | 0.0 | 0.0 | 2.1507 | 2.1149 |
| XGLMForCausalLM | 8 | 1.0138 | 0.8239 | 0.0 | 0.0 | 2.0552 | 1.7342 |
| ElectraForQuestionAnswering | 64 | 1.0003 | 0.9802 | 0.7604 | 0.0 | 1.9541 | 1.9093 |
| MBartForConditionalGeneration | 16 | 1.0141 | 0.8508 | 0.0 | 0.0 | 1.8315 | 1.6062 |
| ElectraForCausalLM | 32 | 1.0002 | 0.9429 | 0.7162 | 0.0 | 1.8026 | 1.8009 |
| MegatronBertForCausalLM | 16 | 1.0335 | 0.8507 | 0.9584 | 0.0 | 1.7953 | 1.7105 |
| LayoutLMForSequenceClassification | 16 | 1.0007 | 0.9808 | 0.7769 | 0.0 | 1.7445 | 1.7013 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.8859 | 0.0 | 0.0 | 1.6775 | 1.666 |
| T5Small | 1 | 1.0245 | 0.8955 | 0.0 | 0.0 | 1.6644 | 1.47 |
| AlbertForMaskedLM | 4 | 1.0001 | 0.8853 | 0.0 | 0.0 | 1.6623 | 1.6537 |
| PegasusForConditionalGeneration | 16 | 1.0134 | 0.8306 | 0.8996 | 0.0 | 1.6588 | 1.6502 |
| Speech2Text2ForCausalLM | 128 | 1.0043 | 0.9374 | 0.7216 | 0.0 | 1.6199 | 1.5933 |
| T5ForConditionalGeneration | 4 | 1.0036 | 0.9327 | 0.0 | 0.0 | 1.6029 | 1.598 |
| LayoutLMForMaskedLM | 16 | 1.0005 | 0.9722 | 0.757 | 0.0 | 1.5899 | 1.5607 |
| OPTForCausalLM | 32 | 1.0119 | 0.9294 | 0.0 | 0.0 | 1.5854 | 1.5609 |
| DistilBertForQuestionAnswering | 64 | 1.0011 | 0.9688 | 0.7412 | 0.0 | 1.4539 | 1.4069 |
| BartForConditionalGeneration | 2 | 1.0055 | 0.9714 | 0.0 | 0.0 | 1.4529 | 1.4264 |
| BartForCausalLM | 4 | 1.0009 | 0.9695 | 0.0 | 0.0 | 1.4515 | 1.451 |
| BertForQuestionAnswering | 128 | 1.0001 | 0.9849 | 0.7783 | 0.0 | 1.4282 | 1.4067 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0059 | 0.9211 | 0.0 | 0.0 | 1.4264 | 1.4402 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.9838 | 0.7757 | 0.0 | 1.4241 | 1.4018 |
| RobertaForCausalLM | 64 | 1.0005 | 0.9599 | 0.7532 | 0.0 | 1.4198 | 1.391 |
| BertForMaskedLM | 64 | 0.9996 | 0.9589 | 0.7407 | 0.0 | 1.3266 | 1.3117 |
| PLBartForCausalLM | 32 | 1.0059 | 0.9421 | 0.7969 | 0.0 | 1.3165 | 1.3156 |
| BlenderbotSmallForCausalLM | 64 | 1.0006 | 0.9275 | 0.0 | 0.0 | 1.2917 | 1.3053 |
| DistilBertForMaskedLM | 64 | 1.0005 | 0.952 | 0.7092 | 0.0 | 1.2725 | 1.271 |
| DebertaForMaskedLM | 4 | 0.9355 | 0.736 | 0.8188 | 0.0 | 1.2636 | 1.1694 |
| PegasusForCausalLM | 32 | 1.0016 | 0.9503 | 0.7561 | 0.0 | 1.2438 | 1.1987 |
| MBartForCausalLM | 32 | 0.9998 | 0.9506 | 0.0 | 0.0 | 1.2097 | 1.2113 |
| TrOCRForCausalLM | 32 | 1.0008 | 0.9411 | 0.0 | 0.0 | 1.2065 | 1.2055 |
| BigBird | 1 | 0.9782 | 0.9113 | 1.0508 | 0.0 | 1.1571 | 1.034 |
| DebertaForQuestionAnswering | 8 | 0.9954 | 0.9043 | 0.7223 | 0.0 | 1.1551 | 1.2349 |
| AllenaiLongformerBase | 1 | 0.9551 | 0.7384 | 0.8532 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| DebertaForMaskedLM | 4 | 5.3017 | 12.4916 | 47.5691 | nan | 211.3213 | 131.3429 |
| DebertaForQuestionAnswering | 8 | 5.0697 | 12.4939 | 46.8481 | nan | 204.9258 | 135.5152 |
| XGLMForCausalLM | 8 | 2.7407 | 17.1885 | nan | nan | 193.4528 | 191.6258 |
| M2M100ForConditionalGeneration | 8 | 3.4286 | 20.9146 | 34.7878 | nan | 146.7786 | 133.1317 |
| YituTechConvBert | 1 | 2.3925 | 13.6824 | nan | nan | 136.8308 | 137.1263 |
| MobileBertForMaskedLM | 32 | 9.0942 | 39.9107 | nan | nan | 109.5767 | 108.128 |
| MobileBertForQuestionAnswering | 64 | 9.2437 | 40.3418 | nan | nan | 99.942 | 98.5995 |
| MT5ForConditionalGeneration | 8 | 3.5479 | 16.2442 | nan | nan | 99.4221 | 95.6132 |
| MegatronBertForCausalLM | 16 | 3.5299 | 18.0083 | 26.2207 | nan | 72.8288 | 72.383 |
| PegasusForConditionalGeneration | 16 | 3.3212 | 21.1437 | 33.1442 | nan | 70.9152 | 66.9883 |
| MegatronBertForQuestionAnswering | 16 | 3.7279 | 18.494 | 26.5553 | nan | 70.3712 | 70.4324 |
| MBartForConditionalGeneration | 16 | 3.5303 | 21.7799 | nan | nan | 68.0877 | 66.4723 |
| BartForConditionalGeneration | 2 | 3.3485 | 21.9928 | nan | nan | 66.7221 | 66.7763 |
| T5ForConditionalGeneration | 4 | 2.3184 | 10.9554 | nan | nan | 63.1977 | 62.3629 |
| T5Small | 1 | 2.2923 | 10.8059 | nan | nan | 60.7041 | 58.9361 |
| LayoutLMForSequenceClassification | 16 | 1.9684 | 9.4242 | 13.1009 | nan | 59.8057 | 62.4836 |
| PLBartForConditionalGeneration | 16 | 1.7535 | 11.1872 | nan | nan | 53.0942 | 50.2183 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.1494 | 15.3537 | nan | nan | 50.8133 | 49.8389 |
| BigBird | 1 | 8.0811 | 16.8812 | 36.2673 | nan | 48.8828 | 32.245 |
| ElectraForCausalLM | 32 | 1.6602 | 8.9457 | 12.6281 | nan | 47.458 | 46.0467 |
| BertForMaskedLM | 64 | 1.5157 | 8.8146 | 12.2412 | nan | 40.1186 | 39.4486 |
| LayoutLMForMaskedLM | 16 | 2.0147 | 9.3845 | 13.6177 | nan | 39.7156 | 38.7913 |
| ElectraForQuestionAnswering | 64 | 1.6444 | 9.0466 | 12.8008 | nan | 37.2773 | 36.2002 |
| RobertaForCausalLM | 64 | 1.5418 | 8.898 | 12.8775 | nan | 35.6049 | 34.4923 |
| GPT2ForSequenceClassification | 4 | 1.5018 | 7.9184 | nan | nan | 34.5853 | 32.6152 |
| PegasusForCausalLM | 32 | 1.277 | 8.0313 | 12.1633 | nan | 32.796 | 31.1784 |
| BertForQuestionAnswering | 128 | 1.5691 | 8.8396 | 12.3099 | nan | 31.9624 | 31.5318 |
| MBartForCausalLM | 32 | 1.2391 | 8.2621 | nan | nan | 29.9812 | 30.1429 |
| BartForCausalLM | 4 | 1.2927 | 8.1178 | nan | nan | 29.6104 | 28.427 |
| AlbertForMaskedLM | 4 | 1.4322 | 8.5712 | nan | nan | 29.5818 | 28.7253 |
| TrOCRForCausalLM | 32 | 1.2153 | 8.1026 | nan | nan | 29.3851 | 29.4579 |
| DistilBertForMaskedLM | 64 | 0.6051 | 4.2714 | 8.2129 | nan | 29.0096 | 27.6761 |
| RobertaForQuestionAnswering | 128 | 1.5283 | 8.84 | 13.0443 | nan | 28.486 | 28.3453 |
| AlbertForQuestionAnswering | 4 | 1.4291 | 8.5193 | nan | nan | 28.3625 | 27.3942 |
| BlenderbotSmallForCausalLM | 64 | 0.8239 | 5.5965 | nan | nan | 28.1519 | 27.1049 |
| DistilBertForQuestionAnswering | 64 | 0.5867 | 4.3509 | 8.2981 | nan | 27.468 | 26.9617 |
| OPTForCausalLM | 32 | 1.3 | 8.1977 | nan | nan | 26.6569 | 26.3537 |
| DistillGPT2 | 1 | 0.7203 | 3.9956 | 5.4409 | nan | 25.7546 | 29.5859 |
| CamemBert | 1 | 1.6303 | 8.926 | 12.0923 | nan | 25.5323 | 25.1641 |
| Speech2Text2ForCausalLM | 128 | 0.69 | 4.1683 | 6.7967 | nan | 23.0932 | 21.8021 |
| PLBartForCausalLM | 32 | 0.6399 | 4.2166 | 5.8815 | nan | 21.7613 | 21.2058 |
| AllenaiLongformerBase | 1 | 6.5978 | 17.1453 | 84.5909 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.5536 |
| AlbertForMaskedLM | 4 | 0.9998 | 0.7431 | nan | nan | 1.1078 | 1.5319 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | nan | nan | 1.0943 | 1.1562 |
| GPT2ForSequenceClassification | 4 | 0.9675 | 0.9164 | nan | nan | 1.0779 | 1.1637 |
| PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | nan | 1.0189 | 1.089 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0005 | 1.0676 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0005 | 1.0676 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9527 | nan | nan | 0.995 | 1.2292 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 0.9943 | 1.0278 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 0.9938 | 1.0704 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9035 | nan | nan | 0.9913 | 1.1976 |
| T5Small | 1 | 1.0 | 0.8935 | nan | nan | 0.9874 | 1.15 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3662 | nan | 0.9871 | 1.0263 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | nan | nan | 0.9868 | 1.0636 |
| OPTForCausalLM | 32 | 0.9996 | 0.8679 | nan | nan | 0.9838 | 1.0755 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3787 | nan | 0.9811 | 1.0366 |
| RobertaForCausalLM | 64 | 0.9991 | 0.8994 | 0.3788 | nan | 0.9801 | 1.0358 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | nan | nan | 0.9642 | 1.0376 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | nan | nan | 0.9593 | 1.1105 |
| DistilBertForMaskedLM | 64 | 0.9999 | 0.8599 | 0.3635 | nan | 0.948 | 1.0272 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.3532 | nan | 0.946 | 1.0791 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | nan | nan | 0.9335 | 1.0986 |
| ElectraForCausalLM | 32 | 0.9996 | 0.848 | 0.357 | nan | 0.9319 | 1.0177 |
| BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | nan | nan | 0.9269 | 1.0441 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3979 | nan | 0.9214 | 1.0168 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | nan | nan | 0.919 | 0.919 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.962 | 0.4377 | nan | 0.9159 | 1.0993 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3466 | nan | 0.9129 | 1.0128 |
| MegatronBertForCausalLM | 16 | 0.9998 | 0.8597 | 0.4044 | nan | 0.9036 | 1.0277 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0093 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.8769 | nan | nan | 0.8775 | 1.0294 |
| BigBird | 1 | 1.0008 | 0.9547 | 0.448 | nan | 0.8348 | 1.1049 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9234 | nan | nan | 0.8333 | 1.0324 |
| DistillGPT2 | 1 | 0.9963 | 0.8033 | 0.4019 | nan | 0.8228 | 1.0239 |
| CamemBert | 1 | 0.9989 | 0.8143 | 0.4161 | nan | 0.8157 | 0.9312 |
| YituTechConvBert | 1 | 0.9718 | 0.8091 | nan | nan | 0.8103 | 0.9318 |
| M2M100ForConditionalGeneration | 8 | 0.9967 | 0.9558 | 0.4308 | nan | 0.7739 | 1.0609 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | nan | nan | 0.6997 | 0.9454 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | nan | nan | 0.6085 | 0.8221 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9824 | 0.3623 | nan | 0.4498 | 1.1123 |
| DebertaForQuestionAnswering | 8 | 0.9754 | 1.0737 | 0.3252 | nan | 0.3361 | 1.1932 |
| AllenaiLongformerBase | 1 | 0.9977 | 0.9473 | 0.3844 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
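
To summarize a speedup column from one of these tables, the following is a minimal sketch that parses a pipe-delimited table and takes a geometric mean. It assumes the table text has one row per line, and it treats `0.0` and `nan` cells as failed runs to be skipped, which is an assumption about how the dashboard encodes failures. `parse_table` and `geomean_speedup` are hypothetical helper names, not part of the benchmark scripts; the sample rows are copied from the huggingface speedup table.

```python
import math


def parse_table(text):
    """Parse a '+---+' bordered, pipe-delimited table into a list of dict rows."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines, fences, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def geomean_speedup(rows, column):
    """Geometric mean of a column, skipping 0.0, nan, and non-numeric cells."""
    values = []
    for row in rows:
        try:
            value = float(row[column])
        except (KeyError, ValueError):
            continue
        if value > 0:  # False for 0.0 and for nan
            values.append(value)
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


sample = """
+---+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---+
| CamemBert | 1 | 1.0387 | 0.8591 | 1.7215 | 0.0 | 3.653 | 1.8276 |
| BigBird | 1 | 0.9782 | 0.9113 | 1.0508 | 0.0 | 1.1571 | 1.034 |
| AllenaiLongformerBase | 1 | 0.9551 | 0.7384 | 0.8532 | 0.0 | 0.0 | 0.0 |
+---+
"""

rows = parse_table(sample)
print(geomean_speedup(rows, "inductor"))
```

Skipping failed runs makes the mean describe only the models that ran, so it is worth reporting the pass count alongside it.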

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| xcit_large_24_p8_224 | 5 | 1.0018 | 0.0 | 0.0 | 0.0 | 2.5138 | 1.8289 |
| tnt_s_patch16_224 | 128 | 0.9999 | 0.9981 | 0.0 | 1.9506 | 2.1263 | 2.0891 |
| regnety_002 | 128 | 0.9744 | 0.9297 | 1.1188 | 1.379 | 2.0693 | 1.443 |
| ghostnet_100 | 128 | 1.0031 | 0.9955 | 0.9047 | 1.5366 | 2.0586 | 1.7338 |
| lcnet_050 | 128 | 0.9657 | 0.9508 | 0.8505 | 1.6006 | 1.9979 | 1.6225 |
| twins_pcpvt_base | 64 | 1.0043 | 0.929 | 0.9247 | 1.3578 | 1.7285 | 1.6727 |
| res2net101_26w_4s | 64 | 1.0026 | 0.9912 | 0.9458 | 1.4055 | 1.5971 | 1.3453 |
| volo_d1_224 | 64 | 0.9997 | 0.9941 | 0.0 | 1.1424 | 1.5964 | 1.5611 |
| hrnet_w18 | 128 | 1.0027 | 1.0165 | 0.8612 | 1.462 | 1.5862 | 1.4754 |
| dla102 | 128 | 1.0 | 0.9959 | 0.8367 | 1.4156 | 1.5807 | 1.5512 |
| gmlp_s16_224 | 128 | 0.9997 | 0.9957 | 0.0 | 1.0462 | 1.5572 | 1.483 |
| gmixer_24_224 | 128 | 0.9999 | 0.8806 | 0.0 | 0.9759 | 1.552 | 1.5077 |
| nfnet_l0 | 128 | 0.9995 | 0.8102 | 0.7106 | 1.0396 | 1.5407 | 1.4638 |
| resnest101e | 64 | 0.9997 | 0.9913 | 0.8118 | 1.2515 | 1.5353 | 1.4285 |
| swin_base_patch4_window7_224 | 64 | 0.9997 | 0.9614 | 0.0 | 1.0473 | 1.5177 | 1.5196 |
| gluon_inception_v3 | 128 | 0.9999 | 0.9966 | 0.8529 | 1.1961 | 1.5086 | 1.4737 |
| adv_inception_v3 | 128 | 1.0 | 0.9962 | 0.8534 | 1.1958 | 1.5061 | 1.4705 |
| inception_v3 | 128 | 0.9999 | 0.9966 | 0.8525 | 1.1959 | 1.5011 | 1.4684 |
| dm_nfnet_f0 | 128 | 0.9984 | 1.0002 | 0.0 | 1.1784 | 1.5 | 1.4274 |
| cait_m36_384 | 4 | 1.0005 | 0.0 | 0.0 | 0.0 | 1.4668 | 1.4168 |
| res2net50_14w_8s | 128 | 1.0 | 0.9943 | 0.809 | 1.2818 | 1.4665 | 1.4075 |
| mobilenetv3_large_100 | 128 | 0.956 | 0.9445 | 0.7814 | 1.3446 | 1.4602 | 1.4405 |
| fbnetv3_b | 128 | 0.9519 | 0.9425 | 0.7754 | 1.2585 | 1.4533 | 1.4099 |
| crossvit_9_240 | 128 | 1.0001 | 0.9955 | 0.8377 | 1.0601 | 1.4527 | 1.4207 |
| selecsls42b | 128 | 0.9999 | 0.9953 | 0.8416 | 1.3602 | 1.4427 | 1.4143 |
| coat_lite_mini | 128 | 1.0002 | 0.9961 | 0.8471 | 1.2053 | 1.442 | 1.4073 |
| resmlp_12_224 | 128 | 1.0002 | 0.9988 | 0.7822 | 0.0 | 1.4328 | 1.3808 |
| mnasnet_100 | 128 | 0.9529 | 0.9444 | 0.7826 | 1.3715 | 1.4324 | 1.4573 |
| res2next50 | 128 | 0.9994 | 0.9963 | 0.8299 | 1.2133 | 1.423 | 1.3521 |
| mobilenetv2_100 | 128 | 0.9521 | 0.9416 | 0.7211 | 0.8669 | 1.4033 | 1.4338 |
| jx_nest_base | 32 | 0.9995 | 0.9918 | 0.0 | 1.2269 | 1.399 | 1.3677 |
| mobilevit_s | 64 | 0.9735 | 0.8144 | 0.6564 | 1.1126 | 1.3811 | 1.3686 |
| ese_vovnet19b_dw | 128 | 0.9705 | 0.9648 | 0.7668 | 1.2432 | 1.3783 | 1.3797 |
| spnasnet_100 | 128 | 0.9451 | 0.9378 | 0.7758 | 1.3175 | 1.3679 | 1.3958 |
| pit_b_224 | 64 | 0.9999 | 0.9959 | 0.8217 | 1.0632 | 1.3611 | 1.356 |
| fbnetc_100 | 128 | 0.9524 | 0.9434 | 0.7898 | 1.3769 | 1.3555 | 1.3749 |
| convit_base | 64 | 1.0001 | 0.9963 | 0.0 | 0.0 | 1.3487 | 1.3738 |
| tf_efficientnet_b0 | 128 | 0.9641 | 0.8075 | 0.6671 | 1.097 | 1.3481 | 1.3537 |
| poolformer_m36 | 64 | 0.9998 | 0.9984 | 0.8061 | 0.0 | 1.33 | 1.2968 |
| botnet26t_256 | 128 | 0.9801 | 0.9735 | 0.813 | 1.3482 | 1.3266 | 1.3346 |
| cspdarknet53 | 64 | 0.9422 | 0.9335 | 0.7555 | 0.8991 | 1.3197 | 1.3472 |
| pnasnet5large | 16 | 1.0064 | 1.029 | 0.8464 | 1.141 | 1.2979 | 1.2717 |
| mixer_b16_224 | 128 | 1.0002 | 0.9981 | 0.803 | 0.9424 | 1.2925 | 1.273 |
| eca_botnext26ts_256 | 128 | 0.9805 | 0.8116 | 0.6713 | 1.1588 | 1.2898 | 1.2888 |
| beit_base_patch16_224 | 64 | 0.9999 | 0.979 | 0.0 | 1.0454 | 1.2856 | 1.2679 |
| deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9915 | 0.797 | 1.0625 | 1.2816 | 1.259 |
| rexnet_100 | 128 | 0.9646 | 0.8617 | 0.6894 | 1.0374 | 1.2793 | 1.2779 |
| tinynet_a | 128 | 0.9573 | 0.803 | 0.6597 | 1.0815 | 1.2604 | 1.2685 |
| visformer_small | 128 | 0.9998 | 1.0019 | 0.8358 | 1.0878 | 1.237 | 1.1852 |
| sebotnet33ts_256 | 64 | 0.9665 | 0.8377 | 0.68 | 1.117 | 1.2153 | 1.2087 |
| tf_mixnet_l | 128 | 0.9806 | 0.9094 | 0.7951 | 1.0607 | 1.1977 | 1.1912 |
| vit_base_patch16_224 | 64 | 0.9999 | 0.9942 | 0.8346 | 0.994 | 1.1906 | 1.1831 |
| mixnet_l | 128 | 0.9799 | 0.9055 | 0.7956 | 1.0635 | 1.1813 | 1.1788 |
| gluon_xception65 | 32 | 0.9995 | 0.9891 | 0.7528 | 1.0657 | 1.161 | 1.1259 |
| dpn107 | 32 | 0.9406 | 0.9253 | 0.7475 | 0.9913 | 1.1584 | 1.1784 |
| swsl_resnext101_32x16d | 32 | 0.9997 | 0.9823 | 0.8058 | 1.0769 | 1.14 | 1.0574 |
| repvgg_a2 | 128 | 0.944 | 0.9335 | 0.7984 | 1.1271 | 1.1395 | 1.1567 |
| gernet_l | 128 | 0.9466 | 0.9376 | 0.767 | 1.1446 | 1.0663 | 1.0791 |
| convmixer_768_32 | 32 | 0.9999 | 0.9974 | 0.9228 | 1.0532 | 1.0563 | 1.0508 |
| convnext_base | 64 | 0.9994 | 0.9938 | 0.0 | 1.2022 | 0.6576 | 0.6429 |
| eca_halonext26ts | 128 | 0.9816 | 0.8173 | 0.678 | 1.1505 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| jx_nest_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| twins_pcpvt_base | 64 | 2.7886 | 19.2789 | 30.8811 | 71.6398 | 450.4032 | 449.0663 |
| coat_lite_mini | 128 | 1.1898 | 6.925 | 10.2699 | 32.4009 | 415.0195 | 407.8104 |
| mobilevit_s | 64 | 1.8334 | 9.5546 | 18.1252 | 63.5064 | 335.9776 | 326.7777 |
| sebotnet33ts_256 | 64 | 1.7766 | 7.7854 | 16.9464 | 67.7216 | 320.9297 | 322.9218 |
| eca_botnext26ts_256 | 128 | 1.4388 | 6.4959 | 12.5318 | 61.6816 | 263.1896 | 256.6624 |
| xcit_large_24_p8_224 | 5 | 3.2266 | nan | nan | nan | 190.9474 | 191.3842 |
| botnet26t_256 | 128 | 1.3802 | 5.9108 | 11.4357 | 49.5446 | 190.114 | 192.5791 |
| swin_base_patch4_window7_224 | 64 | 2.9328 | 16.2208 | nan | 70.4276 | 184.8036 | 181.5733 |
| jx_nest_base | 32 | 1.7719 | 12.5102 | nan | 50.3768 | 174.1079 | 171.8631 |
| convnext_base | 64 | 1.4105 | 8.8881 | nan | 35.5731 | 156.2029 | 155.9408 |
| cait_m36_384 | 4 | 3.3274 | nan | nan | nan | 149.7085 | 146.6911 |
| hrnet_w18 | 128 | 6.1479 | 40.5858 | 71.0524 | 452.9955 | 128.3186 | 122.6451 |
| crossvit_9_240 | 128 | 1.6769 | 11.0596 | 16.4989 | 36.3303 | 122.6924 | 123.3807 |
| resnest101e | 64 | 3.4105 | 21.9181 | 33.9868 | 106.6682 | 122.1278 | 123.3805 |
| volo_d1_224 | 64 | 1.3075 | 9.8414 | nan | 38.6123 | 104.5584 | 101.7564 |
| pnasnet5large | 16 | 4.8228 | 29.9114 | 50.9646 | 188.9162 | 101.2345 | 98.7592 |
| visformer_small | 128 | 0.9721 | 5.4105 | 8.2006 | 30.7881 | 93.9667 | 93.7884 |
| pit_b_224 | 64 | 1.1349 | 6.8399 | 10.4533 | 25.4554 | 88.4051 | 87.0922 |
| gmlp_s16_224 | 128 | 1.2469 | 9.6414 | nan | 22.149 | 77.7271 | 75.0659 |
| res2net101_26w_4s | 64 | 3.2425 | 22.1744 | 35.0968 | 118.4702 | 67.1696 |
62.6511 | | tnt_s_patch16_224 | 128 | 2.0086 | 14.8546 | nan | 37.5385 | 62.7691 | 59.8617 | | res2net50_14w_8s | 128 | 2.9692 | 19.6685 | 30.6779 | 136.2903 | 60.9233 | 58.4898 | | gmixer_24_224 | 128 | 1.4329 | 10.6824 | nan | 28.1379 | 60.2775 | 58.6871 | | convit_base | 64 | 1.2215 | 7.9288 | nan | nan | 57.9251 | 56.8513 | | gluon_xception65 | 32 | 2.1486 | 14.5975 | 21.7491 | 64.2962 | 54.6983 | 51.7244 | | poolformer_m36 | 64 | 1.9552 | 11.256 | 17.2506 | nan | 51.0105 | 48.5998 | | swsl_resnext101_32x16d | 32 | 1.8227 | 12.6355 | 18.686 | 52.7151 | 47.8203 | 45.3611 | | dpn107 | 32 | 4.0616 | 17.56 | 50.857 | 101.5033 | 47.3077 | 45.4352 | | resmlp_12_224 | 128 | 0.7259 | 3.948 | 7.9433 | nan | 45.4439 | 40.973 | | fbnetv3_b | 128 | 3.3446 | 14.3348 | 36.4534 | 99.9885 | 44.8979 | 41.3769 | | deit_base_distilled_patch16_224 | 64 | 0.9803 | 6.0124 | 8.8429 | 14.5051 | 43.952 | 43.516 | | vit_base_patch16_224 | 64 | 0.9723 | 6.2255 | 8.905 | 14.0467 | 42.9831 | 42.7723 | | mixer_b16_224 | 128 | 0.9758 | 4.8173 | 8.0926 | 16.2722 | 41.0523 | 40.3492 | | gluon_inception_v3 | 128 | 1.6418 | 11.4686 | 17.0664 | 97.0568 | 40.4963 | 38.521 | | mixnet_l | 128 | 5.4487 | 15.1933 | 30.9701 | 87.7111 | 40.1174 | 36.4016 | | adv_inception_v3 | 128 | 1.6324 | 11.5373 | 17.2605 | 98.9266 | 39.9888 | 37.6633 | | inception_v3 | 128 | 1.6635 | 11.4382 | 17.4671 | 98.1895 | 39.8173 | 37.7171 | | beit_base_patch16_224 | 64 | 1.2378 | 6.9376 | nan | 18.6737 | 39.6996 | 37.3079 | | tf_mixnet_l | 128 | 5.7701 | 15.5782 | 32.1528 | 87.1081 | 39.5924 | 37.4604 | | dla102 | 128 | 1.8418 | 12.9528 | 19.5944 | 86.5435 | 38.8311 | 36.0248 | | convmixer_768_32 | 32 | 1.2436 | 8.6071 | 12.3605 | 17.559 | 38.2188 | 36.946 | | ghostnet_100 | 128 | 3.0034 | 12.3307 | 17.2161 | 90.2264 | 38.0134 | 36.3602 | | res2next50 | 128 | 1.6145 | 10.8424 | 16.4392 | 84.3658 | 34.8033 | 33.5014 | | dm_nfnet_f0 | 128 | 2.1292 | 9.0767 | nan | 38.1385 | 34.1963 | 32.037 | | rexnet_100 | 128 | 1.9563 | 
9.4335 | 20.6214 | 117.2395 | 31.6527 | 30.2589 | | tinynet_a | 128 | 2.109 | 10.2195 | 23.9285 | 78.7993 | 31.0464 | 29.3305 | | cspdarknet53 | 64 | 2.3291 | 9.5562 | 22.8968 | 40.8709 | 28.9134 | 26.8114 | | tf_efficientnet_b0 | 128 | 1.8966 | 8.7482 | 19.4162 | 78.0265 | 27.5474 | 26.2992 | | nfnet_l0 | 128 | 1.9244 | 9.0907 | 13.0128 | 34.939 | 26.3739 | 24.8758 | | fbnetc_100 | 128 | 2.1772 | 8.3721 | 21.0565 | 59.8188 | 26.2984 | 24.6009 | | spnasnet_100 | 128 | 2.1478 | 8.2724 | 20.1714 | 57.1241 | 25.5721 | 24.1208 | | mobilenetv3_large_100 | 128 | 1.8146 | 7.2825 | 15.7371 | 82.3897 | 24.112 | 22.9963 | | repvgg_a2 | 128 | 2.0068 | 7.8114 | 17.9164 | 61.7912 | 21.9794 | 20.1666 | | mobilenetv2_100 | 128 | 1.7899 | 6.804 | 15.4585 | 40.3552 | 21.9682 | 20.8764 | | gernet_l | 128 | 1.9591 | 7.9609 | 18.4196 | 44.163 | 21.7658 | 20.5759 | | regnety_002 | 128 | 1.6211 | 7.3888 | 16.13 | 56.2891 | 21.4296 | 21.058 | | mnasnet_100 | 128 | 1.7603 | 7.0147 | 16.5129 | 50.5861 | 21.34 | 20.5011 | | selecsls42b | 128 | 0.8253 | 5.0917 | 7.5991 | 50.2702 | 18.9059 | 18.1065 | | lcnet_050 | 128 | 1.061 | 4.4009 | 8.5809 | 38.2764 | 15.3221 | 14.9451 | | ese_vovnet19b_dw | 128 | 1.0646 | 4.1179 | 7.956 | 39.0037 | 14.9559 | 14.0328 | | eca_halonext26ts | 128 | 1.4734 | 6.4076 | 13.5735 | 65.8208 | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | tinynet_a | 128 | 0.9889 | 0.7884 | 0.2764 | 0.7887 | 1.3707 | 1.4015 | | gmixer_24_224 | 128 | 0.9926 | 0.9699 | nan | 0.9029 | 1.3139 | 
1.3772 | | gmlp_s16_224 | 128 | 0.9938 | 0.9715 | nan | 0.9188 | 1.2841 | 1.2998 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.8392 | 1.173 | 1.1918 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3634 | 1.1722 | 1.1607 | 1.2789 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | 0.7848 | 1.1578 | 1.2186 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | 0.8648 | 1.1475 | 1.1687 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2669 | 0.776 | 1.1068 | 1.2101 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1021 | 1.1162 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | 0.9418 | 1.0703 | 1.1492 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | 0.9685 | 1.0556 | 1.0626 | | convit_base | 64 | 0.9966 | 0.8516 | nan | nan | 1.0528 | 1.1534 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | nan | 0.8587 | 1.0379 | 1.1081 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | nan | 0.9443 | 1.0336 | 1.124 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2681 | 0.8142 | 1.0333 | 1.0762 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9129 | 1.0048 | 1.021 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | 0.9298 | 1.0004 | 1.0447 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 0.8179 | 0.9746 | 1.2067 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | 0.9714 | 0.9746 | 0.9788 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | 0.802 | 0.9699 | 1.0818 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.315 | 0.79 | 0.9645 | 0.9776 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.9026 | 0.9489 | 0.9832 | | dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9381 | 0.9431 | 0.9502 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3469 | 0.8884 | 0.9382 | 1.0521 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9319 | 0.9931 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2875 | 0.8365 | 0.9314 | 1.0486 | | cait_m36_384 | 4 | 0.9998 | nan | nan | nan | 0.929 | 0.9775 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | 0.8487 | 0.9112 | 0.9354 | | mixer_b16_224 | 128 | 0.992 
| 0.9574 | 0.3472 | 0.7555 | 0.9089 | 0.9818 | | dpn107 | 32 | 0.997 | 0.9097 | 0.3531 | 0.8814 | 0.9072 | 0.9596 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3348 | 0.8581 | 0.8969 | 0.938 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | 0.8524 | 0.8964 | 0.9224 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.8641 | 0.8948 | 0.916 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8772 | 0.8927 | 0.9188 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | 0.8854 | 0.8924 | 0.8971 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 0.8801 | 0.8916 | 0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 0.8794 | 0.8911 | 0.8966 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.9146 | 0.8905 | 0.9028 | | convnext_base | 64 | 1.003 | 0.9263 | nan | 0.7349 | 0.8852 | 0.9866 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8538 | 0.8845 | 0.8998 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8538 | 0.8845 | 0.8998 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8538 | 0.8845 | 0.8998 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.8299 | 0.876 | 0.9007 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8285 | 0.8697 | 0.8972 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2718 | 0.7737 | 0.8653 | 0.9722 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8158 | 0.862 | 0.8897 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8602 | 0.8784 | | cspdarknet53 | 64 | 0.9913 | 0.8405 | 0.3241 | 0.7908 | 0.8512 | 0.8583 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7708 | 0.8503 | 0.898 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.8252 | 0.8503 | 0.8698 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7352 | 0.8387 | 0.8542 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3361 | 0.7559 | 0.8309 | 0.8769 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7515 | 0.8245 | 0.8627 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 0.8842 | 0.8174 | 1.0986 | | 
resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | nan | 0.8092 | 0.8236 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3515 | 0.6593 | 0.8006 | 1.035 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.6789 | 0.7903 | 0.8279 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | 0.8451 | 0.7566 | 0.9252 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.7354 | 0.745 | 0.8293 | | jx_nest_base | 32 | 0.9983 | 0.8927 | nan | 0.86 | 0.6708 | 0.8619 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.267 | 0.7762 | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/udu4L3i.png)
bench_logs/timm_models_amp.png : ![](https://i.imgur.com/vFE6Cb0.png)
bench_logs/torchbench_amp.png : ![](https://i.imgur.com/oKHcS3P.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring performance, compilation latency and memory footprint reduction, we exclude the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 82%, 46/56 | 100%, 43/43 | 59%, 36/61  |
|       aot_eager        | 79%, 44/56 | 100%, 43/43 | 56%, 34/61  |
|     aot_cudagraphs     | 64%, 36/56 | 49%, 21/43  |  11%, 7/61  |
|    nvprims_nvfuser     | 48%, 27/56 |  0%, 0/43   |  15%, 9/61  |
|        inductor        | 71%, 40/56 | 93%, 40/43  | 56%, 34/61  |
| inductor_no_cudagraphs | 79%, 44/56 | 93%, 40/43  | 56%, 34/61  |
+------------------------+------------+-------------+-------------+
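The `NN%, passed/total` cells above can be reproduced from the per-model accuracy statuses. The sketch below is illustrative only: the helper name `pass_rate` is hypothetical, and it assumes that only the literal status `pass` counts as passing. How the dashboard treats `pass_due_to_skip` is not stated here.

```python
def pass_rate(statuses):
    """Format a pass rate as 'NN%, passed/total' from per-model status strings."""
    total = len(statuses)
    # Assumption: only the literal status "pass" counts as passing.
    passed = sum(1 for s in statuses if s == "pass")
    # Round to the nearest whole percent; an empty suite reports 0%.
    pct = round(100 * passed / total) if total else 0
    return f"{pct}%, {passed}/{total}"


# Hypothetical status list: 46 passing models out of 56.
statuses = ["pass"] * 46 + ["fail_to_run"] * 8 + ["fail_accuracy"] * 2
print(pass_rate(statuses))
```

With 46 of 56 models passing, this yields the same `82%, 46/56` format as the eager/torchbench cell.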

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.05x    |    1.02x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    0.0x     |    1.16x    |
|        inductor        |   1.39x    |    1.29x    |    1.23x    |
| inductor_no_cudagraphs |   1.22x    |    1.21x    |    1.23x    |
+------------------------+------------+-------------+-------------+
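A geometric mean like the one reported above can be sketched as follows. This is not the dashboard's actual script. It assumes that per-model speedup is eager latency divided by backend latency, and that a speedup of 0.0 marks a model that failed to run or failed accuracy and is therefore dropped. The function names are hypothetical.

```python
import math


def speedup(eager_latency, backend_latency):
    """Speedup of a backend, normalized against native (eager) PyTorch."""
    return eager_latency / backend_latency


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups; 0.0 if no model remains."""
    # Assumption: 0.0 (or NaN) marks a failed model, so it is excluded.
    valid = [s for s in speedups if s > 0.0 and not math.isnan(s)]
    if not valid:
        return 0.0
    # exp(mean(log(x))) avoids overflow from multiplying many values.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Three inductor speedups from the torchbench table, plus one failed model.
inductor = [4.2215, 2.6276, 2.5679, 0.0]
print(f"{geomean_speedup(inductor):.2f}x")
```

The geometric mean suits ratios like these because a 2x speedup and a 0.5x slowdown average out to 1.0x, which an arithmetic mean would not give. Returning 0.0 for an empty set would also be consistent with the `0.0x` cell for nvprims_nvfuser on huggingface, where no model passes.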

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.84    |    2.26     |    1.68     |
|       aot_eager        |    7.30    |    10.27    |    10.56    |
|     aot_cudagraphs     |    9.57    |    20.71    |    12.53    |
|    nvprims_nvfuser     |   48.11    |     0.0     |   163.13    |
|        inductor        |   25.45    |    35.22    |    45.24    |
| inductor_no_cudagraphs |   25.56    |    30.09    |    43.82    |
+------------------------+------------+-------------+-------------+
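The per-model latency tables use `nan` for models that did not run on a given backend. One plausible way to get a suite-level mean from such a column is sketched below. It assumes that `nan` entries are skipped and that an empty column reports 0.0, and the helper name is hypothetical.

```python
import math


def mean_compile_time(latencies):
    """Arithmetic mean of compilation latencies in seconds, skipping NaN."""
    # Assumption: NaN marks a model that failed on this backend.
    valid = [t for t in latencies if not math.isnan(t)]
    return sum(valid) / len(valid) if valid else 0.0


# aot_cudagraphs latencies for yolov3, hf_T5_large (nan) and timm_resnest,
# taken from the torchbench compilation latency table.
aot_cudagraphs = [12.9709, float("nan"), 4.0615]
print(f"{mean_compile_time(aot_cudagraphs):.2f}")
```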

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.95x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.90x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.81x    |    0.0x     |    0.85x    |
|        inductor        |   0.81x    |    0.71x    |    0.95x    |
| inductor_no_cudagraphs |   0.93x    |    0.96x    |    1.01x    |
+------------------------+------------+-------------+-------------+
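Since higher is better, the compression ratio is read here as eager peak memory divided by backend peak memory. That definition is an assumption, not something the tables spell out. A minimal sketch with a hypothetical helper:

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Peak-memory compression ratio; above 1.0 means the backend uses less memory."""
    # Assumption: ratio = eager peak / backend peak, so higher is better.
    if backend_peak_bytes <= 0:
        return float("nan")  # no measurement available for this backend
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical measurement: eager peaks at 10 GiB, the backend at 8 GiB.
GiB = 1024 ** 3
print(compression_ratio(10 * GiB, 8 * GiB))
```

In a real run the two inputs might come from `torch.cuda.max_memory_allocated()` after `torch.cuda.reset_peak_memory_stats()`. Whether the dashboard measures memory this way is not confirmed here.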

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0065 | 0.9994 | 1.7788 | 0.6951 | 4.2215 | 1.4165 | | timm_vision_transformer | 8 | 1.0049 | 0.917 | 1.513 | 0.0 | 2.6276 | 1.3953 | | functorch_dp_cifar10 | 64 | 0.9948 | 0.95 | 1.4724 | 0.0 | 2.5679 | 1.3391 | | pytorch_struct | 200 | 0.987 | 0.7373 | 0.9244 | 0.8028 | 1.816 | 1.146 | | lennard_jones | 1000 | 0.9577 | 0.8263 | 1.0156 | 0.6741 | 1.7521 | 0.9432 | | mobilenet_v3_large | 32 | 1.0079 | 1.1149 | 0.9325 | 0.8735 | 1.7329 | 1.4215 | | hf_Albert | 8 | 1.0014 | 0.9976 | 0.7522 | 0.0 | 1.6495 | 1.6428 | | resnext50_32x4d | 8 | 1.0015 | 1.1296 | 0.9759 | 0.7562 | 1.6427 | 1.3294 | | shufflenet_v2_x1_0 | 128 | 0.9983 | 1.0146 | 0.7677 | 0.9192 | 1.5531 | 1.4015 | | speech_transformer | 32 | 1.0079 | 0.9259 | 1.5245 | 0.0 | 1.5424 | 1.5415 | | timm_resnest | 32 | 0.9993 | 1.0014 | 0.8048 | 1.1876 | 1.5178 | 1.4529 | | hf_GPT2 | 4 | 1.0106 | 0.9799 | 0.7392 | 0.4039 | 1.5023 | 1.5005 | | timm_nfnet | 128 | 1.0 | 0.9998 | 0.0 | 1.2567 | 1.4758 | 1.4237 | | resnet18 | 16 | 1.0039 | 1.1001 | 0.8931 | 0.8823 | 1.4715 | 1.2451 | | mobilenet_v2_quantized_qat | 96 | 1.0013 | 0.9747 | 0.0 | 1.4397 | 1.4333 | 1.4305 | | mobilenet_v2 | 96 | 0.9999 | 1.0 | 0.7315 | 0.0 | 1.4304 | 1.4022 | | fastNLP_Bert | 6 | 0.9984 | 0.9764 | 0.7533 | 0.0 | 1.4277 | 1.3965 | | soft_actor_critic | 256 | 0.9781 | 0.7691 | 1.0582 | 0.6291 | 1.4205 | 0.8968 | | hf_T5_large | 2 | 1.0229 | 0.8542 | 0.0 | 0.0 | 1.3981 | 1.3967 | | resnet50_quantized_qat | 32 | 1.0007 | 0.9582 | 0.0 | 1.2034 | 1.383 | 1.3819 | | mnasnet1_0 | 32 | 0.9995 | 1.0675 | 0.812 | 0.9628 | 1.3794 | 
1.2884 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9944 | 0.9496 | 0.9673 | 0.8159 | 1.3694 | 1.2645 | | squeezenet1_1 | 32 | 0.9954 | 1.011 | 0.8215 | 0.7852 | 1.3678 | 1.3026 | | dcgan | 32 | 0.9818 | 1.0022 | 1.0001 | 0.7516 | 1.3119 | 1.0534 | | hf_Bart | 4 | 1.0107 | 0.9698 | 0.0 | 0.0 | 1.2663 | 1.2053 | | hf_Bert | 4 | 1.031 | 0.9917 | 0.737 | 0.0 | 1.21 | 1.1902 | | LearningToPaint | 96 | 0.9992 | 1.0012 | 0.809 | 1.0171 | 1.2093 | 1.1738 | | resnet50 | 32 | 0.9991 | 0.9882 | 0.7607 | 1.0841 | 1.2052 | 1.1682 | | pytorch_unet | 1 | 0.9996 | 0.9979 | 0.8464 | 1.0893 | 1.1984 | 1.1881 | | Super_SloMo | 6 | 0.9997 | 0.9977 | 0.8665 | 1.0026 | 1.1802 | 1.1661 | | hf_DistilBert | 8 | 0.9999 | 0.9538 | 0.6855 | 0.0 | 1.1761 | 1.1816 | | vgg16 | 64 | 0.9998 | 0.999 | 0.8583 | 0.9982 | 1.1725 | 1.1669 | | alexnet | 128 | 0.9993 | 0.9981 | 0.8032 | 1.0022 | 1.1618 | 1.1634 | | Background_Matting | 4 | 0.9999 | 1.0215 | 0.863 | 1.0826 | 1.1186 | 1.11 | | pytorch_stargan | 16 | 0.999 | 0.9836 | 0.8572 | 0.0 | 1.1156 | 1.0957 | | hf_Reformer | 4 | 0.9963 | 0.0 | 0.9199 | 0.0 | 1.105 | 1.1283 | | yolov3 | 16 | 0.9997 | 0.995 | 0.7934 | 1.1952 | 1.095 | 1.0821 | | hf_BigBird | 2 | 0.9905 | 0.9386 | 0.9544 | 0.0 | 1.0936 | 0.998 | | attention_is_all_you_need_pytorch | 256 | 1.0001 | 0.971 | 0.0 | 0.0 | 1.0661 | 1.0524 | | timm_vision_transformer_large | 8 | 0.9998 | 0.9953 | 0.0 | 0.0 | 1.0459 | 1.0329 | | tts_angular | 64 | 0.9881 | 0.9561 | 0.9839 | 0.9734 | 1.0082 | 1.002 | | demucs | 4 | 0.9996 | 0.9999 | 0.9999 | 1.0002 | 0.9994 | 0.9997 | | nvidia_deeprecommender | 256 | 0.9983 | 0.9627 | 0.584 | 0.858 | 0.904 | 0.9639 | | dlrm | 2048 | 0.0 | 0.0 | 0.0 | 1.09 | 0.0 | 1.0697 | | hf_T5 | 8 | 1.0015 | 0.8183 | 0.0 | 0.0 | 0.0 | 1.1069 | | tacotron2 | 64 | 0.9725 | 0.8235 | 0.0 | 0.0 | 0.0 | 0.8981 | | hf_GPT2_large | 4 | 1.0003 | 0.9806 | 0.0 | 0.0 | 0.0 | 1.4766 | | hf_Longformer | 2 | 0.9625 | 0.883 | 0.8164 | 0.0 | 0.0 | 0.0 | | BERT_pytorch | 0 | 0.0 | 0.0 | 0.0 | 0.0 
| 0.0 | 0.0 | | DALLE2_pytorch | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | drq | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_efficientdet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_efficientnet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_regnet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_vovnet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass 
| pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | 
pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | yolov3 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | pass | pass | pass | fail_to_run | fail_to_run | pass | | BERT_pytorch | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy | | DALLE2_pytorch | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | | drq | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.7848 | 9.5886 | 12.9709 
| 117.076 | 368.8999 | 367.7649 | | hf_T5_large | 2 | 13.5072 | 47.4256 | nan | nan | 125.744 | 124.048 | | timm_resnest | 32 | 0.542 | 2.9033 | 4.0615 | 53.3697 | 67.9221 | 68.6143 | | timm_vision_transformer_large | 8 | 2.2759 | 15.9163 | nan | nan | 67.0906 | 64.7414 | | attention_is_all_you_need_pytorch | 256 | 1.1076 | 8.1934 | nan | nan | 56.8414 | 55.306 | | timm_vision_transformer | 8 | 0.7851 | 4.8258 | 6.5654 | nan | 50.9499 | 51.2718 | | pytorch_stargan | 16 | 0.3697 | 2.6485 | 3.4622 | nan | 49.2674 | 48.7708 | | densenet121 | 4 | 2.0651 | 14.8471 | 22.0861 | 201.9622 | 46.5356 | 45.8812 | | hf_BigBird | 2 | 7.4195 | 14.5841 | 30.2365 | nan | 41.1527 | 27.3469 | | pytorch_struct | 200 | 0.2409 | 0.8894 | 1.5967 | 5.5141 | 36.7635 | 36.8974 | | resnet50_quantized_qat | 32 | 1.1107 | 10.0884 | nan | 173.4417 | 32.221 | 32.275 | | hf_Bart | 4 | 1.4184 | 9.017 | nan | nan | 31.849 | 30.1628 | | mobilenet_v3_large | 32 | 0.8483 | 5.3585 | 7.5099 | 96.0824 | 30.4483 | 29.4493 | | timm_nfnet | 128 | 2.0146 | 8.503 | nan | 162.6666 | 29.8705 | 29.2796 | | fastNLP_Bert | 6 | 1.4754 | 7.4864 | 11.0104 | nan | 29.1192 | 27.8205 | | mobilenet_v2_quantized_qat | 96 | 1.2546 | 10.1663 | nan | 196.7506 | 28.4796 | 28.487 | | speech_transformer | 32 | 1.5911 | 9.5136 | 58.1918 | nan | 28.2649 | 27.4044 | | hf_Reformer | 4 | 2.4185 | nan | 9.8505 | nan | 27.7055 | 22.0631 | | mnasnet1_0 | 32 | 0.7518 | 4.9919 | 6.9672 | 66.0724 | 23.3112 | 21.8372 | | resnext50_32x4d | 8 | 0.8349 | 5.4511 | 7.5616 | 67.8433 | 22.7538 | 20.6724 | | hf_Albert | 8 | 1.0435 | 6.7164 | 9.5072 | nan | 22.3824 | 21.7073 | | resnet50 | 32 | 0.8084 | 5.6553 | 7.443 | 73.0 | 22.0201 | 22.0886 | | hf_Bert | 4 | 1.3516 | 7.009 | 9.7999 | nan | 20.7511 | 20.063 | | hf_GPT2 | 4 | 1.246 | 6.8805 | 10.0977 | 73.9618 | 20.5855 | 20.1218 | | shufflenet_v2_x1_0 | 128 | 0.8815 | 6.1806 | 8.3016 | 83.2298 | 19.5393 | 18.1063 | | Super_SloMo | 6 | 0.9893 | 5.3455 | 7.0804 | 34.2139 | 17.801 | 17.1429 | | 
Background_Matting | 4 | 0.8287 | 5.2383 | 7.7041 | 58.6919 | 17.7819 | 16.3323 | | mobilenet_v2 | 96 | 0.7376 | 5.1021 | 7.2056 | nan | 17.2772 | 17.5484 | | functorch_dp_cifar10 | 64 | 0.3891 | 2.2968 | 3.1644 | nan | 16.9364 | 16.5698 | | hf_DistilBert | 8 | 0.4481 | 3.5967 | 6.4314 | nan | 14.4217 | 13.7809 | | resnet18 | 16 | 0.3897 | 2.1039 | 2.8888 | 28.146 | 13.4634 | 14.0684 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.3597 | 2.5366 | 3.2709 | 25.9788 | 8.5797 | 8.3874 | | pytorch_unet | 1 | 0.4317 | 2.3488 | 3.1857 | 25.5648 | 8.5192 | 8.2043 | | LearningToPaint | 96 | 0.4187 | 2.2007 | 3.0445 | 35.9666 | 7.5207 | 7.3226 | | squeezenet1_1 | 32 | 0.2209 | 1.0642 | 1.5533 | 5.0214 | 4.2396 | 3.9154 | | vgg16 | 64 | 0.1892 | 0.7272 | 1.1591 | 3.9105 | 3.5836 | 3.4519 | | nvidia_deeprecommender | 256 | 0.1923 | 0.4788 | 0.7278 | 5.9264 | 3.4807 | 3.216 | | soft_actor_critic | 256 | 0.1989 | 0.3589 | 0.5134 | 2.4526 | 3.3991 | 2.8458 | | alexnet | 128 | 0.1441 | 0.4498 | 0.7233 | 3.6339 | 3.0565 | 2.7715 | | dcgan | 32 | 0.1657 | 0.4822 | 0.7116 | 4.2342 | 2.7374 | 2.5303 | | lennard_jones | 1000 | 0.1377 | 0.3196 | 0.5378 | 2.1834 | 2.0265 | 1.8215 | | tts_angular | 64 | 0.2077 | 0.2735 | 0.4572 | 1.0477 | 1.9005 | 1.7362 | | demucs | 4 | 0.3077 | 0.3135 | 0.3232 | 0.3089 | 0.2151 | 0.2173 | | tacotron2 | 64 | 17.5224 | 31.345 | nan | nan | nan | 68.0676 | | hf_GPT2_large | 4 | 4.9526 | 21.4987 | nan | nan | nan | 48.6542 | | hf_T5 | 8 | 2.243 | 11.5182 | nan | nan | nan | 29.3693 | | dlrm | 2048 | nan | nan | nan | 4.487 | nan | 3.2098 | | hf_Longformer | 2 | 6.012 | 16.301 | 79.761 | nan | nan | nan | | BERT_pytorch | 0 | nan | nan | nan | nan | nan | nan | | DALLE2_pytorch | 0 | nan | nan | nan | nan | nan | nan | | drq | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientdet | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientnet | 0 | nan | nan | nan | nan | nan | nan | | timm_regnet | 0 | nan | nan 
| nan | nan | nan | nan |
| timm_vovnet | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| resnet50_quantized_qat | 32 | 0.9971 | 0.9148 | nan | 0.8492 | 1.4304 | 1.4304 |
| mobilenet_v2_quantized_qat | 96 | 0.9961 | 0.8279 | nan | 0.8271 | 1.404 | 1.404 |
| Super_SloMo | 6 | 1.0023 | 0.9526 | 0.363 | 0.9527 | 1.1857 | 1.1913 |
| mobilenet_v2 | 96 | 0.9923 | 0.7624 | 0.3061 | nan | 1.1003 | 1.11 |
| squeezenet1_1 | 32 | 0.9781 | 0.8163 | 0.3371 | 0.8132 | 1.0821 | 1.1262 |
| speech_transformer | 32 | 0.9982 | 0.9159 | 0.2703 | nan | 1.0395 | 1.042 |
| timm_nfnet | 128 | 0.9358 | 0.8937 | nan | 0.879 | 1.0221 | 1.0495 |
| demucs | 4 | 0.9888 | 0.9884 | 0.9888 | 0.9884 | 0.9884 | 0.9884 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.35 | 0.814 | 0.9789 | 1.0066 |
| hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3702 | 0.8845 | 0.9703 | 1.1094 |
| Background_Matting | 4 | 0.9989 | 0.9483 | 0.3594 | 0.9323 | 0.9204 | 0.9231 |
| yolov3 | 16 | 0.9893 | 0.8384 | 0.3319 | 0.8043 | 0.9089 | 0.9128 |
| pytorch_stargan | 16 | 0.9975 | 1.009 | 0.4108 | nan | 0.9015 | 0.9845 |
| timm_resnest | 32 | 0.9926 | 0.8759 | 0.3223 | 0.7296 | 0.8947 | 0.964 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9976 | 0.9117 | 0.3921 | 0.8949 | 0.8928 | 0.9624 |
| hf_Albert | 8 | 0.9333 | 0.9333 | 0.2846 | nan | 0.8836 | 1.2215 |
| mobilenet_v3_large | 32 | 0.9876 | 0.856 | 0.3277 | 0.7754 | 0.8832 | 0.8974 |
| densenet121 | 4 | 1.0 | 0.8879 | 0.3452 | 0.8612 | 0.8624 | 0.943 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8613 | 0.922 |
| pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8388 | 0.859 | 0.8608 |
| resnet50 | 32 | 0.9945 | 0.8704 | 0.3364 | 0.7952 | 0.8551 | 0.8906 |
| mnasnet1_0 | 32 | 0.9878 | 0.8992 | 0.3334 | 0.8252 | 0.8532 | 0.8671 |
| hf_Bart | 4 | 0.9617 | 0.8777 | nan | nan | 0.8504 | 1.1284 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | nan | 0.8354 | 1.1229 |
| resnext50_32x4d | 8 | 0.9961 | 0.8679 | 0.3585 | 0.8188 | 0.8278 | 0.8346 |
| hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4299 | nan | 0.8211 | 1.0392 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.7903 |
| timm_vision_transformer | 8 | 0.9943 | 0.8874 | 0.3309 | nan | 0.7507 | 0.8213 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9304 | 0.75 | 0.9991 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7449 | 0.743 | 0.8332 |
| hf_Bert | 4 | 0.9683 | 0.9011 | 0.3525 | nan | 0.7061 | 1.0016 |
| LearningToPaint | 96 | 0.9454 | 0.6943 | 0.3399 | 0.627 | 0.6945 | 0.7512 |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6949 | 0.6902 | 0.7049 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3213 | nan | 0.6595 | 0.9466 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6638 | 0.6471 | 0.6497 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | nan | 0.4867 | 0.6508 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.429 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4456 | nan | 0.4056 | 0.4212 |
| hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.299 | 0.9882 |
| tacotron2 | 64 | 0.9906 | 1.0302 | nan | nan | nan | 1.1494 |
| hf_T5 | 8 | 0.9527 | 0.9415 | nan | nan | nan | 1.1434 |
| hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 |
| dlrm | 2048 | nan | nan | nan | 0.7307 | nan | 0.7306 |
| hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2946 | nan | nan | nan |
| BERT_pytorch | 0 | nan | nan | nan | nan | nan | nan |
| DALLE2_pytorch | 0 | nan | nan | nan | nan | nan | nan |
| drq | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
| timm_efficientdet | 0 | nan | nan | nan | nan | nan | nan |
| timm_efficientnet | 0 | nan | nan | nan | nan | nan | nan |
| timm_regnet | 0 | nan | nan | nan | nan | nan | nan |
| timm_vovnet | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0274 | 0.8994 | 0.0 | 0.0 | 3.1609 | 1.4393 |
| DistillGPT2 | 1 | 1.0344 | 0.9148 | 1.0469 | 0.2987 | 2.4099 | 1.8225 |
| CamemBert | 1 | 1.0453 | 0.9298 | 1.3249 | 0.0 | 2.403 | 1.5084 |
| MobileBertForMaskedLM | 32 | 1.0238 | 0.9295 | 0.0 | 0.0 | 2.323 | 1.535 |
| MT5ForConditionalGeneration | 8 | 1.0219 | 0.8758 | 0.0 | 0.0 | 2.2525 | 1.9527 |
| GoogleFnet | 1 | 0.9926 | 0.8024 | 0.9835 | 0.0 | 1.9172 | 1.0977 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9771 | 0.0 | 0.656 | 1.7888 | 1.7767 |
| ElectraForQuestionAnswering | 64 | 1.0002 | 0.9853 | 0.0 | 0.0 | 1.4272 | 1.4082 |
| ElectraForCausalLM | 32 | 1.0003 | 0.9302 | 0.0 | 0.0 | 1.4136 | 1.4512 |
| M2M100ForConditionalGeneration | 8 | 1.0115 | 1.008 | 1.0862 | 0.0 | 1.4108 | 1.3048 |
| MobileBertForQuestionAnswering | 64 | 1.0253 | 0.9099 | 0.0 | 0.0 | 1.403 | 1.2783 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9892 | 0.7372 | 0.0 | 1.3017 | 1.2911 |
| AlbertForQuestionAnswering | 4 | 0.9997 | 1.0023 | 0.0 | 0.0 | 1.2572 | 1.2514 |
| AlbertForMaskedLM | 4 | 1.0008 | 1.0004 | 0.0 | 0.0 | 1.2531 | 1.251 |
| MegatronBertForQuestionAnswering | 16 | 1.038 | 1.011 | 0.7604 | 0.0 | 1.2229 | 1.1182 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9707 | 0.0 | 0.0 | 1.2126 | 1.2151 |
| PLBartForConditionalGeneration | 16 | 1.0146 | 0.9687 | 0.0 | 0.0 | 1.2002 | 1.1992 |
| OPTForCausalLM | 32 | 1.0022 | 0.9316 | 0.0 | 0.0 | 1.1817 | 1.2021 |
| XGLMForCausalLM | 8 | 1.0114 | 0.939 | 0.7997 | 0.0 | 1.1778 | 1.1847 |
| T5Small | 1 | 1.0276 | 0.883 | 0.0 | 0.0 | 1.1776 | 1.1836 |
| DebertaForMaskedLM | 4 | 0.9252 | 0.7847 | 0.732 | 0.0 | 1.1774 | 1.1554 |
| DistilBertForQuestionAnswering | 64 | 0.9997 | 0.9839 | 0.7132 | 0.0 | 1.1708 | 1.1514 |
| RobertaForCausalLM | 64 | 1.0 | 0.9629 | 0.7451 | 0.0 | 1.1458 | 1.1498 |
| MegatronBertForCausalLM | 16 | 1.0334 | 1.0044 | 0.7435 | 0.0 | 1.1312 | 1.1213 |
| Speech2Text2ForCausalLM | 128 | 0.9978 | 0.9276 | 0.6566 | 0.0 | 1.1236 | 1.1489 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9922 | 0.0 | 0.0 | 1.1177 | 1.114 |
| BertForQuestionAnswering | 128 | 0.9996 | 0.9923 | 0.0 | 0.0 | 1.1139 | 1.1072 |
| BartForCausalLM | 4 | 1.0007 | 0.966 | 0.0 | 0.0 | 1.0993 | 1.1104 |
| BartForConditionalGeneration | 2 | 1.0004 | 0.9883 | 0.0 | 0.0 | 1.0985 | 1.0943 |
| MBartForConditionalGeneration | 16 | 1.0117 | 0.9775 | 0.0 | 0.0 | 1.0969 | 1.0807 |
| PegasusForConditionalGeneration | 16 | 1.0104 | 0.9797 | 0.7637 | 0.0 | 1.0882 | 1.0819 |
| BigBird | 1 | 0.9914 | 0.929 | 0.9985 | 0.0 | 1.0849 | 0.9995 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0007 | 0.9404 | 0.0 | 0.0 | 1.0647 | 1.0715 |
| BertForMaskedLM | 64 | 1.0003 | 0.9616 | 0.7302 | 0.0 | 1.0566 | 1.0618 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.9505 | 0.7121 | 0.0 | 1.0529 | 1.0681 |
| DebertaForQuestionAnswering | 8 | 0.9966 | 0.9693 | 0.6845 | 0.0 | 1.0506 | 1.221 |
| PLBartForCausalLM | 32 | 1.0052 | 0.9344 | 0.7164 | 0.0 | 1.0243 | 1.0556 |
| T5ForConditionalGeneration | 4 | 0.9999 | 0.8151 | 0.0 | 0.0 | 1.0187 | 1.0185 |
| BlenderbotSmallForCausalLM | 64 | 1.0011 | 0.9092 | 0.6814 | 0.0 | 1.0081 | 1.0439 |
| TrOCRForCausalLM | 32 | 1.0002 | 0.955 | 0.0 | 0.0 | 1.0043 | 1.0147 |
| MBartForCausalLM | 32 | 1.0013 | 0.9515 | 0.0 | 0.0 | 0.9993 | 1.0109 |
| PegasusForCausalLM | 32 | 0.9993 | 0.953 | 0.7302 | 0.0 | 0.9915 | 1.0055 |
| AllenaiLongformerBase | 1 | 0.9453 | 0.8481 | 0.7819 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 4.6733 | 12.4129 | 46.2723 | nan | 106.2761 | 36.9712 |
| DebertaForMaskedLM | 4 | 4.6037 | 12.0142 | 45.6034 | nan | 102.6161 | 35.206 |
| XGLMForCausalLM | 8 | 2.285 | 13.963 | 43.1733 | nan | 94.4449 | 91.7467 |
| M2M100ForConditionalGeneration | 8 | 2.8676 | 15.3441 | 23.8784 | nan | 75.1194 | 72.388 |
| MobileBertForMaskedLM | 32 | 8.0322 | 31.4395 | nan | nan | 59.6722 | 57.9264 |
| MobileBertForQuestionAnswering | 64 | 8.0065 | 31.3134 | nan | nan | 58.8377 | 56.0146 |
| YituTechConvBert | 1 | 2.1143 | 11.2822 | nan | nan | 54.4228 | 51.3304 |
| PegasusForConditionalGeneration | 16 | 2.6332 | 16.745 | 26.3799 | nan | 48.9728 | 45.368 |
| BartForConditionalGeneration | 2 | 2.8283 | 17.3733 | nan | nan | 48.6188 | 48.0802 |
| MBartForConditionalGeneration | 16 | 2.8415 | 17.2493 | nan | nan | 48.5545 | 45.9787 |
| BigBird | 1 | 7.3519 | 14.4816 | 30.0384 | nan | 41.5339 | 26.1666 |
| MT5ForConditionalGeneration | 8 | 3.5812 | 15.6366 | nan | nan | 40.4273 | 38.4155 |
| MegatronBertForCausalLM | 16 | 2.9745 | 14.6119 | 20.8656 | nan | 37.6765 | 36.3074 |
| MegatronBertForQuestionAnswering | 16 | 2.9706 | 14.6623 | 21.4667 | nan | 37.449 | 35.5244 |
| T5Small | 1 | 2.2716 | 10.8048 | nan | nan | 37.3498 | 36.6486 |
| LayoutLMForSequenceClassification | 16 | 1.6799 | 7.7567 | 11.2872 | nan | 35.2088 | 34.3709 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.7584 | 11.3554 | nan | nan | 34.4824 | 32.566 |
| T5ForConditionalGeneration | 4 | 2.3765 | 10.8948 | nan | nan | 31.9484 | 31.3496 |
| PLBartForConditionalGeneration | 16 | 1.3762 | 9.1425 | nan | nan | 30.8008 | 29.9136 |
| ElectraForCausalLM | 32 | 1.3496 | 7.1978 | nan | nan | 29.8169 | 27.2971 |
| PegasusForCausalLM | 32 | 1.0155 | 6.4968 | 10.2756 | nan | 24.1657 | 22.4839 |
| MBartForCausalLM | 32 | 0.9605 | 6.8031 | nan | nan | 23.2513 | 22.9468 |
| LayoutLMForMaskedLM | 16 | 1.6594 | 7.7534 | nan | nan | 23.2343 | 22.3658 |
| BertForMaskedLM | 64 | 1.3389 | 7.1091 | 10.1501 | nan | 22.9853 | 22.4101 |
| ElectraForQuestionAnswering | 64 | 1.3342 | 7.1468 | nan | nan | 22.5211 | 21.6752 |
| TrOCRForCausalLM | 32 | 1.0458 | 6.6678 | nan | nan | 22.3064 | 21.5718 |
| GoogleFnet | 1 | 0.7909 | 3.8161 | 10.7298 | nan | 22.1676 | 14.5891 |
| BartForCausalLM | 4 | 1.0266 | 6.5242 | nan | nan | 22.1044 | 20.6279 |
| BertForQuestionAnswering | 128 | 1.3416 | 7.1515 | nan | nan | 21.4984 | 20.6009 |
| RobertaForCausalLM | 64 | 1.3662 | 7.1342 | 10.4869 | nan | 21.2916 | 20.4156 |
| RobertaForQuestionAnswering | 128 | 1.372 | 7.2267 | nan | nan | 20.3739 | 19.4477 |
| CamemBert | 1 | 1.4018 | 7.183 | 9.7338 | nan | 20.2566 | 20.0063 |
| OPTForCausalLM | 32 | 1.0485 | 6.6749 | nan | nan | 20.0646 | 19.3042 |
| GPT2ForSequenceClassification | 4 | 1.359 | 7.2231 | nan | 72.6465 | 18.8184 | 18.6945 |
| AlbertForMaskedLM | 4 | 1.0888 | 6.8413 | nan | nan | 18.791 | 17.2833 |
| AlbertForQuestionAnswering | 4 | 0.9827 | 6.6095 | nan | nan | 17.8325 | 16.7729 |
| BlenderbotSmallForCausalLM | 64 | 0.6506 | 4.3553 | 6.4625 | nan | 17.7808 | 16.9874 |
| DistillGPT2 | 1 | 0.653 | 3.5396 | 4.6937 | 44.0522 | 16.6466 | 16.7649 |
| Speech2Text2ForCausalLM | 128 | 0.593 | 3.4546 | 5.5215 | nan | 16.183 | 15.0192 |
| PLBartForCausalLM | 32 | 0.4975 | 3.5231 | 4.8355 | nan | 15.43 | 14.999 |
| DistilBertForMaskedLM | 64 | 0.448 | 3.5612 | 6.431 | nan | 13.5483 | 12.8553 |
| DistilBertForQuestionAnswering | 64 | 0.4582 | 3.5185 | 6.4368 | nan | 12.7988 | 12.223 |
| AllenaiLongformerBase | 1 | 6.0731 | 15.7322 | 80.1044 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 0.8817 | 1.0595 | 1.1224 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | nan | 0.8646 | 1.4039 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | nan | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | nan | 0.842 | 1.3737 |
| BigBird | 1 | 0.999 | 0.9542 | 0.4213 | nan | 0.8224 | 1.0095 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | nan | 0.8157 | 0.9642 |
| DistillGPT2 | 1 | 0.9984 | 0.8115 | 0.3773 | 0.7597 | 0.807 | 0.926 |
| T5Small | 1 | 1.0 | 0.8947 | nan | nan | 0.7934 | 1.0493 |
| ElectraForCausalLM | 32 | 0.9983 | 0.883 | nan | nan | 0.7929 | 0.9036 |
| YituTechConvBert | 1 | 0.9858 | 0.8581 | nan | nan | 0.7893 | 0.8727 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | nan | 0.7817 | 0.9515 |
| PegasusForCausalLM | 32 | 0.9593 | 0.9232 | 0.3909 | nan | 0.7774 | 0.931 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.7711 | 1.1049 |
| GoogleFnet | 1 | 0.9983 | 0.9453 | 0.3715 | nan | 0.7698 | 0.9373 |
| M2M100ForConditionalGeneration | 8 | 1.0 | 0.9809 | 0.3975 | nan | 0.7621 | 1.0093 |
| MT5ForConditionalGeneration | 8 | 1.0034 | 0.8862 | nan | nan | 0.7603 | 0.9397 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | nan | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3615 | nan | 0.7487 | 0.9184 |
| PLBartForConditionalGeneration | 16 | 1.0 | 0.8957 | nan | nan | 0.7397 | 0.9638 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | nan | 0.7381 | 0.9055 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | nan | nan | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | nan | 0.7189 | 1.0294 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | nan | 0.7161 | 0.9247 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | nan | nan | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | nan | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3177 | nan | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | nan | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | nan | nan | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | nan | nan | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.842 | 0.3524 | nan | 0.6775 | 0.9179 |
| OPTForCausalLM | 32 | 0.9982 | 0.8655 | nan | nan | 0.6761 | 0.8847 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | nan | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | nan | 0.6385 | 0.8992 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3642 | nan | 0.6375 | 0.8974 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8355 | nan | nan | 0.4998 | 0.6646 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | nan | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 0.9991 | 0.9843 | 0.3553 | nan | 0.3862 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | nan | 0.2902 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9981 | 0.9515 | 0.321 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
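
For readers who want to summarize these tables themselves, here is a minimal Python sketch that parses a pipe-delimited table like the ones above and computes a geometric-mean speedup for one backend. The function names `parse_table` and `geomean_speedup` are hypothetical, the three sample rows are copied from the huggingface "Performance speedup" table, and treating `0.0` or `nan` as "did not run" is an assumption made for illustration, not something the dashboard states. The geometric mean is one common way to aggregate ratios and is not necessarily how the dashboard itself summarizes results.

```python
import math

# Three rows copied from the huggingface "Performance speedup" table.
SAMPLE = """\
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
| YituTechConvBert | 1 | 1.0274 | 0.8994 | 0.0 | 0.0 | 3.1609 | 1.4393 |
| DistillGPT2 | 1 | 1.0344 | 0.9148 | 1.0469 | 0.2987 | 2.4099 | 1.8225 |
| AllenaiLongformerBase | 1 | 0.9453 | 0.8481 | 0.7819 | 0.0 | 0.0 | 0.0 |
"""


def parse_table(text):
    """Return one dict per data row; border lines and fences are skipped."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # "+----+" borders, "~~~" fences, blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first pipe-delimited line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def geomean_speedup(rows, backend):
    """Geometric mean of a backend's column over rows with a positive value.

    Assumption: 0.0 and nan mark models that did not run, so they are
    excluded. The comparison v > 0 is False for nan as well as for 0.0.
    """
    values = [float(row[backend]) for row in rows]
    values = [v for v in values if v > 0]
    if not values:
        return None
    return math.exp(sum(math.log(v) for v in values) / len(values))


rows = parse_table(SAMPLE)
print(geomean_speedup(rows, "inductor"))
```

Excluding failed models raises the reported mean, so any summary produced this way should be read alongside the count of models that actually ran for that backend.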

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| dm_nfnet_f0 | 128 | 1.0 | 1.0002 | 0.0 | 0.0 | 1.4735 | 1.4264 |
| convnext_base | 64 | 0.9999 | 0.9991 | 0.0 | 0.0 | 1.4727 | 1.4639 |
| hrnet_w18 | 128 | 1.0 | 0.9998 | 0.0 | 0.0 | 1.4184 | 1.3803 |
| dla102 | 128 | 0.9998 | 1.0007 | 0.0 | 0.0 | 1.3851 | 1.37 |
| volo_d1_224 | 64 | 0.9999 | 0.9961 | 0.0 | 0.0 | 1.3833 | 1.3619 |
| nfnet_l0 | 128 | 1.0001 | 0.7889 | 0.0 | 1.2287 | 1.3759 | 1.3274 |
| res2net50_14w_8s | 128 | 0.9998 | 1.0003 | 0.0 | 1.2439 | 1.3551 | 1.3256 |
| xcit_large_24_p8_224 | 5 | 1.0007 | 0.9756 | 0.0 | 0.0 | 1.3514 | 1.3164 |
| adv_inception_v3 | 128 | 1.0 | 0.9975 | 0.0 | 1.1287 | 1.3286 | 1.3079 |
| crossvit_9_240 | 128 | 0.9996 | 0.9996 | 0.0 | 0.0 | 1.3285 | 1.3029 |
| inception_v3 | 128 | 1.0 | 0.999 | 0.0 | 1.1286 | 1.3276 | 1.3078 |
| gluon_inception_v3 | 128 | 0.9999 | 0.999 | 0.0 | 1.1285 | 1.3273 | 1.3071 |
| resnest101e | 64 | 1.0 | 1.0042 | 0.0 | 0.0 | 1.3144 | 1.2722 |
| res2next50 | 128 | 0.9999 | 1.0011 | 0.0 | 1.1754 | 1.3129 | 1.2744 |
| jx_nest_base | 32 | 0.9998 | 0.9953 | 0.0 | 0.0 | 1.2785 | 1.2524 |
| coat_lite_mini | 128 | 0.9998 | 0.9858 | 0.8519 | 0.0 | 1.2715 | 1.2635 |
| selecsls42b | 128 | 1.0 | 1.0004 | 0.8156 | 1.2097 | 1.2682 | 1.2534 |
| gmixer_24_224 | 128 | 0.9998 | 0.8101 | 0.0 | 0.0 | 1.2437 | 1.2275 |
| res2net101_26w_4s | 64 | 0.9999 | 1.0005 | 0.7725 | 1.1515 | 1.2264 | 1.1914 |
| convit_base | 64 | 0.9997 | 0.9989 | 0.0 | 0.0 | 1.211 | 1.2407 |
| twins_pcpvt_base | 64 | 0.9999 | 0.9992 | 0.7538 | 0.0 | 1.2033 | 1.1715 |
| gmlp_s16_224 | 128 | 1.0 | 0.9501 | 0.0 | 0.0 | 1.201 | 1.1884 |
| pit_b_224 | 64 | 0.9998 | 0.999 | 0.0 | 0.0 | 1.1874 | 1.1783 |
| cait_m36_384 | 4 | 0.9999 | 1.0266 | 0.0 | 0.0 | 1.185 | 1.1592 |
| poolformer_m36 | 64 | 0.9998 | 0.9987 | 0.0 | 0.0 | 1.1663 | 1.148 |
| swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9779 | 0.0 | 0.0 | 1.1428 | 1.1282 |
| beit_base_patch16_224 | 64 | 0.9999 | 0.9761 | 0.0 | 0.0 | 1.1119 | 1.103 |
| swsl_resnext101_32x16d | 32 | 1.0 | 1.0004 | 0.0 | 0.0 | 1.1069 | 1.0711 |
| deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9989 | 0.7667 | 0.0 | 1.0958 | 1.0833 |
| vit_base_patch16_224 | 64 | 0.9999 | 0.999 | 0.7661 | 0.0 | 1.0856 | 1.0758 |
| gluon_xception65 | 32 | 0.9998 | 0.9973 | 0.0 | 0.0 | 1.0854 | 1.0747 |
| convmixer_768_32 | 32 | 0.9999 | 1.0 | 0.0 | 0.0 | 1.0784 | 1.0746 |
| mixer_b16_224 | 128 | 0.9998 | 0.9785 | 0.0 | 0.0 | 1.0671 | 1.0623 |
| visformer_small | 128 | 1.0001 | 1.0031 | 0.8001 | 1.0483 | 1.0473 | 1.0133 |
| resmlp_12_224 | 128 | 0.9998 | 0.8551 | 0.6124 | 0.0 | 0.7893 | 0.7996 |
| mnasnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| tf_mixnet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| tf_efficientnet_b0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| spnasnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sebotnet33ts_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| rexnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| repvgg_a2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| regnety_002 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| pnasnet5large | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mobilevit_s | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mobilenetv3_large_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mobilenetv2_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| tnt_s_patch16_224 | 128 | 0.9998 | 0.9995 | 0.0 | 0.0 | 0.0 | 1.5459 |
| mixnet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| lcnet_050 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ghostnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| gernet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| fbnetv3_b | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| fbnetc_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ese_vovnet19b_dw | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| eca_halonext26ts | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| eca_botnext26ts_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| dpn107 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| cspdarknet53 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| botnet26t_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| tinynet_a | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| crossvit_9_240 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | fail_accuracy |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 5.7744 | 35.9932 | nan | nan | 109.042 | 104.8879 |
| swin_base_patch4_window7_224 | 64 | 2.6122 | 14.6004 | nan | nan | 96.2962 | 93.4446 |
| xcit_large_24_p8_224 | 5 | 2.6862 | 19.6497 | nan | nan | 88.3393 | 84.7322 |
| twins_pcpvt_base | 64 | 2.0976 | 14.8342 | 23.7204 | nan | 87.5503 | 85.5358 |
| cait_m36_384 | 4 | 2.6883 | 20.7006 | nan | nan | 75.0498 | 72.2763 |
| convnext_base | 64 | 1.2612 | 6.8385 | nan | nan | 75.0447 | 74.6436 |
| jx_nest_base | 32 | 1.7318 | 10.3587 | nan | nan | 71.0908 | 68.0075 |
| resnest101e | 64 | 2.9392 | 18.7171 | nan | nan | 69.3124 | 66.6202 |
| coat_lite_mini | 128 | 1.0473 | 5.9613 | 8.6703 | nan | 65.1658 | 64.3286 |
| res2net101_26w_4s | 64 | 2.8458 | 19.537 | 30.4307 | 317.19 | 57.5426 | 53.879 |
| res2net50_14w_8s | 128 | 2.6163 | 17.5772 | nan | 315.6623 | 52.2167 | 50.2404 |
| poolformer_m36 | 64 | 1.8245 | 10.9415 | nan | nan | 48.0512 | 45.9206 |
| gmlp_s16_224 | 128 | 0.9645 | 7.3639 | nan | nan | 47.071 | 46.2485 |
| crossvit_9_240 | 128 | 1.3515 | 9.1943 | nan | nan | 45.3486 | 43.4715 |
| volo_d1_224 | 64 | 1.2141 | 8.7407 | nan | nan | 42.0829 | 39.3043 |
| gluon_xception65 | 32 | 1.7561 | 12.4352 | nan | nan | 41.4439 | 38.8379 |
| gmixer_24_224 | 128 | 1.0646 | 8.2843 | nan | nan | 36.0991 | 34.8305 |
| gluon_inception_v3 | 128 | 1.5229 | 10.147 | nan | 130.7291 | 35.9322 | 33.2685 |
| adv_inception_v3 | 128 | 1.5176 | 9.977 | nan | 135.0223 | 35.4115 | 33.3396 |
| inception_v3 | 128 | 1.4614 | 10.3638 | nan | 137.5355 | 35.3531 | 33.0883 |
| swsl_resnext101_32x16d | 32 | 1.6615 | 10.9084 | nan | nan | 35.3218 | 33.7803 |
| dla102 | 128 | 1.6942 | 11.2347 | nan | nan | 33.0332 | 32.0553 |
| convit_base | 64 | 1.0973 | 6.8263 | nan | nan | 32.42 | 30.5673 |
| dm_nfnet_f0 | 128 | 2.0821 | 8.6473 | nan | nan | 31.377 | 29.698 |
| res2next50 | 128 | 1.5542 | 9.7786 | nan | 158.5107 | 30.6184 | 28.4858 |
| convmixer_768_32 | 32 | 1.2125 | 7.1641 | nan | nan | 26.4853 | 25.041 |
| resmlp_12_224 | 128 | 0.627 | 3.2646 | 6.0164 | nan | 26.4765 | 24.6236 |
| visformer_small | 128 | 0.9074 | 5.0025 | 6.7307 | 58.1826 | 26.44 | 25.9071 |
| mixer_b16_224 | 128 | 0.6647 | 3.6984 | nan | nan | 24.8814 | 23.765 |
| nfnet_l0 | 128 | 1.7481 | 8.5147 | nan | 149.0261 | 23.1067 | 22.222 |
| deit_base_distilled_patch16_224 | 64 | 0.8721 | 5.0017 | 7.1931 | nan | 22.4918 | 21.5085 |
| beit_base_patch16_224 | 64 | 1.112 | 6.1711 | nan | nan | 22.3875 | 21.0302 |
| vit_base_patch16_224 | 64 | 0.8238 | 4.9387 | 7.1712 | nan | 22.1871 | 21.2592 |
| pit_b_224 | 64 | 0.9564 | 6.0458 | nan | nan | 19.8434 | 18.8503 |
| selecsls42b | 128 | 0.7507 | 4.4933 | 6.4294 | 66.2942 | 16.8034 | 16.0009 |
| tnt_s_patch16_224 | 128 | 1.5617 | 11.673 | nan | nan | nan | 36.4293 |
| botnet26t_256 | 0 | nan | nan | nan | nan | nan | nan |
| cspdarknet53 | 0 | nan | nan | nan | nan | nan | nan |
| dpn107 | 0 | nan | nan | nan | nan | nan | nan |
| eca_botnext26ts_256 | 0 | nan | nan | nan | nan | nan | nan |
| eca_halonext26ts | 0 | nan | nan | nan | nan | nan | nan |
| ese_vovnet19b_dw | 0 | nan | nan | nan | nan | nan | nan |
| fbnetc_100 | 0 | nan | nan | nan | nan | nan | nan |
| fbnetv3_b | 0 | nan | nan | nan | nan | nan | nan |
| gernet_l | 0 | nan | nan | nan | nan | nan | nan |
| ghostnet_100 | 0 | nan | nan | nan | nan | nan | nan |
| lcnet_050 | 0 | nan | nan | nan | nan | nan | nan |
| mixnet_l | 0 | nan | nan | nan | nan | nan | nan |
| mnasnet_100 | 0 | nan | nan | nan | nan | nan | nan |
| mobilenetv2_100 | 0 | nan | nan | nan | nan | nan | nan |
| mobilenetv3_large_100 | 0 | nan | nan | nan | nan | nan | nan |
| mobilevit_s | 0 | nan | nan | nan | nan | nan | nan |
| pnasnet5large | 0 | nan | nan | nan | nan | nan | nan |
| regnety_002 | 0 | nan | nan | nan | nan | nan | nan |
| repvgg_a2 | 0 | nan | nan | nan | nan | nan | nan |
| rexnet_100 | 0 | nan | nan | nan | nan | nan | nan |
| sebotnet33ts_256 | 0 | nan | nan | nan | nan | nan | nan |
| spnasnet_100 | 0 | nan | nan | nan | nan | nan | nan |
| tf_efficientnet_b0 | 0 | nan | nan | nan | nan | nan | nan |
| tf_mixnet_l | 0 | nan | nan | nan | nan | nan | nan |
| tinynet_a | 0 | nan | nan | nan | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | nan | 1.5552 | 1.6267 |
| nfnet_l0 | 128 | 0.993 | 0.8275 | nan | 0.8271 | 1.2906 | 1.3388 |
| cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1185 | 1.1746 |
| poolformer_m36 | 64 | 0.9983 | 0.9509 | nan | nan | 1.0521 | 1.0698 |
| dm_nfnet_f0 | 128 | 0.9357 | 0.894 | nan | nan | 1.0221 | 1.0495 |
| beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 |
| resnest101e | 64 | 0.9971 | 0.9519 | nan | nan | 0.9993 | 1.0025 |
| vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | nan | 0.997 | 1.0835 |
| deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | nan | 0.9925 | 1.0805 |
| twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | nan | 0.9924 | 1.0673 |
| volo_d1_224 | 64 | 0.996 | 0.9213 | nan | nan | 0.9837 | 1.001 |
| convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9836 | 0.9853 |
| mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | nan | 0.9827 | 1.0538 |
| gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | nan | 0.9766 | 0.9827 |
| xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | nan | nan | 0.9633 | 1.0572 |
| dla102 | 128 | 0.9828 | 0.9169 | nan | nan | 0.9489 | 0.9538 |
| hrnet_w18 | 128 | 0.9955 | 0.9252 | nan | nan | 0.9378 | 0.9419 |
| jx_nest_base | 32 | 1.0002 | 0.8966 | nan | nan | 0.9348 | 1.0548 |
| gluon_xception65 | 32 | 0.9975 | 0.9358 | nan | nan | 0.9343 | 0.9368 |
| res2net101_26w_4s | 64 | 0.9967 | 0.9278 | 0.3243 | 0.8769 | 0.93 | 0.9563 |
| convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.9126 | 0.9981 |
| res2next50 | 128 | 0.9955 | 0.9149 | nan | 0.8461 | 0.9075 | 0.9311 |
| swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9069 | 1.0464 |
| visformer_small | 128 | 0.9944 | 0.9374 | 0.3291 | 0.9283 | 0.9029 | 0.9502 |
| selecsls42b | 128 | 0.9885 | 0.8897 | 0.337 | 0.8775 | 0.8987 | 0.919 |
| gluon_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.8985 | 0.9073 |
| inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.8985 | 0.9073 |
| adv_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.8985 | 0.9073 |
| swsl_resnext101_32x16d | 32 | 0.9992 | 0.8965 | nan | nan | 0.8913 | 0.923 |
| res2net50_14w_8s | 128 | 0.995 | 0.9047 | nan | 0.8422 | 0.8821 | 0.9326 |
| pit_b_224 | 64 | 0.9968 | 0.7946 | nan | nan | 0.8563 | 1.0631 |
| coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | nan | 0.8208 | 0.9438 |
| resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | nan | 0.7899 | 0.7979 |
| convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.7463 | 0.9008 |
| crossvit_9_240 | 128 | 0.9884 | 0.8656 | nan | nan | 0.6584 | 0.8854 |
| tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8622 |
| botnet26t_256 | 0 | nan | nan | nan | nan | nan | nan |
| cspdarknet53 | 0 | nan | nan | nan | nan | nan | nan |
| dpn107 | 0 | nan | nan | nan | nan | nan | nan |
| eca_botnext26ts_256 | 0 | nan | nan | nan | nan | nan | nan |
| eca_halonext26ts | 0 | nan | nan | nan | nan | nan | nan |
| ese_vovnet19b_dw | 0 | nan | nan | nan | nan | nan | nan |
| fbnetc_100 | 0 | nan | nan | nan | nan | nan | nan |
| fbnetv3_b | 0 | nan | nan | nan | nan | nan | nan |
| gernet_l | 0 | nan | nan | nan | nan | nan | nan |
| ghostnet_100 | 0 | nan | nan | nan | nan | nan | nan |
| lcnet_050 | 0 | nan | nan | nan | nan | nan | nan |
| mixnet_l | 0 | nan | nan | nan | nan | nan | nan |
| mnasnet_100 | 0 | nan |
nan | nan | nan | nan | nan | | mobilenetv2_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv3_large_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilevit_s | 0 | nan | nan | nan | nan | nan | nan | | pnasnet5large | 0 | nan | nan | nan | nan | nan | nan | | regnety_002 | 0 | nan | nan | nan | nan | nan | nan | | repvgg_a2 | 0 | nan | nan | nan | nan | nan | nan | | rexnet_100 | 0 | nan | nan | nan | nan | nan | nan | | sebotnet33ts_256 | 0 | nan | nan | nan | nan | nan | nan | | spnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | tf_efficientnet_b0 | 0 | nan | nan | nan | nan | nan | nan | | tf_mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | tinynet_a | 0 | nan | nan | nan | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/rPUhkzN.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/xyjolpy.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/NKKwPQP.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint compression ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When computing speedup, compilation latency and memory footprint compression, we exclude the models that fail accuracy checks.
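The exclusion rule can be sketched roughly as follows. This is an illustration only, not the dashboard's actual code: the record layout, the `geomean_speedup` helper, and the choice to count only `pass` as passing are assumptions. The three sample rows are copied from the inductor columns of the torchbench amp tables in this comment.

```python
import math

# Hypothetical record layout: (model name, accuracy status, speedup over eager).
# Values copied from the torchbench amp tables (inductor column).
results = [
    ("densenet121", "pass", 5.156),
    ("resnet18", "pass", 1.9422),
    ("mobilenet_v3_large", "fail_accuracy", 2.4128),
]


def geomean_speedup(rows, passing=("pass",)):
    """Geometric mean of speedups over rows whose accuracy status is passing.

    Assumption: which statuses count as passing (for example whether
    "pass_due_to_skip" is included) is a guess, so it is a parameter.
    """
    kept = [s for _, status, s in rows if status in passing and s > 0]
    if not kept:
        return float("nan")
    # Mean in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


gm = geomean_speedup(results)
print(f"{gm:.2f}x")
```

Here `mobilenet_v3_large` is dropped before aggregation because its accuracy status is `fail_accuracy`, so its 2.4128x speedup does not contribute to the mean.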

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 83%, 45/54 | 100%, 42/42 | 59%, 36/61  |
|       aot_eager        | 83%, 45/54 | 100%, 42/42 | 54%, 33/61  |
|     aot_cudagraphs     | 67%, 36/54 | 57%, 24/42  | 38%, 23/61  |
|    nvprims_nvfuser     | 20%, 11/54 |  5%, 2/42   |  3%, 2/61   |
|        inductor        | 72%, 39/54 | 93%, 39/42  | 57%, 35/61  |
| inductor_no_cudagraphs | 76%, 41/54 | 93%, 39/42  | 57%, 35/61  |
+------------------------+------------+-------------+-------------+
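Each passrate cell pairs a rounded percentage with the raw counts. A minimal sketch of that formatting, assuming plain rounding to the nearest whole percent (the `passrate` helper is hypothetical and not taken from the benchmark scripts):

```python
def passrate(passed: int, total: int) -> str:
    """Format a pass count as '<percent>%, <passed>/<total>'."""
    return f"{100 * passed / total:.0f}%, {passed}/{total}"


# Counts taken from the torchbench column of the table above.
for compiler, passed in [("eager", 45), ("aot_cudagraphs", 36), ("inductor", 39)]:
    print(compiler, passrate(passed, 54))
```

Rounding to whole percents reproduces the cells shown in the table for these counts, for example 45 of 54 models gives `83%, 45/54`.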

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.12x    |    1.06x    |    1.00x    |
|    nvprims_nvfuser     |   1.01x    |    1.00x    |    1.06x    |
|        inductor        |   1.70x    |    1.83x    |    1.44x    |
| inductor_no_cudagraphs |   1.39x    |    1.54x    |    1.39x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.03    |    2.71     |    1.98     |
|       aot_eager        |    8.70    |    14.17    |    13.21    |
|     aot_cudagraphs     |   13.13    |    24.78    |    20.89    |
|    nvprims_nvfuser     |   27.19    |    67.81    |   156.64    |
|        inductor        |   28.85    |    42.06    |    54.36    |
| inductor_no_cudagraphs |   29.20    |    36.56    |    52.35    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.95x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.89x    |    0.90x    |
|     aot_cudagraphs     |   0.42x    |    0.39x    |    0.33x    |
|    nvprims_nvfuser     |   0.75x    |    0.81x    |    0.66x    |
|        inductor        |   0.82x    |    0.91x    |    0.95x    |
| inductor_no_cudagraphs |   0.92x    |    1.08x    |    1.01x    |
+------------------------+------------+-------------+-------------+
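The comment does not spell out how the compression ratio is computed. One reading consistent with "higher is better" is the native PyTorch peak memory divided by the backend's peak memory. The sketch below assumes that definition; the helper name and the byte counts are hypothetical, not measured values.

```python
def compression_ratio(baseline_peak_bytes: int, backend_peak_bytes: int) -> float:
    """Assumed definition: baseline peak memory / backend peak memory.

    A value above 1.0 means the backend used less peak memory than the
    baseline; a value below 1.0 means it used more.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return baseline_peak_bytes / backend_peak_bytes


GIB = 1024 ** 3
# Hypothetical measurements, for illustration only.
print(f"{compression_ratio(10 * GIB, 8 * GIB):.2f}x")   # backend uses less memory
print(f"{compression_ratio(10 * GIB, 25 * GIB):.2f}x")  # backend uses more memory
```

Under this reading, the roughly 0.3x to 0.4x values reported for aot_cudagraphs would mean that backend's peak memory is about 2.5 to 3 times the baseline.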

torchbench suite with amp precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name                              | bs   | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| densenet121                       | 4    | 1.0023 | 0.8786    | 1.8766         | 0.0             | 5.156    | 1.4841                 |
| functorch_dp_cifar10              | 64   | 1.0024 | 0.8741    | 1.705          | 0.0             | 3.4817   | 1.5288                 |
| timm_vision_transformer           | 8    | 1.0047 | 0.8346    | 1.7433         | 0.0             | 3.1568   | 1.5301                 |
| resnext50_32x4d                   | 8    | 1.0054 | 0.9191    | 1.2748         | 0.0             | 2.4375   | 1.3846                 |
| mobilenet_v3_large                | 32   | 1.0058 | 0.9991    | 1.1912         | 0.0             | 2.4128   | 1.5427                 |
| hf_Albert                         | 8    | 1.0007 | 0.9549    | 0.7753         | 0.0             | 2.3821   | 2.326                  |
| hf_T5_large                       | 2    | 1.016  | 0.8199    | 0.0            | 0.0             | 2.3545   | 1.95                   |
| pytorch_CycleGAN_and_pix2pix      | 1    | 1.0005 | 0.8851    | 1.3562         | 0.0             | 2.1698   | 1.5425                 |
| pytorch_struct                    | 200  | 0.9918 | 0.7275    | 1.0011         | 0.6845          | 2.0745   | 1.2781                 |
| mnasnet1_0                        | 32   | 0.9996 | 1.0054    | 1.0185         | 0.0             | 2.0674   | 1.532                  |
| hf_Bart                           | 4    | 1.0129 | 0.8201    | 0.0            | 0.0             | 2.0664   | 1.697                  |
| hf_Bert                           | 4    | 1.032  | 0.84      | 0.9358         | 0.0             | 2.0579   | 1.8521                 |
| lennard_jones                     | 1000 | 0.961  | 0.7242    | 1.278          | 0.4482          | 2.021    | 1.0462                 |
| resnet18                          | 16   | 1.0028 | 0.9856    | 1.12           | 0.0             | 1.9422   | 1.3952                 |
| hf_GPT2                           | 4    | 1.0216 | 0.9801    | 0.0            | 0.2938          | 1.9176   | 1.9127                 |
| timm_resnest                      | 32   | 0.9982 | 0.9863    | 0.8119         | 0.0             | 1.8523   | 1.7115                 |
| squeezenet1_1                     | 32   | 0.9988 | 0.939     | 1.0997         | 0.678           | 1.8193   | 1.4651                 |
| soft_actor_critic                 | 256  | 0.9818 | 0.6974    | 1.3007         | 0.5223          | 1.7091   | 1.0214                 |
| dcgan                             | 32   | 0.9341 | 0.8773    | 1.0859         | 0.0             | 1.7057   | 1.111                  |
| speech_transformer                | 32   | 1.0055 | 0.8367    | 1.9138         | 0.0             | 1.7001   | 1.6927                 |
| hf_T5                             | 8    | 0.9992 | 0.8522    | 0.0            | 0.0             | 1.5835   | 1.5824                 |
| mobilenet_v2                      | 96   | 1.0    | 0.9892    | 0.7584         | 0.0             | 1.5616   | 1.5175                 |
| fastNLP_Bert                      | 6    | 0.9988 | 0.8896    | 0.764          | 0.0             | 1.5511   | 1.4777                 |
| attention_is_all_you_need_pytorch | 256  | 1.0063 | 0.8992    | 0.0            | 0.0             | 1.5497   | 1.4988                 |
| shufflenet_v2_x1_0                | 128  | 1.0003 | 1.0365    | 0.8798         | 0.0             | 1.5175   | 1.394                  |
| hf_DistilBert                     | 8    | 1.0011 | 0.9719    | 0.7425         | 0.0             | 1.5153   | 1.486                  |
| timm_nfnet                        | 128  | 0.9992 | 0.9992    | 0.0            | 1.1158          | 1.5059   | 1.4349                 |
| LearningToPaint                   | 96   | 1.0015 | 0.9925    | 0.903          | 0.0             | 1.4868   | 1.3434                 |
| resnet50                          | 32   | 1.0021 | 1.0202    | 0.8127         | 0.0             | 1.3483   | 1.3074                 |
| pytorch_unet                      | 1    | 0.9991 | 0.9922    | 0.8612         | 0.0             | 1.3423   | 1.3165                 |
| Super_SloMo                       | 6    | 0.9996 | 0.9959    | 0.8842         | 0.0             | 1.287    | 1.2579                 |
| vgg16                             | 64   | 0.9997 | 0.9976    | 0.8575         | 0.9742          | 1.2708   | 1.2669                 |
| Background_Matting                | 4    | 0.9997 | 1.0192    | 0.8909         | 0.0             | 1.236    | 1.2211                 |
| pytorch_stargan                   | 16   | 0.9974 | 1.052     | 0.9019         | 0.0             | 1.2311   | 1.2059                 |
| alexnet                           | 128  | 0.9981 | 0.9976    | 0.8147         | 0.9339          | 1.2097   | 1.2086                 |
| hf_Reformer                       | 4    | 0.9958 | 0.9903    | 0.9367         | 0.0             | 1.1592   | 1.1453                 |
| timm_vision_transformer_large     | 8    | 1.0    | 0.9907    | 0.0            | 0.0             | 1.1589   | 1.1387                 |
| hf_BigBird                        | 2    | 0.9896 | 0.9111    | 1.0556         | 0.0             | 1.1426   | 1.0263                 |
| yolov3                            | 16   | 0.9996 | 0.9895    | 0.7999         | 0.0             | 1.0916   | 1.0669                 |
| tts_angular                       | 64   | 0.9807 | 0.9398    | 0.9872         | 0.9436          | 1.0016   | 1.0275                 |
| demucs                            | 4    | 1.0002 | 1.0002    | 1.0005         | 0.9998          | 1.0004   | 1.0002                 |
| nvidia_deeprecommender            | 256  | 0.9988 | 0.9955    | 0.6966         | 0.9617          | 0.9896   | 1.0301                 |
| drq                               | 0    | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
| timm_regnet                       | 0    | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
| timm_efficientnet                 | 0    | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
| timm_efficientdet                 | 0    | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
| moco                              | 0    | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
| tacotron2                         | 64   | 0.9761 | 0.7311    | 0.9467         | 0.0             | 0.0      | 0.8577                 |
| DALLE2_pytorch                    | 0    | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
| BERT_pytorch                      | 0    | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
| hf_Longformer                     | 2    | 0.9535 | 0.8692    | 0.8828         | 0.0             | 0.0      | 0.0                    |
| dlrm                              | 2048 | 0.9746 | 1.0811    | 0.0            | 0.9951          | 0.0      | 1.0103                 |
| hf_GPT2_large                     | 4    | 1.0003 | 0.9903    | 0.0            | 0.0             | 0.0      | 1.871                  |
| timm_vovnet                       | 0    | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name                              | bs  | eager            | aot_eager        | aot_cudagraphs   | nvprims_nvfuser  | inductor         | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large                     | 2   | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip       |
| timm_vision_transformer_large     | 2   | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip       |
| hf_T5_large                       | 2   | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip       |
| pytorch_CycleGAN_and_pix2pix      | 1   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| fastNLP_Bert                      | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| functorch_dp_cifar10              | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| hf_Albert                         | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| hf_BigBird                        | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| hf_DistilBert                     | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| hf_Reformer                       | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| mnasnet1_0                        | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| mobilenet_v2                      | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| pytorch_stargan                   | 16  | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| timm_nfnet                        | 2   | pass             | pass             | fail_to_run      | pass             | pass             | pass                   |
| shufflenet_v2_x1_0                | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| speech_transformer                | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| timm_efficientnet                 | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| timm_vision_transformer           | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| attention_is_all_you_need_pytorch | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             | pass                   |
| hf_Bart                           | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             | pass                   |
| hf_T5                             | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             | pass                   |
| hf_T5_base                        | 2   | pass             | pass             | fail_to_run      | fail_to_run      | pass             | pass                   |
| Super_SloMo                       | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| hf_Bert                           | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| hf_GPT2                           | 2   | pass             | pass             | fail_to_run      | pass             | pass             | pass                   |
| pytorch_struct                    | 200 | pass             | pass             | pass             | pass             | pass             | pass                   |
| Background_Matting                | 4   | pass             | pass             | pass             | pass             | pass             | pass                   |
| LearningToPaint                   | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| alexnet                           | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| vgg16                             | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| demucs                            | 4   | pass             | pass             | pass             | pass             | pass             | pass                   |
| densenet121                       | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| lennard_jones                     | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| nvidia_deeprecommender            | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| dcgan                             | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| pytorch_unet                      | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| resnet50                          | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| resnext50_32x4d                   | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| soft_actor_critic                 | 256 | pass             | pass             | pass             | pass             | pass             | pass                   |
| squeezenet1_1                     | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| timm_regnet                       | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| timm_resnest                      | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| timm_vovnet                       | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| resnet18                          | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| BERT_pytorch                      | 2   | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run            |
| yolov3                            | 2   | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run            |
| vision_maskrcnn                   | 2   | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run            |
| moco                              | 2   | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run            |
| dlrm                              | 2   | pass             | pass             | fail_to_run      | pass             | fail_to_run      | fail_to_run            |
| timm_efficientdet                 | 2   | pass             | pass             | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run            |
| hf_Longformer                     | 2   | pass             | pass             | pass             | fail_to_run      | fail_to_run      | fail_to_run            |
| tacotron2                         | 2   | pass             | pass             | pass             | fail_to_run      | fail_to_run      | pass                   |
| mobilenet_v3_large                | 2   | pass             | pass             | pass             | fail_to_run      | fail_accuracy    | fail_accuracy          |
| tts_angular                       | 2   | pass             | pass             | pass             | 0.0000           | 0.0000           | 0.0000                 |
| DALLE2_pytorch                    | 0   | 0.0000           | 0.0000           | 0.0000           | 0.0000           | 0.0000           | 0.0000                 |
| drq                               | 0   | 0.0000           | 0.0000           | 0.0000           | 0.0000           | 0.0000           | 0.0000                 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name                              | bs   | eager   | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3                            | 16   | 2.9447  | 11.5267   | 15.4753        | nan             | 411.7791 | 413.9552               |
| hf_T5_large                       | 2    | 14.362  | 55.4962   | nan            | nan             | 143.8094 | 139.4429               |
| timm_vision_transformer_large     | 8    | 2.8636  | 21.9394   | nan            | nan             | 86.4351  | 82.7342                |
| timm_resnest                      | 32   | 0.6057  | 3.8769    | 5.0387         | nan             | 65.0707  | 65.4062                |
| densenet121                       | 4    | 2.2732  | 18.2262   | 26.3447        | nan             | 54.1992  | 52.4994                |
| timm_vision_transformer           | 8    | 0.9501  | 6.61      | 8.6145         | nan             | 52.7048  | 52.4864                |
| pytorch_stargan                   | 16   | 0.4239  | 3.0894    | 3.9668         | nan             | 51.2057  | 47.1337                |
| hf_BigBird                        | 2    | 8.3083  | 18.0849   | 37.989         | nan             | 49.224   | 32.3381                |
| attention_is_all_you_need_pytorch | 256  | 1.3244  | 10.5292   | nan            | nan             | 47.6948  | 47.1628                |
| pytorch_struct                    | 200  | 0.2784  | 1.2253    | 1.8537         | 8.7754          | 42.862   | 33.3088                |
| hf_Bart                           | 4    | 1.8026  | 12.669    | nan            | nan             | 36.6955  | 35.5678                |
| hf_T5                             | 8    | 2.3967  | 12.9731   | nan            | nan             | 34.4738  | 33.9739                |
| fastNLP_Bert                      | 6    | 1.7812  | 10.2124   | 14.5804        | nan             | 33.8739  | 30.7211                |
| timm_nfnet                        | 128  | 2.0615  | 9.9659    | nan            | 160.3832        | 33.5235  | 32.8556                |
| mobilenet_v3_large                | 32   | 0.9679  | 7.0272    | 9.5167         | nan             | 32.7513  | 32.1524                |
| speech_transformer                | 32   | 1.9677  | 12.7251   | 84.768         | nan             | 30.4533  | 31.0915                |
| hf_Reformer                       | 4    | 2.5252  | 5.5078    | 10.0511        | nan             | 27.0294  | 21.8501                |
| hf_Albert                         | 8    | 1.3803  | 9.3822    | 13.3305        | nan             | 25.6159  | 24.6744                |
| mnasnet1_0                        | 32   | 0.8775  | 6.54      | 8.6668         | nan             | 25.1872  | 24.6843                |
| resnet50                          | 32   | 0.9313  | 6.9427    | 9.3825         | nan             | 24.7699  | 24.0866                |
| resnext50_32x4d                   | 8    | 0.9849  | 6.9276    | 9.1832         | nan             | 24.6322  | 24.4558                |
| hf_GPT2                           | 4    | 1.5437  | 8.7046    | nan            | 90.9491         | 23.9999  | 23.313                 |
| hf_Bert                           | 4    | 1.6311  | 9.7812    | 13.0458        | nan             | 23.2266  | 22.9881                |
| shufflenet_v2_x1_0                | 128  | 0.9793  | 7.4444    | 10.2677        | nan             | 22.1722  | 21.5714                |
| Background_Matting                | 4    | 0.9823  | 6.5773    | 9.083          | nan             | 21.1504  | 19.82                  |
| Super_SloMo                       | 6    | 1.0542  | 6.4037    | 8.6658         | nan             | 20.8361  | 20.3062                |
| mobilenet_v2                      | 96   | 0.8824  | 6.4294    | 8.8595         | nan             | 20.3884  | 20.3261                |
| functorch_dp_cifar10              | 64   | 0.385   | 2.8544    | 3.8912         | nan             | 17.7954  | 17.5673                |
| hf_DistilBert                     | 8    | 0.6088  | 4.763     | 9.0096         | nan             | 16.2836  | 16.1508                |
| resnet18                          | 16   | 0.443   | 2.6727    | 3.5539         | nan             | 15.0283  | 14.892                 |
| pytorch_unet                      | 1    | 0.4599  | 3.0059    | 4.0113         | nan             | 9.9505   | 9.9264                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.435   | 3.1634    | 4.1872         | nan             | 9.949    | 9.842                  |
| LearningToPaint                   | 96   | 0.4576  | 2.7518    | 3.7662         | nan             | 8.641    | 8.7077                 |
| squeezenet1_1                     | 32   | 0.2615  | 1.4959    | 2.1021         | 7.9345          | 5.1721   | 5.0823                 |
| vgg16                             | 64   | 0.1812  | 1.0364    | 1.474          | 5.6569          | 4.3625   | 4.1567                 |
| nvidia_deeprecommender            | 256  | 0.2228  | 0.6685    | 0.968          | 6.3935          | 4.0135   | 3.693                  |
| soft_actor_critic                 | 256  | 0.214   | 0.4648    | 0.711          | 3.6907          | 3.5542   | 3.0189                 |
| alexnet                           | 128  | 0.1778  | 0.6443    | 0.921          | 5.3029          | 3.4069   | 3.2558                 |
| dcgan                             | 32   | 0.1748  | 0.5882    | 0.8289         | nan             | 3.0865   | 2.8866                 |
| lennard_jones                     | 1000 | 0.1569  | 0.4687    | 0.648          | 3.9535          | 2.2815   | 2.0527                 |
| tts_angular                       | 64   | 0.2088  | 0.3029    | 0.4441         | 1.4765          | 1.9548   | 1.742                  |
| demucs                            | 4    | 0.3502  | 0.3529    | 0.3508         | 0.3478          | 0.2914   | 0.2625                 |
| tacotron2                         | 64   | 18.1692 | 34.9294   | 56.6561        | nan             | nan      | 70.0641                |
| hf_GPT2_large                     | 4    | 5.4446  | 27.549    | nan            | nan             | nan      | 60.7474                |
| dlrm                              | 2048 | 0.4679  | 1.0799    | nan            | 5.6692          | nan      | 3.5648                 |
| hf_Longformer                     | 2    | 6.2969  | 17.513    | 86.0584        | nan             | nan      | nan                    |
| BERT_pytorch                      | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
| DALLE2_pytorch                    | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
| drq                               | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
| moco                              | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
| timm_efficientdet                 | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
| timm_efficientnet                 | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
| timm_regnet                       | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
| timm_vovnet                       | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name                              | bs   | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| hf_Albert                         | 8    | 0.9814 | 0.936     | 0.3268         | nan             | 1.1576   | 1.4693                 |
| speech_transformer                | 32   | 1.0017 | 0.9174    | 0.3318         | nan             | 1.1102   | 1.1145                 |
| Super_SloMo                       | 6    | 1.0024 | 0.9645    | 0.3843         | nan             | 1.0536   | 1.1475                 |
| timm_nfnet                        | 128  | 0.9691 | 0.8985    | nan            | 0.7873          | 1.0337   | 1.124                  |
| attention_is_all_you_need_pytorch | 256  | 0.9979 | 0.94      | nan            | nan             | 1.0179   | 1.1759                 |
| mobilenet_v2                      | 96   | 0.9857 | 0.7639    | 0.312          | nan             | 1.0072   | 1.0234                 |
| tts_angular                       | 64   | 1.0002 | 1.0002    | 0.9853         | 1.0003          | 0.9895   | 1.0002                 |
| demucs                            | 4    | 0.9872 | 0.9872    | 0.9872         | 0.9872          | 0.9872   | 0.9872                 |
| hf_GPT2                           | 4    | 0.9706 | 0.8847    | nan            | 0.8601          | 0.9649   | 1.1243                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 1.0017 | 0.8736    | 0.4235         | nan             | 0.9282   | 0.9863                 |
| Background_Matting                | 4    | 1.0059 | 0.9548    | 0.3708         | nan             | 0.9242   | 0.929                  |
| yolov3                            | 16   | 0.985  | 0.8338    | 0.3517         | nan             | 0.894    | 0.899                  |
| hf_T5                             | 8    | 0.9678 | 0.9331    | nan            | nan             | 0.8877   | 1.2496                 |
| timm_vision_transformer_large     | 8    | 0.9973 | 0.8357    | nan            | nan             | 0.879    | 0.9543                 |
| timm_resnest                      | 32   | 0.9875 | 0.8721    | 0.3483         | nan             | 0.876    | 0.9969                 |
| densenet121                       | 4    | 0.9883 | 0.866     | 0.3671         | nan             | 0.876    | 0.9569                 |
| hf_Bert                           | 4    | 0.9844 | 0.8753    | 0.3902         | nan             | 0.8735   | 0.942                  |
| pytorch_unet                      | 1    | 0.9968 | 0.8653    | 0.3573         | nan             | 0.8678   | 0.8715                 |
| fastNLP_Bert                      | 6    | 1.0012 | 0.8966    | 0.3702         | nan             | 0.8657   | 1.0681                 |
| resnet50                          | 32   | 0.9888 | 0.8617    | 0.3556         | nan             | 0.8647   | 0.8839                 |
| squeezenet1_1                     | 32   | 0.9595 | 0.7951    | 0.346          | 0.5757          | 0.8607   | 0.8929                 |
| shufflenet_v2_x1_0                | 128  | 0.956  | 0.8419    | 0.3592         | nan             | 0.8597   | 0.8957                 |
| hf_T5_large                       | 2    | 0.8541 | 0.8541    | nan            | nan             | 0.8541   | 0.8541                 |
| hf_DistilBert                     | 8    | 0.9505 | 0.8806    | 0.3413         | nan             | 0.8384   | 0.9049                 |
| dcgan                             | 32   | 0.9698 | 0.7838    | 0.4994         | nan             | 0.8283   | 0.8738                 |
| hf_Bart                           | 4    | 0.9102 | 0.831     | nan            | nan             | 0.8226   | 0.9758                 |
| hf_BigBird                        | 2    | 0.9837 | 0.9784    | 0.4539         | nan             | 0.8111   | 1.096                  |
| alexnet                           | 128  | 0.951  | 0.7753    | 0.4794         | 0.7444          | 0.7973   | 0.9099                 |
| mobilenet_v3_large                | 32   | 0.9776 | 0.8503    | 0.3454         | nan             | 0.7902   | 0.816                  |
| pytorch_stargan                   | 16   | 0.9952 | 0.9721    | 0.4271         | nan             | 0.782    | 0.8863                 |
| resnext50_32x4d                   | 8    | 0.9947 | 0.8545    | 0.3878         | nan             | 0.7622   | 0.7746                 |
| mnasnet1_0                        | 32   | 0.9788 | 0.8617    | 0.3408         | nan             | 0.7529   | 0.7734                 |
| vgg16                             | 64   | 0.9924 | 0.7339    | 0.3775         | 0.6125          | 0.7491   | 0.7534                 |
| LearningToPaint                   | 96   | 0.9245 | 0.7232    | 0.3854         | nan             | 0.7365   | 0.8099                 |
| soft_actor_critic                 | 256  | 0.9998 | 0.9149    | 0.4736         | 0.878           | 0.7295   | 1.0367                 |
| timm_vision_transformer           | 8    | 0.9952 | 0.8826    | 0.3917         | nan             | 0.7151   | 0.7249                 |
| resnet18                          | 16   | 0.9779 | 0.7727    | 0.3944         | nan             | 0.6102   | 0.6257                 |
| lennard_jones                     | 1000 | 0.9995 | 0.9997    | 0.3734         | 0.9996          | 0.564    | 0.9991                 |
| nvidia_deeprecommender            | 256  | 0.5596 | 0.5596    | 0.5125         | 0.5596          | 0.5596   | 0.5596                 |
| functorch_dp_cifar10              | 64   | 0.9964 | 0.8107    | 0.4465         | nan             | 0.4478   | 0.4806                 |
| pytorch_struct                    | 200  | 1.0    | 0.5081    | 0.4858         | 0.5081          | 0.4235   | 0.4307                 |
| hf_Reformer                       | 4    | 0.3764 | 0.9847    | 0.3481         | nan             | 0.3629   | 0.9878                 |
| hf_GPT2_large                     | 4    | 0.9582 | 0.8718    | nan            | nan             | nan      | 1.1353                 |
| dlrm                              | 2048 | 0.7301 | 0.7306    | nan            | 0.7306          | nan      | 0.7306                 |
| tacotron2                         | 64   | 0.9866 | 0.3963    | 0.3143         | nan             | nan      | 0.4113                 |
| hf_Longformer                     | 2    | 0.9734 | 0.967     | 0.3491         | nan             | nan      | nan                    |
| BERT_pytorch                      | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
| DALLE2_pytorch                    | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
| drq                               | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
| moco                              | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
| timm_efficientdet                 | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
| timm_efficientnet                 | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
| timm_regnet                       | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
| timm_vovnet                       | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | MobileBertForMaskedLM | 32 | 1.0532 | 0.8144 | 0.0 | 0.0 | 5.2202 | 1.7696 | | YituTechConvBert | 1 | 1.0238 | 0.8193 | 0.0 | 0.0 | 4.547 | 1.6355 | | CamemBert | 1 | 1.0414 | 0.8239 | 1.7254 | 0.0 | 4.3333 | 1.7607 | | MobileBertForQuestionAnswering | 64 | 1.013 | 0.8132 | 0.0 | 0.0 | 3.8005 | 1.7507 | | MT5ForConditionalGeneration | 8 | 1.0214 | 0.831 | 0.0 | 0.0 | 3.6256 | 2.2867 | | DistillGPT2 | 1 | 1.0309 | 0.8636 | 1.2921 | 0.2459 | 3.1802 | 1.9763 | | M2M100ForConditionalGeneration | 8 | 1.0923 | 0.8027 | 1.2636 | 0.0 | 2.6778 | 1.8621 | | MegatronBertForCausalLM | 16 | 1.0328 | 0.8411 | 0.9768 | 0.0 | 2.3439 | 1.7643 | | GPT2ForSequenceClassification | 4 | 1.0011 | 0.9794 | 0.0 | 0.5006 | 2.3191 | 2.2774 | | ElectraForQuestionAnswering | 64 | 1.0005 | 0.9796 | 0.7659 | 0.0 | 2.121 | 2.0681 | | MegatronBertForQuestionAnswering | 16 | 1.0304 | 0.8461 | 1.0558 | 0.0 | 2.0172 | 1.7413 | | PLBartForConditionalGeneration | 16 | 1.0136 | 0.8241 | 0.0 | 0.0 | 1.9828 | 1.7666 | | T5Small | 1 | 1.0257 | 0.8421 | 0.0 | 0.0 | 1.9009 | 1.3747 | | ElectraForCausalLM | 32 | 1.0002 | 0.9321 | 0.7143 | 0.0 | 1.8866 | 1.8909 | | LayoutLMForSequenceClassification | 16 | 1.0005 | 0.9806 | 0.7754 | 0.0 | 1.8748 | 1.8037 | | XGLMForCausalLM | 8 | 1.0107 | 0.817 | 1.0837 | 0.0 | 1.802 | 1.6247 | | MBartForConditionalGeneration | 16 | 1.0141 | 0.8303 | 0.0 | 0.0 | 1.6965 | 1.6055 | | PegasusForConditionalGeneration | 16 | 1.0096 | 0.8247 | 1.0895 | 0.0 | 1.6742 | 1.5387 | | LayoutLMForMaskedLM | 16 | 1.0001 | 0.9714 | 0.7566 | 0.0 | 1.6717 | 1.646 | | 
AlbertForQuestionAnswering | 4 | 1.0001 | 0.8858 | 0.0 | 0.0 | 1.671 | 1.66 | | AlbertForMaskedLM | 4 | 1.0001 | 0.8852 | 0.0 | 0.0 | 1.6677 | 1.6589 | | Speech2Text2ForCausalLM | 128 | 1.0035 | 0.9446 | 0.7224 | 0.0 | 1.6143 | 1.5948 | | OPTForCausalLM | 32 | 1.0099 | 0.9263 | 0.0 | 0.0 | 1.5996 | 1.5584 | | RobertaForQuestionAnswering | 128 | 0.9997 | 0.9737 | 0.7715 | 0.0 | 1.5102 | 1.4653 | | DebertaForMaskedLM | 4 | 0.9091 | 0.7242 | 0.794 | 0.0 | 1.5038 | 1.1765 | | BertForQuestionAnswering | 128 | 1.0 | 0.9842 | 0.7781 | 0.0 | 1.4995 | 1.4732 | | DistilBertForQuestionAnswering | 64 | 0.9998 | 0.9702 | 0.7414 | 0.0 | 1.4924 | 1.4498 | | BartForCausalLM | 4 | 1.0006 | 0.9699 | 0.0 | 0.0 | 1.4565 | 1.4544 | | RobertaForCausalLM | 64 | 1.0002 | 0.9597 | 0.753 | 0.0 | 1.4561 | 1.4443 | | BartForConditionalGeneration | 2 | 1.0051 | 0.9697 | 0.0 | 0.0 | 1.4521 | 1.4284 | | BlenderbotSmallForConditionalGeneration | 64 | 1.006 | 0.881 | 0.0 | 0.0 | 1.4457 | 1.4293 | | T5ForConditionalGeneration | 4 | 0.9988 | 0.8478 | 0.0 | 0.0 | 1.4316 | 1.4066 | | BertForMaskedLM | 64 | 0.9995 | 0.9568 | 0.741 | 0.0 | 1.3698 | 1.3576 | | PLBartForCausalLM | 32 | 1.0067 | 0.9413 | 0.7949 | 0.0 | 1.3212 | 1.3123 | | DistilBertForMaskedLM | 64 | 1.0007 | 0.9394 | 0.7091 | 0.0 | 1.2923 | 1.2949 | | BlenderbotSmallForCausalLM | 64 | 1.0022 | 0.9251 | 0.0 | 0.0 | 1.2851 | 1.3029 | | TrOCRForCausalLM | 32 | 1.0014 | 0.9487 | 0.0 | 0.0 | 1.2082 | 1.2081 | | MBartForCausalLM | 32 | 1.0015 | 0.9405 | 0.0 | 0.0 | 1.2048 | 1.2125 | | PegasusForCausalLM | 32 | 1.0017 | 0.9531 | 0.7493 | 0.0 | 1.1999 | 1.2006 | | DebertaForQuestionAnswering | 8 | 0.9922 | 0.878 | 0.7224 | 0.0 | 1.1553 | 1.231 | | BigBird | 1 | 0.9897 | 0.9043 | 1.0451 | 0.0 | 1.1547 | 1.033 | | AllenaiLongformerBase | 1 | 0.9459 | 0.7173 | 0.9232 | 0.0 | 0.0 | 0.0 | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ 
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | 
PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | 
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaForQuestionAnswering | 8 | 5.1337 | 13.387 | 48.9324 | nan | 114.5344 | 42.733 | | DebertaForMaskedLM | 4 | 5.0148 | 13.4971 | 48.8503 | nan | 111.8449 | 40.7341 | | XGLMForCausalLM | 8 | 2.9011 | 19.3323 | 62.2137 | nan | 110.9551 | 107.3247 | | M2M100ForConditionalGeneration | 8 | 3.2545 | 24.9489 | 35.1375 | nan | 91.5421 | 83.7799 | | MobileBertForMaskedLM | 32 | 9.4309 | 44.8018 | nan | nan | 82.6608 | 79.4414 | | MobileBertForQuestionAnswering | 64 | 9.5672 | 44.5644 | nan | nan | 80.9801 | 77.5218 | | PegasusForConditionalGeneration | 16 | 3.3515 | 23.8904 | 36.6976 | nan | 61.5306 | 57.2035 | | YituTechConvBert | 1 | 2.5675 | 15.0827 | nan | nan | 60.5289 | 56.7155 | | MBartForConditionalGeneration | 16 | 3.5779 | 24.514 | nan | nan | 59.1477 | 57.3458 | | BartForConditionalGeneration | 2 | 3.5058 | 23.9259 | nan | nan | 57.9283 | 59.1563 | | BigBird | 1 | 8.2351 | 17.9581 | 38.1998 | nan | 49.9917 | 32.3107 | | MegatronBertForCausalLM | 16 | 3.5032 | 20.9295 | 28.7282 | nan | 48.1106 | 46.5951 | | MegatronBertForQuestionAnswering | 16 | 3.5658 | 19.8299 | 28.5883 | nan | 47.4674 | 47.7053 | | MT5ForConditionalGeneration | 8 | 3.7251 | 19.9272 | nan | nan | 45.3998 | 44.7219 | | BlenderbotSmallForConditionalGeneration | 64 | 2.1897 | 16.0695 | nan | nan | 41.8997 | 39.7041 | | T5Small | 1 | 2.3742 | 12.9028 | nan | nan | 40.8421 | 40.8512 | | PLBartForConditionalGeneration | 16 | 1.8048 | 
12.4558 | nan | nan | 37.099 | 35.8492 | | LayoutLMForSequenceClassification | 16 | 2.0685 | 10.4133 | 14.7917 | nan | 36.2353 | 35.0163 | | T5ForConditionalGeneration | 4 | 2.4168 | 12.8866 | nan | nan | 36.158 | 35.466 | | ElectraForCausalLM | 32 | 1.7389 | 10.0882 | 14.325 | nan | 32.4045 | 29.8567 | | PegasusForCausalLM | 32 | 1.2912 | 9.2042 | 13.5113 | nan | 29.3399 | 27.6953 | | MBartForCausalLM | 32 | 1.2556 | 9.1803 | nan | nan | 28.1514 | 26.5347 | | TrOCRForCausalLM | 32 | 1.2149 | 9.162 | nan | nan | 26.9633 | 25.9731 | | LayoutLMForMaskedLM | 16 | 2.0929 | 10.5512 | 14.7817 | nan | 26.6906 | 25.3816 | | BartForCausalLM | 4 | 1.3015 | 9.2803 | nan | nan | 26.5968 | 25.6657 | | BertForMaskedLM | 64 | 1.5932 | 9.7999 | 13.5347 | nan | 26.3467 | 25.2186 | | OPTForCausalLM | 32 | 1.3145 | 9.3697 | nan | nan | 26.3316 | 25.525 | | ElectraForQuestionAnswering | 64 | 1.7198 | 10.0508 | 14.6285 | nan | 26.0891 | 25.265 | | RobertaForCausalLM | 64 | 1.6341 | 10.6038 | 13.9044 | nan | 24.8666 | 23.9087 | | BertForQuestionAnswering | 128 | 1.5895 | 9.8667 | 13.5474 | nan | 24.4399 | 23.7781 | | CamemBert | 1 | 1.673 | 10.1184 | 13.1708 | nan | 23.8968 | 23.2641 | | GPT2ForSequenceClassification | 4 | 1.5389 | 8.9736 | nan | 85.9433 | 23.5329 | 22.8334 | | RobertaForQuestionAnswering | 128 | 1.63 | 9.9709 | 13.7999 | nan | 23.2123 | 22.3223 | | AlbertForMaskedLM | 4 | 1.4586 | 9.6474 | nan | nan | 22.6361 | 21.1682 | | AlbertForQuestionAnswering | 4 | 1.3101 | 9.5103 | nan | nan | 22.1687 | 20.6836 | | BlenderbotSmallForCausalLM | 64 | 0.8463 | 6.0616 | nan | nan | 21.5948 | 20.9252 | | Speech2Text2ForCausalLM | 128 | 0.7537 | 4.7464 | 7.1436 | nan | 18.9158 | 17.7977 | | PLBartForCausalLM | 32 | 0.6531 | 4.7334 | 6.6982 | nan | 18.2356 | 17.6639 | | DistillGPT2 | 1 | 0.7706 | 4.5961 | 6.0364 | 49.682 | 18.1949 | 18.2102 | | DistilBertForMaskedLM | 64 | 0.5928 | 4.7838 | 8.8644 | nan | 15.7986 | 15.0615 | | DistilBertForQuestionAnswering | 64 | 0.6918 | 4.8264 | 
9.2264 | nan | 15.2322 | 14.2533 | | AllenaiLongformerBase | 1 | 6.7894 | 18.8195 | 89.3955 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 | | AlbertForMaskedLM | 4 | 0.9998 | 0.7431 | nan | nan | 1.1078 | 1.5319 | | BartForCausalLM | 4 | 1.0 | 0.8997 | nan | nan | 1.0943 | 1.1562 | | GPT2ForSequenceClassification | 4 | 0.9675 | 0.9164 | nan | 0.8823 | 1.0775 | 1.1632 | | PegasusForCausalLM | 32 | 0.9749 | 0.8906 | 0.4175 | nan | 1.0189 | 1.0913 | | ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 | | RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 | | BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 | | LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 | | BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9913 | 1.1976 | | MBartForCausalLM | 32 | 1.0 | 0.8924 | nan | nan | 0.9868 | 1.0636 | | LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3662 | nan | 0.9866 | 1.0264 | | OPTForCausalLM | 32 | 0.9996 | 0.8679 | nan | nan | 0.9834 | 1.0756 | | BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3787 | nan | 0.9803 | 1.0362 | | RobertaForCausalLM | 64 | 0.9991 | 0.8994 | 0.3788 | nan | 0.9794 | 1.0352 | | TrOCRForCausalLM | 32 | 1.0 | 0.8921 | nan | nan | 0.9642 | 1.0376 | | BlenderbotSmallForConditionalGeneration | 64 
| 0.9999 | 0.8918 | nan | nan | 0.9593 | 1.1105 | | T5Small | 1 | 1.0 | 0.884 | nan | nan | 0.9579 | 1.1475 | | T5ForConditionalGeneration | 4 | 0.9996 | 0.9527 | nan | nan | 0.9503 | 1.2292 | | DistilBertForMaskedLM | 64 | 1.0 | 0.86 | 0.3635 | nan | 0.9481 | 1.0273 | | Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8196 | 0.3532 | nan | 0.946 | 1.0791 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | nan | nan | 0.9335 | 1.0986 | | ElectraForCausalLM | 32 | 0.9974 | 0.848 | 0.3928 | nan | 0.927 | 1.0177 | | BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | nan | nan | 0.9269 | 1.0441 | | DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3469 | nan | 0.9267 | 1.0655 | | PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3978 | nan | 0.9217 | 1.0168 | | MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | nan | nan | 0.919 | 0.919 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.962 | 0.4377 | nan | 0.9159 | 1.0993 | | MegatronBertForCausalLM | 16 | 0.9998 | 0.8597 | 0.4044 | nan | 0.9036 | 1.0275 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0179 | | PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | nan | nan | 0.8835 | 1.0295 | | BigBird | 1 | 1.0008 | 0.9547 | 0.448 | nan | 0.8348 | 1.1052 | | XGLMForCausalLM | 8 | 0.9918 | 0.9234 | 0.4336 | nan | 0.8333 | 1.0324 | | DistillGPT2 | 1 | 0.9963 | 0.7984 | 0.4007 | 0.7469 | 0.817 | 1.0175 | | CamemBert | 1 | 0.9989 | 0.8143 | 0.416 | nan | 0.8153 | 0.931 | | YituTechConvBert | 1 | 0.9718 | 0.8648 | nan | nan | 0.7974 | 0.9279 | | M2M100ForConditionalGeneration | 8 | 1.0094 | 0.9401 | 0.4439 | nan | 0.7672 | 1.0571 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | nan | nan | 0.6698 | 0.8915 | | MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | nan | nan | 0.6085 | 0.8221 | | DebertaForMaskedLM | 4 | 0.9982 | 0.9824 | 0.3623 | nan | 0.4154 | 1.1123 | | DebertaForQuestionAnswering | 8 | 0.9754 | 1.0737 | 0.3252 | nan | 0.3071 | 1.1931 | | AllenaiLongformerBase | 1 
| 0.9977 | 0.9473 | 0.3853 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | tnt_s_patch16_224 | 128 | 0.9998 | 0.998 | 0.0 | 0.0 | 2.1332 | 2.0922 | | xcit_large_24_p8_224 | 5 | 1.0024 | 0.0 | 0.0 | 0.0 | 1.8409 | 1.7255 | | twins_pcpvt_base | 64 | 1.0032 | 0.9118 | 0.8971 | 0.0 | 1.7853 | 1.6869 | | volo_d1_224 | 64 | 0.9997 | 0.9945 | 0.0 | 0.0 | 1.5965 | 1.5647 | | dla102 | 128 | 0.9998 | 0.9961 | 0.8377 | 0.0 | 1.581 | 1.549 | | nfnet_l0 | 128 | 0.9985 | 0.8094 | 0.7133 | 0.9585 | 1.5452 | 1.4672 | | cait_m36_384 | 4 | 1.0002 | 1.0044 | 0.0 | 0.0 | 1.5421 | 1.4167 | | swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9589 | 0.0 | 0.0 | 1.5415 | 1.4788 | | hrnet_w18 | 128 | 1.0 | 0.9936 | 0.8377 | 0.0 | 1.5384 | 1.4634 | | resnest101e | 64 | 1.0 | 0.9926 | 0.8189 | 0.0 | 1.5342 | 1.4314 | | coat_lite_mini | 128 | 0.9998 | 0.9771 | 0.8596 | 0.0 | 1.5269 | 1.5086 | | gmlp_s16_224 | 128 | 0.9999 | 0.9437 | 0.0 | 0.0 | 1.5189 | 1.4737 | | gmixer_24_224 | 128 | 0.9998 | 0.8432 | 0.0 | 0.0 | 1.5152 | 1.4799 | | dm_nfnet_f0 | 128 | 0.999 | 0.998 | 0.0 | 1.1212 | 1.5058 | 1.4321 | | gluon_inception_v3 | 128 | 0.9999 | 0.9965 | 0.8538 | 0.0 | 1.5055 | 1.4718 | | adv_inception_v3 | 128 | 0.9999 | 0.9932 | 0.8536 | 0.0 | 1.502 | 1.4734 | | inception_v3 | 128 | 0.9998 | 0.9961 | 0.8535 | 0.0 | 1.5015 | 1.4633 | | convnext_base | 64 | 0.9997 | 0.9972 | 0.0 | 0.0 | 1.4806 | 1.4278 | | res2net50_14w_8s | 128 | 0.9998 | 0.9945 | 0.8105 | 0.0 | 1.4664 | 1.4044 | | crossvit_9_240 | 128 | 0.9996 | 0.9932 | 0.8378 | 0.0 | 1.447 | 1.4156 | | selecsls42b | 128 | 0.9998 | 0.9959 | 0.841 | 0.0 | 1.4433 | 1.4135 | | res2next50 | 128 | 0.9995 | 0.9962 | 0.8318 | 0.0 | 
1.4127 | 1.3474 | | jx_nest_base | 32 | 0.9999 | 0.9939 | 0.0 | 0.0 | 1.3874 | 1.3495 | | convit_base | 64 | 1.0001 | 0.997 | 0.0 | 0.0 | 1.3734 | 1.3229 | | poolformer_m36 | 64 | 0.9998 | 0.9983 | 0.8078 | 0.0 | 1.3261 | 1.2955 | | pit_b_224 | 64 | 0.9998 | 0.9958 | 0.8212 | 0.0 | 1.3215 | 1.3173 | | res2net101_26w_4s | 64 | 1.0027 | 1.0063 | 0.7825 | 0.0 | 1.2871 | 1.2242 | | beit_base_patch16_224 | 64 | 1.0001 | 0.977 | 0.0 | 0.0 | 1.2836 | 1.272 | | mixer_b16_224 | 128 | 0.9999 | 0.9586 | 0.7776 | 0.0 | 1.283 | 1.2651 | | deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.992 | 0.7971 | 0.0 | 1.2812 | 1.2651 | | visformer_small | 128 | 0.9999 | 1.0022 | 0.8409 | 0.0 | 1.2415 | 1.1848 | | vit_base_patch16_224 | 64 | 0.9999 | 0.9943 | 0.8346 | 0.0 | 1.1961 | 1.1843 | | gluon_xception65 | 32 | 0.9993 | 0.9889 | 0.7545 | 0.0 | 1.1581 | 1.1266 | | swsl_resnext101_32x16d | 32 | 0.9994 | 0.9847 | 0.818 | 0.0 | 1.1458 | 1.0707 | | resmlp_12_224 | 128 | 1.0005 | 1.0055 | 0.7882 | 0.0 | 1.1146 | 1.0902 | | convmixer_768_32 | 32 | 0.9999 | 0.9981 | 0.9233 | 0.0 | 1.0555 | 1.0508 | | repvgg_a2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilenetv3_large_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilevit_s | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | pnasnet5large | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | regnety_002 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | sebotnet33ts_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | rexnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mnasnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | spnasnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tf_efficientnet_b0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tf_mixnet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilenetv2_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | lcnet_050 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mixnet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | ghostnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | gernet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
0.0 | | fbnetv3_b | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | fbnetc_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | ese_vovnet19b_dw | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | eca_halonext26ts | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | eca_botnext26ts_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | dpn107 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | cspdarknet53 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | botnet26t_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tinynet_a | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | pit_b_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | pnasnet5large | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass | | rexnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | fail_to_run | pass | pass | | tinynet_a | 2 | pass | pass | pass | fail_to_run | pass | pass | | 
twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | gmlp_s16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass | | res2next50 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | 
pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | fail_to_run | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | dla102 | 2 | pass | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | mixnet_l | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | fbnetv3_b | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | gluon_xception65 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | twins_pcpvt_base | 64 | 2.7806 | 21.2707 | 33.9545 | nan | 149.5784 | 145.2432 | | hrnet_w18 | 128 | 6.6016 | 44.576 | 75.5388 | nan | 130.6107 | 128.2492 | | swin_base_patch4_window7_224 | 64 | 3.0262 | 18.3365 | nan | nan | 99.1105 | 96.2692 | | xcit_large_24_p8_224 | 5 | 3.311 | nan | nan | nan | 97.6525 | 94.9204 | | convnext_base | 64 | 1.5648 | 10.7544 | nan | nan | 91.5385 | 91.2244 | | cait_m36_384 | 4 | 3.5714 | 28.0631 | nan | nan | 88.4667 | 82.7375 | | resnest101e | 64 | 3.4383 | 23.956 | 37.8557 | nan | 86.0647 | 86.3579 | | jx_nest_base | 32 | 1.8132 | 13.4438 | nan | nan | 76.2471 | 73.6579 | | coat_lite_mini | 128 | 1.209 | 7.6796 | 10.9335 | nan | 72.7952 | 72.8164 | | res2net101_26w_4s | 64 | 3.4138 | 24.384 | 36.9988 | nan | 69.3431 | 66.1241 | | res2net50_14w_8s | 128 | 2.8777 | 21.3707 | 32.3919 | nan | 62.7107 | 59.4905 | | gmlp_s16_224 | 128 | 1.3563 | 10.6405 | nan | nan | 53.3731 | 51.9517 | | poolformer_m36 | 64 | 1.9913 | 12.5642 | 18.5196 | nan | 52.8184 | 49.5716 | | crossvit_9_240 | 128 | 1.6898 | 12.4566 | 17.566 | nan | 52.267 | 50.8153 | | volo_d1_224 | 64 | 1.3581 | 11.0579 | nan | nan | 49.5161 | 46.2842 | | gluon_xception65 | 32 | 2.1221 | 15.7371 | 23.6177 | nan | 48.4982 | 45.7509 | | tnt_s_patch16_224 | 128 | 1.8551 | 15.5623 | nan | nan | 46.5381 | 44.4174 | | gmixer_24_224 | 128 | 1.5213 | 12.1193 | nan | nan | 42.4744 | 40.5205 | | adv_inception_v3 | 128 | 1.7198 | 12.6633 | 18.1988 | nan | 42.4164 | 38.3995 | | gluon_inception_v3 | 128 | 1.6751 | 12.7174 | 18.3739 | nan | 41.6608 | 38.6246 | | inception_v3 | 128 | 1.6841 | 12.6013 | 18.2059 | nan | 41.2244 | 39.7411 | | swsl_resnext101_32x16d | 32 | 1.9091 | 13.8692 | 20.1058 | nan | 41.1979 | 38.8514 | | dla102 | 128 | 1.9406 | 14.3637 | 20.7291 | nan | 39.8855 | 39.112 | | res2next50 | 128 | 1.8341 | 12.1762 | 17.6084 | nan | 35.7217 | 
33.628 | | convit_base | 64 | 1.2536 | 8.9109 | nan | nan | 35.4171 | 34.4447 | | dm_nfnet_f0 | 128 | 2.3773 | 10.1632 | nan | 162.2244 | 34.8661 | 33.1822 | | convmixer_768_32 | 32 | 1.3356 | 9.3519 | 13.0687 | nan | 30.5793 | 28.9793 | | visformer_small | 128 | 0.9533 | 5.6058 | 8.4093 | nan | 29.7243 | 27.5589 | | mixer_b16_224 | 128 | 0.815 | 5.3796 | 8.7152 | nan | 28.7932 | 27.8077 | | resmlp_12_224 | 128 | 0.7936 | 4.4122 | 8.1227 | nan | 28.4168 | 27.3337 | | deit_base_distilled_patch16_224 | 64 | 0.9722 | 6.8066 | 9.5135 | nan | 28.1776 | 27.004 | | vit_base_patch16_224 | 64 | 1.0905 | 6.7565 | 9.8134 | nan | 27.6827 | 26.621 | | nfnet_l0 | 128 | 1.917 | 10.2543 | 14.11 | 151.0553 | 26.4034 | 24.9699 | | beit_base_patch16_224 | 64 | 1.3696 | 7.998 | nan | nan | 26.0396 | 24.1836 | | pit_b_224 | 64 | 1.1532 | 7.9184 | 11.2192 | nan | 23.9553 | 22.7493 | | selecsls42b | 128 | 0.854 | 5.6797 | 7.8746 | nan | 19.4256 | 18.5129 | | botnet26t_256 | 0 | nan | nan | nan | nan | nan | nan | | cspdarknet53 | 0 | nan | nan | nan | nan | nan | nan | | dpn107 | 0 | nan | nan | nan | nan | nan | nan | | eca_botnext26ts_256 | 0 | nan | nan | nan | nan | nan | nan | | eca_halonext26ts | 0 | nan | nan | nan | nan | nan | nan | | ese_vovnet19b_dw | 0 | nan | nan | nan | nan | nan | nan | | fbnetc_100 | 0 | nan | nan | nan | nan | nan | nan | | fbnetv3_b | 0 | nan | nan | nan | nan | nan | nan | | gernet_l | 0 | nan | nan | nan | nan | nan | nan | | ghostnet_100 | 0 | nan | nan | nan | nan | nan | nan | | lcnet_050 | 0 | nan | nan | nan | nan | nan | nan | | mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | mnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv2_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv3_large_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilevit_s | 0 | nan | nan | nan | nan | nan | nan | | pnasnet5large | 0 | nan | nan | nan | nan | nan | nan | | regnety_002 | 0 | nan | nan | nan | nan | nan | nan | | repvgg_a2 | 0 | nan 
| nan | nan | nan | nan | nan | | rexnet_100 | 0 | nan | nan | nan | nan | nan | nan | | sebotnet33ts_256 | 0 | nan | nan | nan | nan | nan | nan | | spnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | tf_efficientnet_b0 | 0 | nan | nan | nan | nan | nan | nan | | tf_mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | tinynet_a | 0 | nan | nan | nan | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9926 | 0.9248 | nan | nan | 1.3102 | 1.3732 | | gmlp_s16_224 | 128 | 0.9938 | 0.9495 | nan | nan | 1.2842 | 1.2998 | | poolformer_m36 | 64 | 0.9983 | 0.9433 | 0.3413 | nan | 1.1018 | 1.1171 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0829 | 1.1492 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3474 | nan | 1.056 | 1.0625 | | convit_base | 64 | 0.9966 | 0.8516 | nan | nan | 1.0528 | 1.1534 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | nan | nan | 1.0378 | 1.1081 | | dm_nfnet_f0 | 128 | 0.9692 | 0.8981 | nan | 0.7871 | 1.0336 | 1.1239 | | nfnet_l0 | 128 | 0.9887 | 0.8167 | 0.2678 | 0.5242 | 1.0318 | 1.0749 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | nan | 0.9907 | 1.2278 | | convmixer_768_32 | 32 | 0.9972 | 0.9785 | 0.3447 | nan | 0.9759 | 0.9792 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9749 | 1.0803 | | visformer_small | 128 | 0.9897 | 0.9255 | 0.3467 | nan | 0.9613 | 1.0515 | | dla102 | 128 | 0.9684 | 0.9114 | 0.3365 | nan | 
0.9431 | 0.9501 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9932 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.929 | 0.9775 | | swsl_resnext101_32x16d | 32 | 0.9988 | 0.8771 | 0.3667 | nan | 0.9093 | 0.9339 | | mixer_b16_224 | 128 | 0.992 | 0.9362 | 0.3444 | nan | 0.9073 | 0.9799 | | res2net101_26w_4s | 64 | 0.994 | 0.9149 | 0.3339 | nan | 0.8973 | 0.9223 | | hrnet_w18 | 128 | 0.9914 | 0.9175 | 0.3348 | nan | 0.8966 | 0.9381 | | selecsls42b | 128 | 0.9796 | 0.8773 | 0.3532 | nan | 0.892 | 0.918 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | nan | 0.8916 | 0.8968 | | gluon_xception65 | 32 | 0.9955 | 0.8848 | 0.3345 | nan | 0.8914 | 0.8961 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | nan | 0.8911 | 0.8966 | | adv_inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | nan | 0.8838 | 0.8985 | | inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | nan | 0.8838 | 0.8985 | | gluon_inception_v3 | 128 | 0.9816 | 0.862 | 0.3341 | nan | 0.8838 | 0.8985 | | res2net50_14w_8s | 128 | 0.9907 | 0.907 | 0.3231 | nan | 0.8765 | 0.8997 | | convnext_base | 64 | 1.003 | 0.9263 | nan | nan | 0.8763 | 0.9864 | | res2next50 | 128 | 0.991 | 0.9094 | 0.32 | nan | 0.871 | 0.8961 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | nan | 0.8174 | 1.0974 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | nan | 0.8033 | 1.0354 | | resmlp_12_224 | 128 | 0.9827 | 0.687 | 0.2373 | nan | 0.7876 | 0.8011 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 | | jx_nest_base | 32 | 0.9983 | 0.8927 | nan | nan | 0.6707 | 0.8618 | | botnet26t_256 | 0 | nan | nan | nan | nan | nan | nan | | cspdarknet53 | 0 | nan | nan | nan | nan | nan | nan | | dpn107 | 0 | nan | nan | nan | nan | nan | nan | | eca_botnext26ts_256 | 0 | nan | nan | nan | nan | nan | nan | | eca_halonext26ts | 0 | nan | nan | nan | nan | nan | nan | | ese_vovnet19b_dw | 0 | nan | nan | nan | nan | nan | nan | | fbnetc_100 | 0 | nan | nan | nan | 
nan | nan | nan | | fbnetv3_b | 0 | nan | nan | nan | nan | nan | nan | | gernet_l | 0 | nan | nan | nan | nan | nan | nan | | ghostnet_100 | 0 | nan | nan | nan | nan | nan | nan | | lcnet_050 | 0 | nan | nan | nan | nan | nan | nan | | mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | mnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv2_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv3_large_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilevit_s | 0 | nan | nan | nan | nan | nan | nan | | pnasnet5large | 0 | nan | nan | nan | nan | nan | nan | | regnety_002 | 0 | nan | nan | nan | nan | nan | nan | | repvgg_a2 | 0 | nan | nan | nan | nan | nan | nan | | rexnet_100 | 0 | nan | nan | nan | nan | nan | nan | | sebotnet33ts_256 | 0 | nan | nan | nan | nan | nan | nan | | spnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | tf_efficientnet_b0 | 0 | nan | nan | nan | nan | nan | nan | | tf_mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | tinynet_a | 0 | nan | nan | nan | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/lVx4Thk.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/7rLnF6S.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/gdOkLBS.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring speedup, compilation latency, and memory footprint reduction, we exclude the models that fail the accuracy check.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 85%, 47/55 | 100%, 43/43 | 59%, 36/61  |
|       aot_eager        | 82%, 45/55 | 100%, 43/43 | 56%, 34/61  |
|     aot_cudagraphs     | 67%, 37/55 | 49%, 21/43  |  11%, 7/61  |
|    nvprims_nvfuser     | 49%, 27/55 |  5%, 2/43   | 16%, 10/61  |
|        inductor        | 75%, 41/55 | 93%, 40/43  | 56%, 34/61  |
| inductor_no_cudagraphs | 80%, 44/55 | 93%, 40/43  | 56%, 34/61  |
+------------------------+------------+-------------+-------------+
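
For readers who want to reproduce the passrate cells from the per-model accuracy tables, here is a minimal sketch. It assumes each cell is the count of models whose accuracy status is exactly `pass`, divided by the number of models in the suite, with the percentage rounded to the nearest integer. The function name `pass_rate` and the rounding rule are assumptions for illustration, not taken from the benchmark scripts.

```python
def pass_rate(statuses):
    """Format a passrate cell such as '85%, 47/55'.

    `statuses` is one accuracy status per model, using the strings that
    appear in the accuracy tables: 'pass', 'fail_to_run', 'fail_accuracy'.
    Rounding to the nearest integer percent is an assumption.
    """
    total = len(statuses)
    if total == 0:
        return "n/a"
    passed = sum(1 for status in statuses if status == "pass")
    return f"{round(100 * passed / total)}%, {passed}/{total}"


# Hypothetical status list shaped like the eager/torchbench cell above.
example = ["pass"] * 47 + ["fail_to_run"] * 8
print(pass_rate(example))
```

Only `pass` counts as success in this sketch, so both `fail_to_run` and `fail_accuracy` lower the rate.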

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.06x    |    1.02x    |    1.00x    |
|    nvprims_nvfuser     |   1.03x    |    1.00x    |    1.15x    |
|        inductor        |   1.40x    |    1.30x    |    1.23x    |
| inductor_no_cudagraphs |   1.22x    |    1.21x    |    1.22x    |
+------------------------+------------+-------------+-------------+
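
A rough sketch of how a geometric mean speedup like the ones above can be computed from per-model numbers. It assumes speedup is eager latency divided by compiled latency, and that models that failed (reported as `0.0` or `nan` in the per-model tables) are dropped before averaging. Both helper names are hypothetical and the actual dashboard scripts may differ.

```python
import math


def speedup(eager_latency, compiled_latency):
    """Speedup normalized against native PyTorch (assumed: eager / compiled)."""
    return eager_latency / compiled_latency


def geometric_mean_speedup(speedups):
    """Geometric mean over models with a valid, positive speedup.

    Entries that are 0.0 or nan are treated as failed runs and skipped;
    this filtering rule is an assumption based on the summary text.
    """
    valid = [s for s in speedups if s > 0]  # nan > 0 is False, so nan is skipped
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Hypothetical per-model speedups; the 0.0 entry stands for a failed model.
print(f"{geometric_mean_speedup([2.0, 0.5, 0.0]):.2f}x")
```

The geometric mean is the natural choice for ratios: a 2x speedup on one model and a 0.5x slowdown on another average out to 1x, which an arithmetic mean would overstate as 1.25x.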

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.81    |    2.30     |    1.67     |
|       aot_eager        |    6.93    |    9.94     |    10.22    |
|     aot_cudagraphs     |    9.22    |    20.51    |    12.16    |
|    nvprims_nvfuser     |   58.42    |    59.86    |   153.93    |
|        inductor        |   25.09    |    34.84    |    45.13    |
| inductor_no_cudagraphs |   25.58    |    29.60    |    43.93    |
+------------------------+------------+-------------+-------------+
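The per-model latency tables use `nan` where a backend produced no measurement. A mean over such a column might be computed as below. This is a sketch with a hypothetical helper name, and it assumes `nan` entries are simply skipped.

```python
import math


def mean_ignoring_nan(values):
    """Arithmetic mean of the values that are not NaN; NaN if none remain."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")


# Hypothetical compilation latencies in seconds; nan marks a failed run.
print(mean_ignoring_nan([1.0, float("nan"), 3.0]))  # prints 2.0
```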

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.90x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.82x    |    0.82x    |    0.86x    |
|        inductor        |   0.81x    |    0.71x    |    0.96x    |
| inductor_no_cudagraphs |   0.97x    |    0.96x    |    1.05x    |
+------------------------+------------+-------------+-------------+
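As a reading aid, the ratio can be sketched as baseline peak memory divided by backend peak memory, so values below 1.00x mean the backend used more memory than native PyTorch. That definition is an assumption consistent with "higher is better", and the function name is hypothetical. In practice the peak values could be collected with `torch.cuda.reset_peak_memory_stats()` followed by `torch.cuda.max_memory_allocated()`, but the sketch below takes them as plain numbers.

```python
def compression_ratio(baseline_peak_bytes, backend_peak_bytes):
    """Peak-memory compression ratio; > 1.0 means the backend used less memory."""
    # Assumed definition: baseline (native PyTorch) peak over backend peak.
    return baseline_peak_bytes / backend_peak_bytes


# Hypothetical peaks: native PyTorch at 8 GiB, backend at 10 GiB.
GIB = 1024 ** 3
print(f"{compression_ratio(8 * GIB, 10 * GIB):.2f}x")  # prints 0.80x
```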

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0 | 1.0002 | 1.729 | 0.6962 | 4.0995 | 1.4091 | | timm_vision_transformer | 8 | 1.0064 | 0.9272 | 1.5117 | 0.0 | 2.6028 | 1.4017 | | functorch_dp_cifar10 | 64 | 1.0093 | 0.9463 | 1.4299 | 0.0 | 2.5495 | 1.3447 | | drq | 1 | 0.9933 | 0.8152 | 1.2908 | 0.0 | 1.9336 | 1.1021 | | pytorch_struct | 200 | 0.9952 | 0.7417 | 0.8838 | 0.7996 | 1.8019 | 1.1535 | | lennard_jones | 1000 | 0.9602 | 0.821 | 1.0399 | 0.6741 | 1.7806 | 0.9357 | | mobilenet_v3_large | 32 | 1.0067 | 1.1139 | 0.8895 | 0.8486 | 1.7197 | 1.4189 | | hf_Albert | 8 | 1.0012 | 0.9971 | 0.753 | 0.0 | 1.6495 | 1.6416 | | resnext50_32x4d | 8 | 1.0024 | 1.1292 | 0.9262 | 0.7395 | 1.6246 | 1.3365 | | speech_transformer | 32 | 1.0047 | 0.9039 | 1.5195 | 0.0 | 1.5436 | 1.5497 | | timm_resnest | 32 | 0.9995 | 1.0014 | 0.8052 | 1.1751 | 1.5198 | 1.4542 | | hf_GPT2 | 4 | 1.0106 | 0.9815 | 0.7395 | 0.4041 | 1.5012 | 1.5042 | | timm_nfnet | 128 | 0.9999 | 0.9998 | 0.0 | 1.2382 | 1.4733 | 1.4243 | | shufflenet_v2_x1_0 | 128 | 0.9997 | 0.9998 | 0.7687 | 0.9012 | 1.4483 | 1.4009 | | mobilenet_v2_quantized_qat | 96 | 1.0013 | 0.9749 | 0.0 | 1.4402 | 1.4347 | 1.429 | | mobilenet_v2 | 96 | 0.9997 | 0.9995 | 0.7308 | 0.0 | 1.4281 | 1.4099 | | fastNLP_Bert | 6 | 0.9983 | 0.9753 | 0.7535 | 0.0 | 1.4234 | 1.3961 | | hf_T5_large | 2 | 1.0237 | 0.858 | 0.0 | 0.0 | 1.4084 | 1.4062 | | soft_actor_critic | 256 | 1.0062 | 0.773 | 1.0649 | 0.6505 | 1.4032 | 0.9408 | | resnet18 | 16 | 1.0057 | 1.1044 | 0.9021 | 0.8684 | 1.4011 | 1.2541 | | resnet50_quantized_qat | 32 | 1.0014 | 0.9581 | 0.0 | 1.1964 | 1.3825 | 1.384 | | 
mnasnet1_0 | 32 | 0.9992 | 1.062 | 0.8048 | 0.9323 | 1.3757 | 1.2902 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9935 | 0.9377 | 0.9682 | 0.7944 | 1.3703 | 1.2631 | | squeezenet1_1 | 32 | 1.0042 | 1.0105 | 0.8161 | 0.7837 | 1.3503 | 1.2943 | | dcgan | 32 | 0.9771 | 1.0081 | 1.005 | 0.7535 | 1.3235 | 1.0438 | | LearningToPaint | 96 | 0.9998 | 1.0015 | 0.8025 | 1.0022 | 1.2134 | 1.1748 | | resnet50 | 32 | 0.9991 | 0.9925 | 0.7606 | 1.0675 | 1.2048 | 1.1684 | | hf_Bert | 4 | 1.034 | 1.0028 | 0.732 | 0.0 | 1.2039 | 1.1839 | | pytorch_unet | 1 | 0.9998 | 0.9978 | 0.8454 | 1.0956 | 1.197 | 1.1858 | | hf_Bart | 4 | 1.0117 | 0.9787 | 0.0 | 0.0 | 1.1961 | 1.1981 | | Super_SloMo | 6 | 1.0007 | 0.9976 | 0.8669 | 1.0031 | 1.1809 | 1.1662 | | hf_DistilBert | 8 | 1.0003 | 0.9571 | 0.6873 | 0.0 | 1.1749 | 1.1815 | | vgg16 | 64 | 0.9994 | 0.999 | 0.859 | 0.9981 | 1.1716 | 1.1664 | | alexnet | 128 | 0.999 | 0.9973 | 0.803 | 0.9998 | 1.1612 | 1.1625 | | Background_Matting | 4 | 1.0004 | 1.0234 | 0.8651 | 1.0795 | 1.1175 | 1.1099 | | pytorch_stargan | 16 | 0.9992 | 0.9841 | 0.857 | 0.0 | 1.1158 | 1.0956 | | hf_Reformer | 4 | 0.9967 | 0.0 | 0.9195 | 0.0 | 1.1051 | 1.129 | | yolov3 | 16 | 0.9998 | 0.9934 | 0.7934 | 1.193 | 1.0944 | 1.0786 | | hf_BigBird | 2 | 0.9898 | 0.9411 | 0.9537 | 0.0 | 1.0923 | 0.9991 | | attention_is_all_you_need_pytorch | 256 | 1.0001 | 0.9716 | 0.0 | 0.0 | 1.0626 | 1.0517 | | timm_vision_transformer_large | 8 | 0.9997 | 0.9932 | 0.0 | 0.0 | 1.0452 | 1.0321 | | tts_angular | 64 | 0.9879 | 0.9621 | 0.9915 | 0.9786 | 1.0103 | 1.0112 | | demucs | 4 | 0.9999 | 0.9997 | 0.9992 | 0.9998 | 0.9995 | 0.9996 | | nvidia_deeprecommender | 256 | 0.9988 | 0.9639 | 0.5851 | 0.8374 | 0.9033 | 0.9636 | | dlrm | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_regnet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_efficientnet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_efficientdet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | 
hf_T5 | 8 | 1.0024 | 0.8173 | 0.0 | 0.0 | 0.0 | 1.1083 | | BERT_pytorch | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 2 | 0.9619 | 0.8922 | 0.8148 | 0.0 | 0.0 | 0.0 | | tacotron2 | 64 | 0.9742 | 0.8228 | 0.0 | 0.0 | 0.0 | 0.8932 | | hf_GPT2_large | 4 | 1.0006 | 0.9804 | 0.0 | 0.5683 | 0.0 | 1.47 | | timm_vovnet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | drq | 1 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass 
| pass | fail_to_run | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass 
| | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_to_run | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | dlrm | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | BERT_pytorch | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | yolov3 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.8115 | 9.3018 | 12.6216 | 117.0811 | 364.3524 | 366.1503 | | hf_T5_large | 2 | 13.8284 | 46.6744 | nan | nan | 
124.5777 | 123.4937 | | timm_resnest | 32 | 0.5364 | 2.8301 | 3.9865 | 54.3924 | 69.1663 | 68.0954 | | timm_vision_transformer_large | 8 | 2.2733 | 15.3555 | nan | nan | 67.5839 | 65.5171 | | attention_is_all_you_need_pytorch | 256 | 1.1341 | 7.7929 | nan | nan | 56.6741 | 56.2158 | | timm_vision_transformer | 8 | 0.7677 | 4.5674 | 6.2645 | nan | 52.9313 | 51.4381 | | pytorch_stargan | 16 | 0.3714 | 2.5255 | 3.3546 | nan | 50.4344 | 49.0677 | | densenet121 | 4 | 2.0694 | 14.3433 | 21.2853 | 202.4909 | 46.6107 | 45.4469 | | pytorch_struct | 200 | 0.2366 | 0.8586 | 1.4164 | 5.9341 | 44.0005 | 40.2954 | | hf_BigBird | 2 | 7.3638 | 14.3882 | 30.209 | nan | 41.6494 | 26.6148 | | resnet50_quantized_qat | 32 | 1.0829 | 10.0375 | nan | 176.6514 | 31.8576 | 32.803 | | hf_Bart | 4 | 1.4298 | 8.9193 | nan | nan | 31.604 | 29.9235 | | timm_nfnet | 128 | 1.9477 | 8.3264 | nan | 151.2063 | 29.7096 | 29.1044 | | fastNLP_Bert | 6 | 1.4539 | 7.2734 | 10.8399 | nan | 28.9193 | 27.4635 | | mobilenet_v2_quantized_qat | 96 | 1.2261 | 9.679 | nan | 206.3932 | 28.8796 | 28.6575 | | mobilenet_v3_large | 32 | 0.8337 | 5.1244 | 7.1639 | 95.8025 | 28.8095 | 29.6084 | | speech_transformer | 32 | 1.6085 | 9.0263 | 57.4883 | nan | 28.6724 | 27.0355 | | hf_Reformer | 4 | 2.4019 | nan | 9.4234 | nan | 27.6633 | 22.0328 | | mnasnet1_0 | 32 | 0.7484 | 4.9033 | 6.778 | 70.9925 | 22.8144 | 21.9044 | | resnet50 | 32 | 0.8195 | 5.2933 | 7.2589 | 72.8228 | 22.5084 | 21.5824 | | resnext50_32x4d | 8 | 0.847 | 5.3723 | 7.2545 | 69.6596 | 21.6468 | 21.079 | | hf_Albert | 8 | 1.0311 | 6.3773 | 9.1891 | nan | 21.4849 | 20.498 | | hf_Bert | 4 | 1.3563 | 6.7947 | 9.4768 | nan | 20.368 | 19.4672 | | hf_GPT2 | 4 | 1.3115 | 6.6371 | 9.7001 | 76.3586 | 20.2952 | 19.6159 | | shufflenet_v2_x1_0 | 128 | 0.882 | 5.8295 | 7.9703 | 78.1541 | 18.7075 | 18.1636 | | Background_Matting | 4 | 0.7933 | 5.012 | 7.1829 | 60.5559 | 17.7687 | 17.4128 | | Super_SloMo | 6 | 0.9754 | 5.2045 | 6.9717 | 33.6637 | 17.6967 | 17.0155 | | 
mobilenet_v2 | 96 | 0.7487 | 4.8924 | 7.0243 | nan | 17.1443 | 16.7214 | | functorch_dp_cifar10 | 64 | 0.3443 | 2.2405 | 3.0641 | nan | 16.7882 | 16.7369 | | resnet18 | 16 | 0.3911 | 2.0415 | 2.8493 | 29.206 | 14.9346 | 14.2941 | | hf_DistilBert | 8 | 0.4555 | 3.2837 | 6.4668 | nan | 14.3451 | 13.6731 | | pytorch_unet | 1 | 0.416 | 2.263 | 3.1963 | 26.2599 | 8.6129 | 8.0823 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.3652 | 2.4473 | 3.2338 | 24.3429 | 8.5991 | 8.2992 | | LearningToPaint | 96 | 0.4132 | 2.1193 | 3.1342 | 37.4176 | 7.4992 | 7.1922 | | squeezenet1_1 | 32 | 0.2163 | 1.0161 | 1.4466 | 5.0694 | 4.2764 | 3.9868 | | drq | 1 | 0.1394 | 0.4791 | 0.797 | nan | 4.1643 | 3.2909 | | vgg16 | 64 | 0.1754 | 0.6738 | 1.0691 | 3.9903 | 3.6631 | 3.4161 | | nvidia_deeprecommender | 256 | 0.1936 | 0.4571 | 0.7183 | 6.4774 | 3.5211 | 3.2255 | | soft_actor_critic | 256 | 0.1992 | 0.3528 | 0.5766 | 2.4802 | 3.1176 | 2.8313 | | alexnet | 128 | 0.1442 | 0.4357 | 0.6885 | 3.6951 | 3.0013 | 2.7489 | | dcgan | 32 | 0.1721 | 0.4693 | 0.6956 | 4.2809 | 2.7171 | 2.4961 | | lennard_jones | 1000 | 0.136 | 0.311 | 0.4595 | 2.208 | 2.0041 | 1.8074 | | tts_angular | 64 | 0.2055 | 0.2722 | 0.3998 | 1.0545 | 1.9346 | 1.7801 | | demucs | 4 | 0.3052 | 0.3215 | 0.3085 | 0.3066 | 0.2242 | 0.2163 | | tacotron2 | 64 | 17.4194 | 32.0181 | nan | nan | nan | 69.6038 | | hf_GPT2_large | 4 | 4.9188 | 20.7272 | nan | 281.7404 | nan | 48.0873 | | hf_T5 | 8 | 2.2572 | 10.3421 | nan | nan | nan | 29.12 | | hf_Longformer | 2 | 6.0671 | 15.1734 | 81.9406 | nan | nan | nan | | BERT_pytorch | 0 | nan | nan | nan | nan | nan | nan | | dlrm | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientdet | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientnet | 0 | nan | nan | nan | nan | nan | nan | | timm_regnet | 0 | nan | nan | nan | nan | nan | nan | | timm_vovnet | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9961 | 0.8279 | nan | 0.8271 | 1.5828 | 1.5828 | | resnet50_quantized_qat | 32 | 0.9971 | 0.9148 | nan | 0.8498 | 1.4863 | 1.4864 | | Super_SloMo | 6 | 1.0024 | 0.9527 | 0.363 | 0.9527 | 1.2029 | 1.4002 | | mobilenet_v2 | 96 | 0.9923 | 0.7624 | 0.3061 | nan | 1.1741 | 1.2826 | | squeezenet1_1 | 32 | 0.9781 | 0.8163 | 0.3371 | 0.8132 | 1.0821 | 1.1897 | | speech_transformer | 32 | 0.9997 | 0.9144 | 0.2704 | nan | 1.0385 | 1.042 | | timm_nfnet | 128 | 0.9358 | 0.8937 | nan | 0.879 | 1.0221 | 1.096 | | demucs | 4 | 0.9884 | 0.9884 | 0.9884 | 0.9888 | 0.9888 | 0.9884 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | Background_Matting | 4 | 0.9989 | 0.9483 | 0.3594 | 0.9323 | 0.9822 | 1.0384 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.35 | 0.8142 | 0.9812 | 1.0425 | | hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3702 | 0.8845 | 0.9703 | 1.1374 | | yolov3 | 16 | 0.9893 | 0.8384 | 0.3319 | 0.8042 | 0.9175 | 1.098 | | pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8521 | 0.9118 | 1.105 | | pytorch_stargan | 16 | 0.9966 | 1.009 | 0.4104 | nan | 0.9015 | 1.0542 | | timm_resnest | 32 | 0.9926 | 0.8759 | 0.3223 | 0.7295 | 0.8947 | 0.9967 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9976 | 0.9106 | 0.3929 | 0.8949 | 0.8868 | 1.0148 | | hf_Albert | 8 | 0.9333 | 0.9333 | 0.2846 | nan | 0.8836 | 1.2215 | | mobilenet_v3_large | 32 | 0.9876 | 0.856 | 0.3277 | 0.7754 | 0.8832 | 
0.8974 | | timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 | | densenet121 | 4 | 1.0 | 0.8879 | 0.3462 | 0.8612 | 0.8616 | 1.006 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8613 | 0.922 | | resnet50 | 32 | 0.9945 | 0.8704 | 0.3364 | 0.7953 | 0.8552 | 0.9335 | | mnasnet1_0 | 32 | 0.9878 | 0.8992 | 0.3334 | 0.8256 | 0.8532 | 0.8671 | | hf_Bart | 4 | 0.9617 | 0.8777 | nan | nan | 0.8504 | 1.1284 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | nan | 0.8354 | 1.1229 | | resnext50_32x4d | 8 | 0.9961 | 0.8679 | 0.3587 | 0.8198 | 0.8278 | 0.8346 | | hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4303 | nan | 0.8211 | 1.0392 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.8875 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | nan | 0.7632 | 0.8778 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3307 | nan | 0.7507 | 0.8214 | | soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9304 | 0.75 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7451 | 0.743 | 0.8335 | | hf_Bert | 4 | 0.9683 | 0.9011 | 0.3525 | nan | 0.7061 | 1.0016 | | LearningToPaint | 96 | 0.9435 | 0.6929 | 0.3399 | 0.627 | 0.6926 | 0.9353 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6949 | 0.6902 | 0.7049 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6638 | 0.6637 | 0.9553 | | hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3212 | nan | 0.6595 | 0.9466 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | nan | 0.4867 | 0.6781 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.4335 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4456 | nan | 0.4056 | 0.4214 | | hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.299 | 0.9882 | | tacotron2 | 64 | 0.9906 | 1.0302 | nan | nan | nan | 1.1621 | | hf_T5 | 8 | 0.9527 | 0.9415 | nan | nan | nan | 1.1439 | | 
hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | 0.876 | nan | 1.1258 | | hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2945 | nan | nan | nan | | BERT_pytorch | 0 | nan | nan | nan | nan | nan | nan | | dlrm | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientdet | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientnet | 0 | nan | nan | nan | nan | nan | nan | | timm_regnet | 0 | nan | nan | nan | nan | nan | nan | | timm_vovnet | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

huggingface suite with float32 precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | YituTechConvBert | 1 | 1.0267 | 0.8959 | 0.0 | 0.0 | 3.6869 | 1.4577 | | MT5ForConditionalGeneration | 8 | 1.0241 | 0.8803 | 0.0 | 0.0 | 2.6304 | 1.9777 | | CamemBert | 1 | 1.0465 | 0.9322 | 1.3382 | 0.0 | 2.4303 | 1.5183 | | DistillGPT2 | 1 | 1.0322 | 0.9114 | 1.0592 | 0.2897 | 2.116 | 1.8065 | | GoogleFnet | 1 | 1.0018 | 0.8065 | 0.9757 | 0.0 | 1.9193 | 1.1053 | | MobileBertForMaskedLM | 32 | 1.0215 | 0.9091 | 0.0 | 0.0 | 1.905 | 1.4607 | | GPT2ForSequenceClassification | 4 | 1.0004 | 0.9778 | 0.0 | 0.6655 | 1.7892 | 1.773 | | MobileBertForQuestionAnswering | 64 | 1.0217 | 0.9163 | 0.0 | 0.0 | 1.4585 | 1.3185 | | ElectraForQuestionAnswering | 64 | 1.0003 | 0.9836 | 0.0 | 0.0 | 1.4261 | 1.407 | | M2M100ForConditionalGeneration | 8 | 1.0103 | 0.9016 | 0.8731 | 0.0 | 1.4195 | 1.4986 | | ElectraForCausalLM | 32 | 1.0003 | 0.9331 | 0.0 | 0.0 | 1.4123 | 1.4497 | | T5Small | 1 | 1.0232 | 0.8748 | 0.0 | 0.0 | 1.3289 | 1.2052 | | LayoutLMForSequenceClassification | 16 | 1.0002 | 0.9896 | 0.738 | 0.0 | 1.307 | 1.2887 | | AlbertForQuestionAnswering | 4 | 1.0001 | 1.0009 | 0.0 | 0.0 | 1.2553 | 1.2506 | | AlbertForMaskedLM | 4 | 1.0001 | 1.0 | 0.0 | 0.0 | 1.2497 | 1.2482 | | PLBartForConditionalGeneration | 16 | 1.015 | 0.9687 | 0.0 | 0.0 | 1.2173 | 1.2107 | | LayoutLMForMaskedLM | 16 | 1.0001 | 0.9709 | 0.0 | 0.0 | 1.2095 | 1.2076 | | OPTForCausalLM | 32 | 1.0027 | 0.9326 | 0.0 | 0.0 | 1.1854 | 1.2095 | | XGLMForCausalLM | 8 | 1.014 | 0.9387 | 0.737 | 0.0 | 1.1773 | 1.1914 | | DistilBertForQuestionAnswering | 64 | 0.9996 | 0.9692 | 0.7131 | 0.0 | 1.1707 | 1.1496 | | 
RobertaForCausalLM | 64 | 1.0006 | 0.9622 | 0.7463 | 0.0 | 1.1446 | 1.1528 | | MegatronBertForQuestionAnswering | 16 | 1.0377 | 1.0069 | 0.7616 | 0.0 | 1.1416 | 1.1256 | | MegatronBertForCausalLM | 16 | 1.0349 | 1.0041 | 0.7474 | 0.0 | 1.1393 | 1.1226 | | Speech2Text2ForCausalLM | 128 | 0.9987 | 0.9279 | 0.662 | 0.0 | 1.1248 | 1.1433 | | BertForQuestionAnswering | 128 | 0.9999 | 0.9942 | 0.0 | 0.0 | 1.1132 | 1.1034 | | RobertaForQuestionAnswering | 128 | 0.9997 | 0.9931 | 0.0 | 0.0 | 1.111 | 1.103 | | BartForCausalLM | 4 | 1.0008 | 0.9664 | 0.0 | 0.0 | 1.0999 | 1.1101 | | BigBird | 1 | 0.9899 | 0.9377 | 1.0038 | 0.0 | 1.0994 | 0.999 | | PegasusForConditionalGeneration | 16 | 1.0092 | 0.9823 | 0.7669 | 0.0 | 1.0973 | 1.0856 | | BartForConditionalGeneration | 2 | 0.9999 | 0.988 | 0.0 | 0.0 | 1.0958 | 1.0886 | | MBartForConditionalGeneration | 16 | 1.0128 | 0.9869 | 0.0 | 0.0 | 1.0954 | 1.0797 | | DebertaForMaskedLM | 4 | 0.9259 | 0.795 | 0.7351 | 0.5918 | 1.0896 | 1.0669 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0009 | 0.9406 | 0.0 | 0.0 | 1.0645 | 1.0709 | | BertForMaskedLM | 64 | 1.0002 | 0.961 | 0.7261 | 0.0 | 1.0598 | 1.0657 | | DebertaForQuestionAnswering | 8 | 0.9968 | 0.9918 | 0.6839 | 0.8023 | 1.053 | 1.2217 | | DistilBertForMaskedLM | 64 | 0.9999 | 0.9517 | 0.7116 | 0.0 | 1.0514 | 1.0685 | | PLBartForCausalLM | 32 | 1.0051 | 0.9356 | 0.7157 | 0.0 | 1.0288 | 1.0521 | | T5ForConditionalGeneration | 4 | 1.0001 | 0.8107 | 0.0 | 0.0 | 1.0186 | 1.0159 | | BlenderbotSmallForCausalLM | 64 | 1.0017 | 0.9104 | 0.6831 | 0.0 | 1.0078 | 1.046 | | TrOCRForCausalLM | 32 | 1.0009 | 0.9497 | 0.0 | 0.0 | 1.0034 | 1.0162 | | MBartForCausalLM | 32 | 1.0017 | 0.9549 | 0.0 | 0.0 | 0.9991 | 1.0098 | | PegasusForCausalLM | 32 | 0.9997 | 0.953 | 0.7336 | 0.0 | 0.9911 | 1.0019 | | AllenaiLongformerBase | 1 | 0.9451 | 0.8571 | 0.7811 | 0.0 | 0.0 | 0.0 | 
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | 
TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass | | OPTForCausalLM | 1 | pass | pass | fail_to_run | pass | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass | | CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass | | AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | MBartForConditionalGeneration | 1 | 
pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaForQuestionAnswering | 8 | 4.6034 | 11.7597 | 46.1944 | 88.959 | 105.1824 | 37.3386 | | DebertaForMaskedLM | 4 | 4.7864 | 11.7235 | 45.4473 | 95.6373 | 102.0889 | 35.2126 | | XGLMForCausalLM | 8 | 2.2984 | 13.4049 | 41.1791 | nan | 93.9752 | 91.9221 | | M2M100ForConditionalGeneration | 8 | 2.8692 | 16.4204 | 26.2198 | nan | 74.1513 | 59.456 | | MobileBertForMaskedLM | 32 | 8.0772 | 29.9958 | nan | nan | 58.587 | 57.0843 | | MobileBertForQuestionAnswering | 64 | 8.1438 | 30.0607 | nan | nan | 57.8288 | 55.124 | | YituTechConvBert | 1 | 2.1342 | 10.6467 | nan | nan | 53.6193 | 51.8091 | | PegasusForConditionalGeneration | 16 | 2.6707 | 16.2343 | 25.6387 | nan | 48.3685 | 44.7784 | | BartForConditionalGeneration | 2 | 2.896 | 16.7112 | nan | nan | 48.3418 | 46.4353 | | MBartForConditionalGeneration | 16 | 2.8212 | 16.8195 | nan | nan | 47.7283 | 45.8726 | | BigBird | 1 | 7.3606 | 14.4673 | 30.3105 | nan | 40.9783 | 26.9934 | | MT5ForConditionalGeneration | 8 | 3.5678 | 14.9534 | nan | nan | 38.9148 | 38.3995 | | MegatronBertForCausalLM | 16 | 3.0746 | 13.8972 | 20.7444 | nan | 37.0113 | 35.817 | | T5Small | 1 | 2.5089 | 11.0442 | nan | nan | 36.4878 | 36.2842 | | MegatronBertForQuestionAnswering | 16 | 3.0927 | 13.8919 | 20.3408 
| nan | 35.79 | 34.87 | | LayoutLMForSequenceClassification | 16 | 1.714 | 7.2499 | 10.746 | nan | 34.8362 | 34.2751 | | BlenderbotSmallForConditionalGeneration | 64 | 1.7961 | 10.965 | nan | nan | 34.241 | 32.5713 | | T5ForConditionalGeneration | 4 | 2.3911 | 10.6689 | nan | nan | 31.4236 | 30.5327 | | PLBartForConditionalGeneration | 16 | 1.4236 | 8.6819 | nan | nan | 30.7748 | 30.034 | | ElectraForCausalLM | 32 | 1.3923 | 6.8886 | nan | nan | 29.5401 | 26.9027 | | PegasusForCausalLM | 32 | 1.0273 | 6.2578 | 9.6624 | nan | 24.0143 | 22.4895 | | LayoutLMForMaskedLM | 16 | 1.6449 | 7.2994 | nan | nan | 23.1539 | 22.1532 | | ElectraForQuestionAnswering | 64 | 1.3324 | 7.3194 | nan | nan | 22.8632 | 21.4263 | | MBartForCausalLM | 32 | 1.0726 | 6.3111 | nan | nan | 22.857 | 22.6262 | | BertForMaskedLM | 64 | 1.3561 | 6.7907 | 9.7817 | nan | 22.6856 | 21.7862 | | GoogleFnet | 1 | 0.7932 | 3.5298 | 10.6534 | nan | 22.2475 | 14.7056 | | BartForCausalLM | 4 | 1.0526 | 6.2225 | nan | nan | 21.8942 | 20.5408 | | TrOCRForCausalLM | 32 | 0.9754 | 6.4149 | nan | nan | 21.8763 | 21.2721 | | RobertaForCausalLM | 64 | 1.3915 | 6.9382 | 10.0146 | nan | 21.5907 | 20.2026 | | BertForQuestionAnswering | 128 | 1.3763 | 6.7528 | nan | nan | 21.4848 | 20.668 | | CamemBert | 1 | 1.409 | 7.0806 | 9.4529 | nan | 20.7983 | 19.6505 | | RobertaForQuestionAnswering | 128 | 1.3727 | 7.0594 | nan | nan | 20.1625 | 18.9711 | | OPTForCausalLM | 32 | 1.0811 | 6.3594 | nan | nan | 19.739 | 19.1144 | | GPT2ForSequenceClassification | 4 | 1.343 | 7.2341 | nan | 76.8789 | 18.7462 | 17.9655 | | AlbertForMaskedLM | 4 | 1.1366 | 6.5385 | nan | nan | 18.034 | 17.1896 | | BlenderbotSmallForCausalLM | 64 | 0.6175 | 4.1573 | 6.1707 | nan | 17.8778 | 18.1785 | | AlbertForQuestionAnswering | 4 | 1.142 | 6.6617 | nan | nan | 17.8668 | 16.7509 | | DistillGPT2 | 1 | 0.6665 | 3.3544 | 4.5135 | 42.8317 | 17.094 | 16.9368 | | Speech2Text2ForCausalLM | 128 | 0.5809 | 3.2784 | 5.021 | nan | 16.1798 | 15.3569 | | 
PLBartForCausalLM | 32 | 0.5077 | 3.2569 | 4.6715 | nan | 15.2233 | 14.8182 | | DistilBertForMaskedLM | 64 | 0.4965 | 3.2823 | 6.4371 | nan | 13.2136 | 12.7931 | | DistilBertForQuestionAnswering | 64 | 0.5193 | 3.3775 | 6.3649 | nan | 12.5547 | 12.4208 | | AllenaiLongformerBase | 1 | 6.217 | 15.4057 | 81.2496 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 0.8819 | 1.0595 | 1.1224 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | nan | 0.8646 | 1.4039 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9643 | 0.3704 | nan | 0.8436 | 1.0204 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | nan | 0.842 | 1.3737 | | BigBird | 1 | 0.999 | 0.9539 | 0.4209 | nan | 0.8224 | 1.0095 | | XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | nan | 0.8157 | 0.9642 | | DistillGPT2 | 1 | 0.9984 | 0.8115 | 0.3774 | 0.7597 | 0.807 | 0.9258 | | T5Small | 1 | 1.0 | 0.8947 | nan | nan | 0.7934 | 1.0493 | | ElectraForCausalLM | 32 | 0.9983 | 0.883 | nan | nan | 0.7929 | 0.9036 | | YituTechConvBert | 1 | 0.9858 | 0.8573 | nan | nan | 0.7901 | 0.8727 | | BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | nan | 0.7817 | 0.9515 | | PegasusForCausalLM | 32 | 0.9593 | 0.9232 | 0.3909 | nan | 0.7774 | 0.9692 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.7711 | 1.1049 | | GoogleFnet | 1 | 0.9983 | 0.9453 | 0.3714 | nan | 0.7698 | 0.9373 | | M2M100ForConditionalGeneration | 
8 | 0.973 | 0.9393 | 0.3738 | nan | 0.7621 | 1.0278 | | MT5ForConditionalGeneration | 8 | 1.0034 | 0.8862 | nan | nan | 0.7603 | 0.9397 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | nan | 0.7528 | 0.9646 | | CamemBert | 1 | 0.998 | 0.8252 | 0.3615 | nan | 0.7487 | 0.9184 | | PLBartForConditionalGeneration | 16 | 1.0 | 0.8957 | nan | nan | 0.7397 | 0.9638 | | PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | nan | 0.7381 | 0.9055 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | nan | nan | 0.7209 | 0.9059 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | nan | 0.7189 | 1.0294 | | MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | nan | 0.7161 | 0.9247 | | BartForCausalLM | 4 | 1.0 | 0.9121 | nan | nan | 0.7149 | 0.9466 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | nan | 0.7147 | 0.8647 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | 0.7054 | 1.0298 | | DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3177 | nan | 0.6981 | 0.9303 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | nan | 0.6977 | 0.946 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | 0.695 | 0.9772 | | MBartForCausalLM | 32 | 0.9999 | 0.89 | nan | nan | 0.6836 | 0.8978 | | TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | nan | nan | 0.6827 | 0.8876 | | Speech2Text2ForCausalLM | 128 | 0.9552 | 0.842 | 0.3524 | nan | 0.6775 | 0.9179 | | OPTForCausalLM | 32 | 0.9982 | 0.8656 | nan | nan | 0.6761 | 0.8847 | | DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | nan | 0.6531 | 0.9124 | | BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | nan | 0.6385 | 0.8992 | | RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3642 | nan | 0.6375 | 0.8974 | | RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 | | BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.8355 | nan | nan | 0.4998 | 0.6646 | | 
MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | nan | nan | 0.4536 | 0.5968 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | 0.8282 | 0.3862 | 1.0347 | | DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.063 | 0.2902 | 1.1588 | | AllenaiLongformerBase | 1 | 0.9981 | 0.9515 | 0.321 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | dm_nfnet_f0 | 128 | 0.9999 | 1.0007 | 0.0 | 1.2438 | 1.4739 | 1.4243 | | convnext_base | 64 | 0.9999 | 0.9993 | 0.0 | 0.0 | 1.4678 | 1.4577 | | hrnet_w18 | 128 | 1.0001 | 0.9994 | 0.0 | 0.0 | 1.4161 | 1.3709 | | volo_d1_224 | 64 | 1.0001 | 0.9961 | 0.0 | 0.0 | 1.3852 | 1.3623 | | dla102 | 128 | 0.9999 | 1.0005 | 0.0 | 0.0 | 1.3831 | 1.3688 | | nfnet_l0 | 128 | 0.9998 | 0.7888 | 0.0 | 1.2192 | 1.3734 | 1.3272 | | xcit_large_24_p8_224 | 5 | 1.0 | 0.9757 | 0.0 | 0.0 | 1.353 | 1.3168 | | res2net50_14w_8s | 128 | 0.9999 | 1.0003 | 0.0 | 1.2379 | 1.3512 | 1.3241 | | inception_v3 | 128 | 0.9999 | 0.9988 | 0.0 | 1.128 | 1.3282 | 1.3077 | | crossvit_9_240 | 128 | 0.9997 | 0.9995 | 0.0 | 0.0 | 1.3277 | 1.3024 | | adv_inception_v3 | 128 | 1.0001 | 0.9975 | 0.0 | 1.128 | 1.3271 | 1.3066 | | gluon_inception_v3 | 128 | 0.9999 | 0.9978 | 0.0 | 1.1283 | 1.3269 | 1.3078 | | resnest101e | 64 | 1.0001 | 1.0035 | 0.0 | 0.0 | 1.316 | 1.2709 | | res2next50 | 128 | 1.0001 | 1.0002 | 0.0 | 1.1708 | 1.3106 | 1.2745 | | jx_nest_base | 32 | 0.9998 | 0.9956 | 0.0 | 0.0 | 1.2768 | 1.2506 | | coat_lite_mini | 128 | 1.0 | 0.9862 | 0.8512 | 0.0 | 1.2732 | 1.2625 | | selecsls42b | 128 | 0.9999 | 1.0005 | 0.8166 | 1.2128 | 1.2678 | 1.2519 | | convit_base | 64 | 0.9996 | 0.999 | 0.0 | 0.0 | 1.2343 | 1.2013 | | gmixer_24_224 | 128 | 1.0 | 0.8104 | 0.0 | 0.0 | 1.2342 | 1.2398 | | res2net101_26w_4s | 64 | 0.9998 | 0.9991 | 0.7751 | 1.1538 | 1.2263 | 1.1885 | | gmlp_s16_224 | 128 | 0.9999 | 0.9503 | 0.0 | 0.0 | 1.2023 | 1.1859 | | twins_pcpvt_base | 64 | 0.9995 | 0.9988 | 0.7552 | 0.0 | 1.2005 | 1.1696 | | 
pit_b_224 | 64 | 0.9998 | 0.9995 | 0.0 | 0.0 | 1.1877 | 1.1764 | | cait_m36_384 | 4 | 0.9997 | 1.0272 | 0.0 | 0.0 | 1.1826 | 1.1574 | | poolformer_m36 | 64 | 0.9999 | 0.9993 | 0.0 | 0.0 | 1.1664 | 1.1478 | | swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9798 | 0.0 | 0.0 | 1.1404 | 1.1251 | | beit_base_patch16_224 | 64 | 0.9998 | 0.9822 | 0.0 | 0.0 | 1.1113 | 1.1002 | | swsl_resnext101_32x16d | 32 | 1.0 | 1.0001 | 0.0 | 0.0 | 1.1072 | 1.0712 | | deit_base_distilled_patch16_224 | 64 | 1.0001 | 0.9994 | 0.7713 | 0.0 | 1.0921 | 1.08 | | gluon_xception65 | 32 | 0.9997 | 0.9973 | 0.0 | 0.0 | 1.0871 | 1.0742 | | vit_base_patch16_224 | 64 | 0.9996 | 0.9984 | 0.7685 | 0.0 | 1.0851 | 1.0727 | | convmixer_768_32 | 32 | 0.9999 | 1.0 | 0.0 | 1.0627 | 1.0769 | 1.0738 | | mixer_b16_224 | 128 | 0.9993 | 0.9803 | 0.0 | 0.0 | 1.072 | 1.0633 | | visformer_small | 128 | 1.0002 | 1.0024 | 0.8001 | 1.0469 | 1.0456 | 1.0136 | | resmlp_12_224 | 128 | 0.9999 | 0.8544 | 0.6127 | 0.0 | 0.8595 | 0.794 | | mnasnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tf_mixnet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tf_efficientnet_b0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | spnasnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | sebotnet33ts_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | rexnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | repvgg_a2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | regnety_002 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | pnasnet5large | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilevit_s | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilenetv3_large_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilenetv2_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tnt_s_patch16_224 | 128 | 0.9999 | 0.9991 | 0.0 | 0.0 | 0.0 | 1.5414 | | mixnet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | lcnet_050 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | ghostnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | gernet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | fbnetv3_b | 0 | 
0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | fbnetc_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | ese_vovnet19b_dw | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | eca_halonext26ts | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | eca_botnext26ts_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | dpn107 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | cspdarknet53 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | botnet26t_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tinynet_a | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | pit_b_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass 
| fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | crossvit_9_240 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | gmlp_s16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | fail_accuracy | | botnet26t_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass 
| pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | resnest101e | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 5.9484 | 35.2863 | nan | nan | 108.0315 | 106.2714 | | 
swin_base_patch4_window7_224 | 64 | 2.5701 | 13.629 | nan | nan | 94.5532 | 91.6968 | | twins_pcpvt_base | 64 | 2.0451 | 14.2419 | 22.8996 | nan | 90.657 | 87.8947 | | xcit_large_24_p8_224 | 5 | 2.7179 | 19.3408 | nan | nan | 88.034 | 85.905 | | convnext_base | 64 | 1.2998 | 6.9446 | nan | nan | 74.5412 | 73.5834 | | cait_m36_384 | 4 | 2.7501 | 19.8564 | nan | nan | 73.2553 | 70.9807 | | jx_nest_base | 32 | 1.6444 | 10.9105 | nan | nan | 70.3884 | 68.2462 | | resnest101e | 64 | 3.043 | 18.1436 | nan | nan | 68.0995 | 66.3658 | | coat_lite_mini | 128 | 1.0013 | 5.6017 | 8.4255 | nan | 65.4974 | 65.5086 | | res2net101_26w_4s | 64 | 2.9053 | 18.8474 | 29.6267 | 332.4276 | 57.281 | 54.3212 | | res2net50_14w_8s | 128 | 2.6213 | 16.8193 | nan | 312.1092 | 54.2768 | 50.7738 | | poolformer_m36 | 64 | 1.8002 | 10.5725 | nan | nan | 49.7294 | 45.7485 | | gmlp_s16_224 | 128 | 0.9797 | 7.0105 | nan | nan | 47.1773 | 46.4153 | | crossvit_9_240 | 128 | 1.3723 | 8.9236 | nan | nan | 44.8703 | 43.3053 | | gluon_xception65 | 32 | 1.7554 | 12.089 | nan | nan | 41.4818 | 38.4695 | | volo_d1_224 | 64 | 1.2023 | 8.4574 | nan | nan | 40.9391 | 38.816 | | gmixer_24_224 | 128 | 1.0626 | 8.1593 | nan | nan | 35.9423 | 34.9184 | | adv_inception_v3 | 128 | 1.5093 | 9.5638 | nan | 131.3115 | 35.8836 | 33.4975 | | swsl_resnext101_32x16d | 32 | 1.6754 | 10.9739 | nan | nan | 35.501 | 33.8114 | | gluon_inception_v3 | 128 | 1.5162 | 9.7713 | nan | 128.9065 | 35.159 | 33.5405 | | inception_v3 | 128 | 1.4779 | 9.7043 | nan | 131.0541 | 34.8808 | 33.5684 | | dla102 | 128 | 1.7254 | 11.003 | nan | nan | 33.1722 | 31.7389 | | convit_base | 64 | 1.0413 | 6.5458 | nan | nan | 32.1138 | 30.5407 | | dm_nfnet_f0 | 128 | 1.9665 | 8.2352 | nan | 146.5402 | 30.7749 | 29.4791 | | res2next50 | 128 | 1.5002 | 9.4677 | nan | 163.6576 | 30.1356 | 28.5072 | | convmixer_768_32 | 32 | 1.1232 | 7.0713 | nan | 82.4726 | 26.4487 | 26.1778 | | visformer_small | 128 | 0.812 | 4.5267 | 6.3618 | 59.8554 | 26.3332 | 25.4159 
| | resmlp_12_224 | 128 | 0.631 | 3.1262 | 5.8599 | nan | 25.1774 | 26.0094 | | mixer_b16_224 | 128 | 0.6756 | 3.5733 | nan | nan | 24.9493 | 24.0081 | | nfnet_l0 | 128 | 1.6754 | 8.1812 | nan | 129.794 | 22.9681 | 21.8652 | | beit_base_patch16_224 | 64 | 1.1046 | 5.7235 | nan | nan | 22.9281 | 20.6227 | | deit_base_distilled_patch16_224 | 64 | 0.8276 | 4.7058 | 7.1412 | nan | 22.3523 | 21.4466 | | vit_base_patch16_224 | 64 | 0.8379 | 4.8114 | 6.9289 | nan | 22.2391 | 21.1189 | | pit_b_224 | 64 | 0.8849 | 5.3584 | nan | nan | 19.7707 | 18.5814 | | selecsls42b | 128 | 0.7933 | 4.3505 | 6.2699 | 67.7105 | 17.0018 | 15.9934 | | tnt_s_patch16_224 | 128 | 1.5735 | 11.2991 | nan | nan | nan | 36.3205 | | botnet26t_256 | 0 | nan | nan | nan | nan | nan | nan | | cspdarknet53 | 0 | nan | nan | nan | nan | nan | nan | | dpn107 | 0 | nan | nan | nan | nan | nan | nan | | eca_botnext26ts_256 | 0 | nan | nan | nan | nan | nan | nan | | eca_halonext26ts | 0 | nan | nan | nan | nan | nan | nan | | ese_vovnet19b_dw | 0 | nan | nan | nan | nan | nan | nan | | fbnetc_100 | 0 | nan | nan | nan | nan | nan | nan | | fbnetv3_b | 0 | nan | nan | nan | nan | nan | nan | | gernet_l | 0 | nan | nan | nan | nan | nan | nan | | ghostnet_100 | 0 | nan | nan | nan | nan | nan | nan | | lcnet_050 | 0 | nan | nan | nan | nan | nan | nan | | mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | mnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv2_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv3_large_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilevit_s | 0 | nan | nan | nan | nan | nan | nan | | pnasnet5large | 0 | nan | nan | nan | nan | nan | nan | | regnety_002 | 0 | nan | nan | nan | nan | nan | nan | | repvgg_a2 | 0 | nan | nan | nan | nan | nan | nan | | rexnet_100 | 0 | nan | nan | nan | nan | nan | nan | | sebotnet33ts_256 | 0 | nan | nan | nan | nan | nan | nan | | spnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | tf_efficientnet_b0 | 0 | nan | nan 
| nan | nan | nan | nan | | tf_mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | tinynet_a | 0 | nan | nan | nan | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | nan | 1.5552 | 1.6267 | | nfnet_l0 | 128 | 0.993 | 0.8275 | nan | 0.8271 | 1.2905 | 1.4934 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1185 | 1.1746 | | poolformer_m36 | 64 | 0.9983 | 0.9509 | nan | nan | 1.0522 | 1.0698 | | dm_nfnet_f0 | 128 | 0.9357 | 0.894 | nan | 0.8793 | 1.0221 | 1.0963 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | nan | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | nan | 0.997 | 1.0835 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | nan | 0.9925 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3132 | nan | 0.9923 | 1.0857 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | 0.9792 | 0.9847 | 0.9968 | | volo_d1_224 | 64 | 0.996 | 0.9213 | nan | nan | 0.9837 | 1.0658 | | mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | nan | 0.9827 | 1.0538 | | gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | nan | 0.9766 | 0.9827 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | nan | nan | 0.9633 | 1.0572 | | dla102 | 128 | 0.9828 | 0.9169 | nan | nan | 0.9625 | 1.0421 | | gluon_xception65 | 32 | 0.9975 | 0.9358 | nan | nan | 0.9412 | 0.9929 | | hrnet_w18 | 128 | 0.9955 | 0.9252 | nan | nan | 0.9382 | 1.0121 
| | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | nan | 0.9348 | 1.0604 | | res2net101_26w_4s | 64 | 0.9967 | 0.9278 | 0.3243 | 0.8768 | 0.93 | 1.0167 | | gluon_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.9139 | 1.063 | | inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.9139 | 1.063 | | adv_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.9138 | 1.063 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.9127 | 0.9981 | | res2next50 | 128 | 0.9955 | 0.9149 | nan | 0.8461 | 0.9075 | 1.0161 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9069 | 1.0516 | | visformer_small | 128 | 0.9944 | 0.9374 | 0.3291 | 0.9282 | 0.9029 | 0.9934 | | selecsls42b | 128 | 0.9885 | 0.8897 | 0.337 | 0.8775 | 0.8987 | 1.0049 | | swsl_resnext101_32x16d | 32 | 0.9992 | 0.8965 | nan | nan | 0.8912 | 0.9925 | | res2net50_14w_8s | 128 | 0.995 | 0.9047 | nan | 0.8422 | 0.8821 | 1.0211 | | pit_b_224 | 64 | 0.9968 | 0.7946 | nan | nan | 0.8563 | 1.0753 | | coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | nan | 0.8208 | 1.0244 | | resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | nan | 0.7899 | 0.7979 | | convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8656 | nan | nan | 0.6584 | 0.8853 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8622 | | botnet26t_256 | 0 | nan | nan | nan | nan | nan | nan | | cspdarknet53 | 0 | nan | nan | nan | nan | nan | nan | | dpn107 | 0 | nan | nan | nan | nan | nan | nan | | eca_botnext26ts_256 | 0 | nan | nan | nan | nan | nan | nan | | eca_halonext26ts | 0 | nan | nan | nan | nan | nan | nan | | ese_vovnet19b_dw | 0 | nan | nan | nan | nan | nan | nan | | fbnetc_100 | 0 | nan | nan | nan | nan | nan | nan | | fbnetv3_b | 0 | nan | nan | nan | nan | nan | nan | | gernet_l | 0 | nan | nan | nan | nan | nan | nan | | ghostnet_100 | 0 | nan | nan | nan | nan | nan | nan | | lcnet_050 | 0 | nan | nan | nan | nan | nan | nan | | mixnet_l | 0 | nan | nan 
| nan | nan | nan | nan | | mnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv2_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv3_large_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilevit_s | 0 | nan | nan | nan | nan | nan | nan | | pnasnet5large | 0 | nan | nan | nan | nan | nan | nan | | regnety_002 | 0 | nan | nan | nan | nan | nan | nan | | repvgg_a2 | 0 | nan | nan | nan | nan | nan | nan | | rexnet_100 | 0 | nan | nan | nan | nan | nan | nan | | sebotnet33ts_256 | 0 | nan | nan | nan | nan | nan | nan | | spnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | tf_efficientnet_b0 | 0 | nan | nan | nan | nan | nan | nan | | tf_mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | tinynet_a | 0 | nan | nan | nan | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/aYDBdLc.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/ho2lGdA.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/EqpLfxa.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring performance, compilation latency and memory footprint reduction, we exclude the models that fail the accuracy checks.
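As a rough illustration of the filtering and normalization described above, the sketch below drops models that did not pass accuracy and computes speedup as native latency divided by backend latency. The `Result` record, its field names, the latency values, and the choice to treat any status starting with `pass` (including `pass_due_to_skip`) as passing are all assumptions made for this example, not the dashboard's actual code.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical record; field names are assumptions for illustration.
    model: str
    accuracy: str           # e.g. "pass", "fail_accuracy", "fail_to_run"
    eager_latency: float    # seconds per iteration in native PyTorch
    backend_latency: float  # seconds per iteration with the backend


def speedups(results):
    """Return {model: speedup} for models whose accuracy status passes."""
    kept = {}
    for r in results:
        # Assumption: any status beginning with "pass" counts as passing.
        if not r.accuracy.startswith("pass"):
            continue
        kept[r.model] = r.eager_latency / r.backend_latency
    return kept


# Made-up numbers, not taken from the tables.
results = [
    Result("model_a", "pass", 0.10, 0.05),
    Result("model_b", "fail_accuracy", 0.20, 0.10),
    Result("model_c", "pass_due_to_skip", 0.30, 0.20),
]
kept = speedups(results)
print(kept)
```

With this convention a speedup above 1.0 means the backend is faster than native PyTorch, which matches how the speedup columns read.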

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 87%, 46/53 | 100%, 42/42 | 59%, 36/61  |
|       aot_eager        | 85%, 45/53 | 100%, 42/42 | 54%, 33/61  |
|     aot_cudagraphs     | 70%, 37/53 | 57%, 24/42  | 38%, 23/61  |
|    nvprims_nvfuser     | 17%, 9/53  |  5%, 2/42   |  2%, 1/61   |
|        inductor        | 75%, 40/53 | 93%, 39/42  | 57%, 35/61  |
| inductor_no_cudagraphs | 79%, 42/53 | 93%, 39/42  | 57%, 35/61  |
+------------------------+------------+-------------+-------------+
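
The passrate cells combine a rounded percentage with the raw pass count. A minimal sketch of producing that format is below; the function name and the rounding rule (nearest integer percent) are assumptions that happen to reproduce the cells in the table, not a description of the actual reporting script.

```python
def passrate(passed: int, total: int) -> str:
    """Format a cell like '87%, 46/53' from a pass count and a model count."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct = round(100 * passed / total)  # assumption: nearest integer percent
    return f"{pct}%, {passed}/{total}"


print(passrate(46, 53))
print(passrate(1, 61))
```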

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.13x    |    1.05x    |    1.00x    |
|    nvprims_nvfuser     |   1.00x    |    1.00x    |    1.00x    |
|        inductor        |   1.74x    |    1.79x    |    1.43x    |
| inductor_no_cudagraphs |   1.38x    |    1.54x    |    1.39x    |
+------------------------+------------+-------------+-------------+
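
For readers who want to recompute a geometric mean from the per-model speedup tables, here is a small sketch. It assumes that `0.0` entries mark runs that failed and should be left out; how the dashboard itself treats those entries is not stated here, so results may not match the table exactly.

```python
import math


def geomean(values):
    """Geometric mean of the positive entries; NaN if there are none."""
    # Assumption: non-positive entries (0.0 = failed run) are excluded.
    positive = [v for v in values if v > 0]
    if not positive:
        return float("nan")
    return math.exp(sum(math.log(v) for v in positive) / len(positive))


# Made-up speedups: the failed run (0.0) is ignored.
print(geomean([1.0, 4.0, 0.0]))
```

The geometric mean is used rather than the arithmetic mean because speedups are ratios, so a 2x gain and a 0.5x loss cancel out to 1x.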

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.98    |    2.70     |    1.94     |
|       aot_eager        |    8.38    |    13.52    |    12.71    |
|     aot_cudagraphs     |   12.66    |    24.30    |    20.37    |
|    nvprims_nvfuser     |   14.36    |    67.23    |   150.44    |
|        inductor        |   27.66    |    41.18    |    53.94    |
| inductor_no_cudagraphs |   28.45    |    35.92    |    51.63    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.95x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.89x    |    0.90x    |
|     aot_cudagraphs     |   0.42x    |    0.39x    |    0.33x    |
|    nvprims_nvfuser     |   0.75x    |    0.81x    |    0.52x    |
|        inductor        |   0.82x    |    0.91x    |    0.95x    |
| inductor_no_cudagraphs |   0.96x    |    1.08x    |    1.04x    |
+------------------------+------------+-------------+-------------+
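
One way to read the compression ratio, consistent with "higher is better", is native peak memory divided by the backend's peak memory, so a value below 1.0 means the backend used more memory than native PyTorch. That definition is an assumption for this sketch, as are the function name and the example sizes. In a real run the peak values might come from something like `torch.cuda.max_memory_allocated()`, which is likewise not stated in the summary.

```python
def compression_ratio(eager_peak_bytes: int, backend_peak_bytes: int) -> float:
    """Peak-memory compression ratio under the assumed definition eager / backend."""
    if backend_peak_bytes <= 0:
        raise ValueError("backend_peak_bytes must be positive")
    return eager_peak_bytes / backend_peak_bytes


GIB = 1024 ** 3
# Made-up sizes: the backend peaks at 10 GiB against 8 GiB natively.
ratio = compression_ratio(8 * GIB, 10 * GIB)
print(f"{ratio:.2f}x")
```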

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0017 | 0.8767 | 1.8679 | 0.0 | 5.3042 | 1.4661 | | functorch_dp_cifar10 | 64 | 1.0061 | 0.8799 | 1.679 | 0.0 | 3.8195 | 1.5143 | | timm_vision_transformer | 8 | 1.0008 | 0.8641 | 1.9078 | 0.0 | 3.1965 | 1.5384 | | resnext50_32x4d | 8 | 1.0011 | 0.9215 | 1.2971 | 0.0 | 2.7614 | 1.3831 | | hf_T5_large | 2 | 1.0192 | 0.8196 | 0.0 | 0.0 | 2.6494 | 1.9978 | | drq | 1 | 0.9969 | 0.7785 | 1.5917 | 0.0 | 2.5102 | 1.2002 | | mobilenet_v3_large | 32 | 1.006 | 1.0008 | 1.1968 | 0.0 | 2.4523 | 1.5536 | | hf_Albert | 8 | 1.0001 | 0.9545 | 0.7752 | 0.0 | 2.3829 | 2.3201 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9953 | 0.8866 | 1.3398 | 0.0 | 2.2025 | 1.5501 | | lennard_jones | 1000 | 0.9762 | 0.7537 | 1.2692 | 0.4508 | 2.1116 | 1.0589 | | mnasnet1_0 | 32 | 0.9987 | 1.005 | 1.0119 | 0.0 | 2.0908 | 1.4197 | | pytorch_struct | 200 | 0.989 | 0.7298 | 1.0118 | 0.6182 | 2.033 | 1.2781 | | hf_Bert | 4 | 1.0368 | 0.8475 | 0.9356 | 0.0 | 2.0209 | 1.8309 | | resnet18 | 16 | 1.0011 | 0.9982 | 1.1243 | 0.0 | 1.9727 | 1.3722 | | hf_GPT2 | 4 | 1.022 | 0.9841 | 0.0 | 0.3021 | 1.9372 | 1.9021 | | squeezenet1_1 | 32 | 1.0037 | 0.9432 | 1.0856 | 0.6873 | 1.8919 | 1.4817 | | timm_resnest | 32 | 0.9977 | 0.9875 | 0.8123 | 0.0 | 1.8488 | 1.7166 | | attention_is_all_you_need_pytorch | 256 | 1.0073 | 0.9065 | 0.0 | 0.0 | 1.8239 | 1.4793 | | hf_Bart | 4 | 1.0114 | 0.8285 | 0.0 | 0.0 | 1.7812 | 1.7182 | | dcgan | 32 | 0.9579 | 0.8837 | 1.1434 | 0.0 | 1.7167 | 1.1173 | | speech_transformer | 32 | 1.0024 | 0.8369 | 1.9409 | 0.0 | 1.6958 | 1.7016 | | soft_actor_critic | 256 | 0.981 | 0.7361 
| 1.2896 | 0.5251 | 1.6668 | 1.059 | | LearningToPaint | 96 | 0.9958 | 1.0013 | 0.9277 | 0.0 | 1.6019 | 1.4073 | | hf_T5 | 8 | 0.9992 | 0.8523 | 0.0 | 0.0 | 1.58 | 1.5825 | | mobilenet_v2 | 96 | 1.0001 | 0.99 | 0.7606 | 0.0 | 1.5596 | 1.5029 | | shufflenet_v2_x1_0 | 128 | 1.0015 | 1.0068 | 0.8745 | 0.0 | 1.5516 | 1.3886 | | fastNLP_Bert | 6 | 0.9994 | 0.9068 | 0.7655 | 0.0 | 1.5298 | 1.4822 | | hf_DistilBert | 8 | 1.0014 | 0.9744 | 0.7334 | 0.0 | 1.518 | 1.4826 | | timm_nfnet | 128 | 1.0001 | 0.9991 | 0.0 | 1.0887 | 1.5101 | 1.43 | | resnet50 | 32 | 1.0003 | 1.035 | 0.8076 | 0.0 | 1.3559 | 1.2889 | | pytorch_unet | 1 | 0.9994 | 0.9923 | 0.8618 | 0.0 | 1.3445 | 1.3152 | | Super_SloMo | 6 | 0.9994 | 0.9962 | 0.8847 | 0.0 | 1.2899 | 1.2596 | | vgg16 | 64 | 0.9997 | 0.9966 | 0.8571 | 0.9738 | 1.2732 | 1.2647 | | pytorch_stargan | 16 | 0.9976 | 1.0525 | 0.904 | 0.0 | 1.2683 | 1.1909 | | Background_Matting | 4 | 1.0007 | 1.0184 | 0.8917 | 0.0 | 1.237 | 1.2211 | | alexnet | 128 | 0.9984 | 0.997 | 0.815 | 0.934 | 1.2105 | 1.207 | | hf_Reformer | 4 | 0.9953 | 0.9906 | 0.9378 | 0.0 | 1.1642 | 1.1414 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9901 | 0.0 | 0.0 | 1.1594 | 1.1374 | | hf_BigBird | 2 | 0.9908 | 0.9147 | 1.0488 | 0.0 | 1.1477 | 1.0209 | | yolov3 | 16 | 0.9997 | 0.99 | 0.8043 | 0.0 | 1.0909 | 1.0661 | | tts_angular | 64 | 0.9821 | 0.941 | 0.985 | 0.9662 | 1.0075 | 1.0194 | | demucs | 4 | 0.9996 | 1.0009 | 0.9999 | 1.001 | 1.0007 | 0.9984 | | nvidia_deeprecommender | 256 | 0.9987 | 0.9962 | 0.6966 | 0.5491 | 0.9894 | 1.0311 | | hf_GPT2_large | 4 | 1.0006 | 0.9904 | 0.0 | 0.0 | 0.0 | 1.8668 | | tacotron2 | 64 | 0.9777 | 0.735 | 0.9471 | 0.0 | 0.0 | 0.8561 | | hf_Longformer | 2 | 0.9568 | 0.87 | 0.8883 | 0.0 | 0.0 | 0.0 | | dlrm | 2048 | 1.0793 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | BERT_pytorch | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_efficientdet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_efficientnet 
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_regnet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | timm_vovnet | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_to_run | 
pass | pass | | timm_resnest | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass | | hf_GPT2 | 2 | pass | pass | fail_to_run | pass | pass | pass | | drq | 1 | pass | pass | pass | fail_to_run | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | 
pass | pass | pass | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_accuracy | | yolov3 | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | pass | pass | pass | fail_to_run | fail_to_run | pass | | BERT_pytorch | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | dlrm | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | mobilenet_v3_large | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.9763 | 11.0229 | 14.8961 | nan | 409.9205 | 408.0533 | | hf_T5_large | 2 | 13.9433 | 52.7541 | nan | nan | 144.3417 | 138.3182 | | timm_vision_transformer_large | 8 | 2.7834 | 20.887 | nan | nan | 85.6191 | 84.278 | | timm_resnest | 32 | 0.6126 | 3.6155 | 4.9908 | nan | 59.5452 | 63.7731 | | densenet121 | 4 | 2.2486 | 17.4578 | 25.6092 | nan | 55.1542 | 52.3918 | | timm_vision_transformer | 8 | 1.0035 | 6.6466 | 8.7737 | nan | 52.0228 | 52.3662 | | hf_BigBird | 2 | 
8.3752 | 17.5081 | 37.5596 | nan | 50.2892 | 31.6033 | | attention_is_all_you_need_pytorch | 256 | 1.3055 | 9.9586 | nan | nan | 46.9763 | 46.9481 | | pytorch_stargan | 16 | 0.4174 | 2.9238 | 3.8744 | nan | 45.172 | 51.2935 | | hf_Bart | 4 | 1.7877 | 11.9029 | nan | nan | 36.2987 | 35.3166 | | pytorch_struct | 200 | 0.2831 | 1.1606 | 1.7987 | 8.3402 | 35.1987 | 35.6394 | | hf_T5 | 8 | 2.4134 | 12.6166 | nan | nan | 34.4725 | 33.0526 | | timm_nfnet | 128 | 2.0912 | 9.5724 | nan | 159.4507 | 32.7689 | 31.9833 | | mobilenet_v3_large | 32 | 0.9483 | 6.7013 | 9.1014 | nan | 32.4792 | 31.8483 | | fastNLP_Bert | 6 | 1.8171 | 9.804 | 14.2143 | nan | 32.1872 | 30.6062 | | speech_transformer | 32 | 1.9498 | 12.0083 | 84.0224 | nan | 30.5092 | 29.0719 | | hf_Reformer | 4 | 2.5293 | 5.3031 | 10.1018 | nan | 27.1454 | 21.8894 | | hf_Albert | 8 | 1.3823 | 8.9583 | 12.9382 | nan | 25.5438 | 24.4483 | | mnasnet1_0 | 32 | 0.8715 | 6.1488 | 8.4065 | nan | 25.2198 | 23.2334 | | resnet50 | 32 | 0.9185 | 6.5917 | 8.9665 | nan | 24.8418 | 23.8396 | | resnext50_32x4d | 8 | 0.9614 | 6.5833 | 9.0704 | nan | 24.0277 | 24.0333 | | hf_GPT2 | 4 | 1.488 | 8.5027 | nan | 87.2163 | 24.0244 | 23.0912 | | hf_Bert | 4 | 1.616 | 9.1365 | 12.6729 | nan | 22.9406 | 22.3991 | | shufflenet_v2_x1_0 | 128 | 1.0156 | 7.2094 | 9.8392 | nan | 22.7907 | 21.2181 | | Background_Matting | 4 | 0.9909 | 6.3246 | 8.9642 | nan | 20.7954 | 19.6994 | | Super_SloMo | 6 | 1.1028 | 6.4484 | 8.0831 | nan | 20.6696 | 20.2671 | | mobilenet_v2 | 96 | 0.8584 | 6.1171 | 8.6006 | nan | 20.2726 | 20.4217 | | functorch_dp_cifar10 | 64 | 0.3929 | 2.728 | 3.6415 | nan | 17.6071 | 17.6635 | | hf_DistilBert | 8 | 0.6198 | 4.4502 | 8.5208 | nan | 16.2671 | 15.9246 | | resnet18 | 16 | 0.4406 | 2.578 | 3.4999 | nan | 14.963 | 14.3136 | | pytorch_unet | 1 | 0.4727 | 2.9075 | 3.8839 | nan | 9.7682 | 9.5427 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4313 | 3.0172 | 3.9676 | nan | 9.6706 | 9.4831 | | LearningToPaint | 96 | 0.4636 | 2.7234 | 
3.7035 | nan | 8.6117 | 8.563 | | squeezenet1_1 | 32 | 0.2598 | 1.4481 | 1.9926 | 7.8958 | 5.4965 | 5.0324 | | vgg16 | 64 | 0.2009 | 0.987 | 1.4471 | 5.7818 | 4.5204 | 4.0277 | | drq | 1 | 0.1603 | 0.6902 | 1.0892 | nan | 4.1959 | 3.6484 | | nvidia_deeprecommender | 256 | 0.2202 | 0.6403 | 0.9543 | 6.6318 | 4.008 | 3.7148 | | soft_actor_critic | 256 | 0.2146 | 0.4478 | 0.6961 | 3.6869 | 3.5844 | 2.9922 | | alexnet | 128 | 0.1658 | 0.6135 | 0.9135 | 5.2849 | 3.4326 | 3.2888 | | dcgan | 32 | 0.1805 | 0.5746 | 0.82 | nan | 3.0609 | 2.724 | | lennard_jones | 1000 | 0.1574 | 0.4528 | 0.6439 | 4.0307 | 2.2498 | 1.9852 | | tts_angular | 64 | 0.2383 | 0.3016 | 0.4306 | 1.473 | 1.9203 | 1.6758 | | demucs | 4 | 0.3498 | 0.35 | 0.3597 | 0.3498 | 0.2603 | 0.2846 | | tacotron2 | 64 | 17.7834 | 35.4073 | 57.7449 | nan | nan | 71.2238 | | hf_GPT2_large | 4 | 5.5891 | 26.8395 | nan | nan | nan | 59.2869 | | hf_Longformer | 2 | 6.4093 | 17.2741 | 86.3997 | nan | nan | nan | | dlrm | 2048 | 0.4518 | nan | nan | nan | nan | nan | | BERT_pytorch | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientdet | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientnet | 0 | nan | nan | nan | nan | nan | nan | | timm_regnet | 0 | nan | nan | nan | nan | nan | nan | | timm_vovnet | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | hf_Albert | 8 | 0.9814 | 0.936 | 0.3268 | nan | 1.1576 | 1.4693 | | 
speech_transformer | 32 | 1.0017 | 0.9174 | 0.3306 | nan | 1.1102 | 1.1145 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.312 | nan | 1.0603 | 1.1512 | | Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3843 | nan | 1.0536 | 1.2945 | | timm_nfnet | 128 | 0.9691 | 0.8985 | nan | 0.7873 | 1.0337 | 1.1293 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | nan | nan | 1.0179 | 1.1759 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9866 | | Background_Matting | 4 | 1.0059 | 0.9548 | 0.3707 | nan | 0.9832 | 1.0337 | | hf_GPT2 | 4 | 0.9706 | 0.8847 | nan | 0.8601 | 0.9649 | 1.1243 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0017 | 0.8736 | 0.4209 | nan | 0.9335 | 1.0082 | | pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3571 | nan | 0.911 | 1.0853 | | yolov3 | 16 | 0.985 | 0.834 | 0.3518 | nan | 0.901 | 1.0402 | | hf_T5 | 8 | 0.9678 | 0.9331 | nan | nan | 0.8877 | 1.2564 | | timm_vision_transformer_large | 8 | 0.9974 | 0.8358 | nan | nan | 0.879 | 0.9541 | | densenet121 | 4 | 0.9883 | 0.866 | 0.3665 | nan | 0.876 | 1.0026 | | timm_resnest | 32 | 0.9875 | 0.8721 | 0.3482 | nan | 0.876 | 0.9969 | | hf_Bert | 4 | 0.9844 | 0.8753 | 0.3903 | nan | 0.8735 | 0.942 | | squeezenet1_1 | 32 | 0.9595 | 0.7951 | 0.346 | 0.5757 | 0.8731 | 1.0627 | | shufflenet_v2_x1_0 | 128 | 0.956 | 0.8419 | 0.3593 | nan | 0.8727 | 0.9966 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8657 | 1.0681 | | resnet50 | 32 | 0.9888 | 0.8617 | 0.3557 | nan | 0.8647 | 0.8839 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 | | hf_DistilBert | 8 | 0.9505 | 0.8806 | 0.3413 | nan | 0.8384 | 0.9049 | | dcgan | 32 | 0.9698 | 0.7838 | 0.4994 | nan | 0.8283 | 0.9695 | | hf_Bart | 4 | 0.9102 | 0.831 | nan | nan | 0.8226 | 0.9872 | | hf_BigBird | 2 | 0.9837 | 0.9784 | 0.4542 | nan | 0.8111 | 1.096 | | alexnet | 128 | 0.951 | 0.7753 | 0.4793 | 0.7444 | 0.7973 | 1.0079 | | mobilenet_v3_large | 32 | 0.9776 | 0.8503 | 
0.3454 | nan | 0.7902 | 0.816 | | pytorch_stargan | 16 | 0.9952 | 0.9711 | 0.426 | nan | 0.782 | 0.8862 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.6125 | 0.7633 | 1.0588 | | resnext50_32x4d | 8 | 0.9947 | 0.8545 | 0.3873 | nan | 0.7622 | 0.7746 | | mnasnet1_0 | 32 | 0.9788 | 0.8617 | 0.3406 | nan | 0.7529 | 0.7734 | | drq | 1 | 0.9877 | 0.8312 | 0.4769 | nan | 0.752 | 0.9256 | | LearningToPaint | 96 | 0.9245 | 0.7232 | 0.3858 | nan | 0.7349 | 0.9262 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4736 | 0.878 | 0.7295 | 1.0367 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3924 | nan | 0.7151 | 0.7249 | | resnet18 | 16 | 0.9779 | 0.7727 | 0.3944 | nan | 0.6102 | 0.6257 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | 0.4478 | 0.4806 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5081 | 0.4235 | 0.4353 | | hf_Reformer | 4 | 0.3764 | 0.9847 | 0.3481 | nan | 0.3629 | 0.9878 | | hf_GPT2_large | 4 | 0.9582 | 0.8718 | nan | nan | nan | 1.1353 | | tacotron2 | 64 | 0.9866 | 0.3963 | 0.3142 | nan | nan | 0.4114 | | hf_Longformer | 2 | 0.9734 | 0.967 | 0.349 | nan | nan | nan | | dlrm | 2048 | 0.7302 | nan | nan | nan | nan | nan | | BERT_pytorch | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientdet | 0 | nan | nan | nan | nan | nan | nan | | timm_efficientnet | 0 | nan | nan | nan | nan | nan | nan | | timm_regnet | 0 | nan | nan | nan | nan | nan | nan | | timm_vovnet | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name                                    | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert                        | 1   | 1.0239 | 0.826     | 0.0            | 0.0             | 4.614    | 1.6318                 |
| MobileBertForMaskedLM                   | 32  | 1.0181 | 0.8297    | 0.0            | 0.0             | 4.2693   | 1.7762                 |
| CamemBert                               | 1   | 1.0293 | 0.834     | 1.8354         | 0.0             | 4.2196   | 1.7765                 |
| MT5ForConditionalGeneration             | 8   | 1.0226 | 0.8289    | 0.0            | 0.0             | 4.1679   | 2.3001                 |
| MobileBertForQuestionAnswering          | 64  | 1.0184 | 0.8123    | 0.0            | 0.0             | 3.8676   | 1.7743                 |
| DistillGPT2                             | 1   | 1.0226 | 0.8669    | 1.3604         | 0.2463          | 2.6905   | 1.969                  |
| M2M100ForConditionalGeneration          | 8   | 1.0401 | 0.8773    | 1.2437         | 0.0             | 2.3511   | 1.7826                 |
| GPT2ForSequenceClassification           | 4   | 1.0013 | 0.9789    | 0.0            | 0.5041          | 2.3166   | 2.2754                 |
| MegatronBertForCausalLM                 | 16  | 1.0296 | 0.8479    | 0.961          | 0.0             | 2.1805   | 1.7624                 |
| ElectraForQuestionAnswering             | 64  | 1.0006 | 0.9702    | 0.7679         | 0.0             | 2.1234   | 2.0713                 |
| MegatronBertForQuestionAnswering        | 16  | 1.0361 | 0.8528    | 1.0944         | 0.0             | 2.006    | 1.7952                 |
| PLBartForConditionalGeneration          | 16  | 1.0153 | 0.8298    | 0.0            | 0.0             | 1.9522   | 1.7389                 |
| ElectraForCausalLM                      | 32  | 0.9999 | 0.9435    | 0.7154         | 0.0             | 1.8841   | 1.8924                 |
| LayoutLMForSequenceClassification       | 16  | 0.9999 | 0.9819    | 0.777          | 0.0             | 1.8715   | 1.7974                 |
| XGLMForCausalLM                         | 8   | 1.0139 | 0.821     | 0.929          | 0.0             | 1.8012   | 1.6165                 |
| PegasusForConditionalGeneration         | 16  | 1.0131 | 0.8312    | 0.9119         | 0.0             | 1.6901   | 1.5508                 |
| MBartForConditionalGeneration           | 16  | 1.0118 | 0.8365    | 0.0            | 0.0             | 1.6759   | 1.5993                 |
| T5Small                                 | 1   | 1.0263 | 0.8571    | 0.0            | 0.0             | 1.6711   | 1.4088                 |
| LayoutLMForMaskedLM                     | 16  | 1.0003 | 0.9614    | 0.7559         | 0.0             | 1.6641   | 1.6448                 |
| AlbertForQuestionAnswering              | 4   | 0.9998 | 0.8857    | 0.0            | 0.0             | 1.6605   | 1.6541                 |
| AlbertForMaskedLM                       | 4   | 1.0003 | 0.8853    | 0.0            | 0.0             | 1.6422   | 1.6485                 |
| Speech2Text2ForCausalLM                 | 128 | 1.0047 | 0.9364    | 0.7218         | 0.0             | 1.6202   | 1.5932                 |
| OPTForCausalLM                          | 32  | 1.0068 | 0.9311    | 0.0            | 0.0             | 1.5765   | 1.5643                 |
| RobertaForQuestionAnswering             | 128 | 0.9999 | 0.984     | 0.7802         | 0.0             | 1.5036   | 1.4606                 |
| BertForQuestionAnswering                | 128 | 1.0    | 0.9831    | 0.773          | 0.0             | 1.4942   | 1.4657                 |
| DistilBertForQuestionAnswering          | 64  | 1.0009 | 0.9508    | 0.7408         | 0.0             | 1.4904   | 1.4451                 |
| RobertaForCausalLM                      | 64  | 1.0004 | 0.9604    | 0.7536         | 0.0             | 1.4628   | 1.4494                 |
| BartForConditionalGeneration            | 2   | 1.0043 | 0.9692    | 0.0            | 0.0             | 1.4603   | 1.4231                 |
| BartForCausalLM                         | 4   | 1.0007 | 0.9692    | 0.0            | 0.0             | 1.456    | 1.4556                 |
| BlenderbotSmallForConditionalGeneration | 64  | 1.0064 | 0.9236    | 0.0            | 0.0             | 1.4308   | 1.4688                 |
| T5ForConditionalGeneration              | 4   | 0.9995 | 0.8513    | 0.0            | 0.0             | 1.411    | 1.4084                 |
| BertForMaskedLM                         | 64  | 1.0006 | 0.9564    | 0.7341         | 0.0             | 1.3769   | 1.3559                 |
| DebertaForMaskedLM                      | 4   | 0.9307 | 0.7206    | 0.9155         | 0.0             | 1.3204   | 1.1667                 |
| BlenderbotSmallForCausalLM              | 64  | 1.0037 | 0.9271    | 0.0            | 0.0             | 1.3163   | 1.3107                 |
| PLBartForCausalLM                       | 32  | 1.0081 | 0.9288    | 0.7912         | 0.0             | 1.306    | 1.3095                 |
| DistilBertForMaskedLM                   | 64  | 1.0006 | 0.9531    | 0.7101         | 0.0             | 1.2986   | 1.2922                 |
| MBartForCausalLM                        | 32  | 1.0012 | 0.9383    | 0.0            | 0.0             | 1.212    | 1.2098                 |
| TrOCRForCausalLM                        | 32  | 1.0018 | 0.9414    | 0.0            | 0.0             | 1.2056   | 1.2064                 |
| PegasusForCausalLM                      | 32  | 1.0018 | 0.9523    | 0.7496         | 0.0             | 1.1947   | 1.1978                 |
| DebertaForQuestionAnswering             | 8   | 0.9918 | 0.8952    | 0.7238         | 0.0             | 1.1866   | 1.2476                 |
| BigBird                                 | 1   | 0.9941 | 0.912     | 1.0239         | 0.0             | 1.1543   | 1.0196                 |
| AllenaiLongformerBase                   | 1   | 0.9371 | 0.7211    | 0.9211         | 0.0             | 0.0      | 0.0                    |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name                                    | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor    | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| DistilBertForMaskedLM                   | 1  | pass  | pass      | pass           | pass            | pass        | pass                   |
| DistilBertForQuestionAnswering          | 1  | pass  | pass      | pass           | pass            | pass        | pass                   |
| PegasusForConditionalGeneration         | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| RobertaForCausalLM                      | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| RobertaForQuestionAnswering             | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| Speech2Text2ForCausalLM                 | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| XGLMForCausalLM                         | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BartForCausalLM                         | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| BlenderbotSmallForConditionalGeneration | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| MBartForCausalLM                        | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| MT5ForConditionalGeneration             | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| MobileBertForMaskedLM                   | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| MobileBertForQuestionAnswering          | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| OPTForCausalLM                          | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| T5ForConditionalGeneration              | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| T5Small                                 | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| TrOCRForCausalLM                        | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| XLNetLMHeadModel                        | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| YituTechConvBert                        | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| PegasusForCausalLM                      | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| PLBartForCausalLM                       | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| MegatronBertForQuestionAnswering        | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BlenderbotSmallForCausalLM              | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| DistillGPT2                             | 1  | pass  | pass      | pass           | pass            | pass        | pass                   |
| BartForConditionalGeneration            | 1  | pass  | pass      | fail_to_run    | pass            | pass        | pass                   |
| GPT2ForSequenceClassification           | 1  | pass  | pass      | fail_to_run    | pass            | pass        | pass                   |
| AlbertForMaskedLM                       | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| AlbertForQuestionAnswering              | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BertForMaskedLM                         | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BertForQuestionAnswering                | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| MegatronBertForCausalLM                 | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BigBird                                 | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| CamemBert                               | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| DebertaForMaskedLM                      | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| DebertaForQuestionAnswering             | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| ElectraForCausalLM                      | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| ElectraForQuestionAnswering             | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| LayoutLMForMaskedLM                     | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| LayoutLMForSequenceClassification       | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| M2M100ForConditionalGeneration          | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| AllenaiLongformerBase                   | 1  | pass  | pass      | pass           | fail_to_run     | fail_to_run | fail_to_run            |
| MBartForConditionalGeneration           | 1  | pass  | pass      | fail_to_run    | fail_to_run     | fail_to_run | fail_to_run            |
| PLBartForConditionalGeneration          | 1  | pass  | pass      | fail_to_run    | fail_to_run     | fail_to_run | fail_to_run            |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaForQuestionAnswering | 8 | 5.0383 | 13.3498 | 48.3919 | nan | 118.8758 | 41.5633 | | XGLMForCausalLM | 8 | 2.7712 | 17.8961 | 61.02 | nan | 109.4076 | 106.3848 | | DebertaForMaskedLM | 4 | 5.0581 | 13.3804 | 48.8608 | nan | 109.1789 | 39.7697 | | M2M100ForConditionalGeneration | 8 | 3.4494 | 21.7769 | 34.3922 | nan | 87.6177 | 83.5918 | | MobileBertForMaskedLM | 32 | 9.4087 | 42.9125 | nan | nan | 79.6869 | 78.1582 | | MobileBertForQuestionAnswering | 64 | 9.4364 | 43.0802 | nan | nan | 78.2829 | 76.7187 | | PegasusForConditionalGeneration | 16 | 3.3676 | 24.1735 | 35.1847 | nan | 61.1011 | 56.6773 | | YituTechConvBert | 1 | 2.492 | 14.2617 | nan | nan | 59.5184 | 55.8722 | | MBartForConditionalGeneration | 16 | 3.5003 | 23.0304 | nan | nan | 58.0567 | 56.3379 | | BartForConditionalGeneration | 2 | 3.4975 | 23.4574 | nan | nan | 57.0317 | 56.1739 | | BigBird | 1 | 8.1897 | 17.6364 | 40.3299 | nan | 48.8434 | 31.905 | | MegatronBertForCausalLM | 16 | 3.6143 | 19.2551 | 27.2191 | nan | 47.1022 | 45.6927 | | MegatronBertForQuestionAnswering | 16 | 3.628 | 19.1526 | 27.6545 | nan | 45.1875 | 44.8777 | | MT5ForConditionalGeneration | 8 | 3.6649 | 17.6853 | nan | nan | 44.4499 | 43.5994 | | T5Small | 1 | 2.5091 | 12.4457 | nan | nan | 40.6273 | 39.9536 | | BlenderbotSmallForConditionalGeneration | 64 | 2.2293 | 14.9556 | nan | nan | 40.3905 | 39.5065 | | LayoutLMForSequenceClassification | 16 | 2.0241 | 
9.8095 | 14.0728 | nan | 38.0131 | 35.3809 | | PLBartForConditionalGeneration | 16 | 1.7721 | 12.1019 | nan | nan | 35.9261 | 35.4451 | | T5ForConditionalGeneration | 4 | 2.4772 | 12.3363 | nan | nan | 35.2701 | 35.0956 | | ElectraForCausalLM | 32 | 1.7312 | 9.5563 | 14.0642 | nan | 31.5697 | 29.2995 | | PegasusForCausalLM | 32 | 1.2853 | 8.6368 | 12.8444 | nan | 28.2984 | 27.1092 | | MBartForCausalLM | 32 | 1.2665 | 8.6136 | nan | nan | 26.879 | 25.8299 | | LayoutLMForMaskedLM | 16 | 2.0487 | 9.9474 | 14.336 | nan | 26.3214 | 25.4065 | | BertForMaskedLM | 64 | 1.5482 | 9.9601 | 13.191 | nan | 26.1158 | 24.6212 | | TrOCRForCausalLM | 32 | 1.2552 | 8.7548 | nan | nan | 26.0721 | 25.7277 | | OPTForCausalLM | 32 | 1.2683 | 9.1719 | nan | nan | 25.6483 | 25.463 | | BartForCausalLM | 4 | 1.3171 | 8.646 | nan | nan | 25.5691 | 25.1006 | | ElectraForQuestionAnswering | 64 | 1.7333 | 9.5633 | 13.3046 | nan | 25.4781 | 24.4479 | | BertForQuestionAnswering | 128 | 1.6183 | 10.1354 | 13.2257 | nan | 24.0511 | 23.1243 | | RobertaForCausalLM | 64 | 1.5599 | 9.4396 | 13.7073 | nan | 23.9477 | 23.6089 | | CamemBert | 1 | 1.6761 | 9.4983 | 12.7807 | nan | 23.0142 | 22.527 | | GPT2ForSequenceClassification | 4 | 1.5435 | 8.6531 | nan | 87.2687 | 22.4882 | 22.2024 | | RobertaForQuestionAnswering | 128 | 1.5972 | 9.6973 | 13.588 | nan | 22.2872 | 21.7395 | | AlbertForMaskedLM | 4 | 1.4378 | 9.1933 | nan | nan | 21.9562 | 20.8748 | | AlbertForQuestionAnswering | 4 | 1.4659 | 9.134 | nan | nan | 21.0039 | 20.1324 | | BlenderbotSmallForCausalLM | 64 | 0.8766 | 5.9693 | nan | nan | 20.7434 | 20.3335 | | DistillGPT2 | 1 | 0.7916 | 4.3628 | 5.9227 | 47.1883 | 18.4908 | 18.5576 | | PLBartForCausalLM | 32 | 0.637 | 4.5096 | 6.1915 | nan | 18.0146 | 17.6752 | | Speech2Text2ForCausalLM | 128 | 0.6861 | 4.4472 | 7.0236 | nan | 17.9145 | 17.4842 | | DistilBertForMaskedLM | 64 | 0.5906 | 4.5501 | 8.8807 | nan | 15.1859 | 14.6923 | | DistilBertForQuestionAnswering | 64 | 0.6236 | 4.5626 | 8.884 | 
nan | 14.3996 | 14.032 | | AllenaiLongformerBase | 1 | 6.9061 | 18.2925 | 88.2485 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 | | AlbertForMaskedLM | 4 | 0.9998 | 0.7431 | nan | nan | 1.1078 | 1.5319 | | BartForCausalLM | 4 | 1.0 | 0.8997 | nan | nan | 1.0943 | 1.1562 | | GPT2ForSequenceClassification | 4 | 0.9675 | 0.9164 | nan | 0.8823 | 1.0775 | 1.1632 | | PegasusForCausalLM | 32 | 0.9749 | 0.8906 | 0.4175 | nan | 1.0189 | 1.0913 | | ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 | | RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 | | BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 | | LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 | | BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9913 | 1.1976 | | MBartForCausalLM | 32 | 1.0 | 0.8924 | nan | nan | 0.9868 | 1.0636 | | LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3662 | nan | 0.9866 | 1.0264 | | OPTForCausalLM | 32 | 0.9996 | 0.8679 | nan | nan | 0.9834 | 1.0756 | | BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3787 | nan | 0.9803 | 1.0361 | | RobertaForCausalLM | 64 | 0.9991 | 0.8994 | 0.3787 | nan | 0.9794 | 1.0352 | | TrOCRForCausalLM | 32 | 1.0 | 0.8921 | nan | nan | 0.9642 | 1.0376 | | BlenderbotSmallForConditionalGeneration | 64 | 0.9999 
| 0.8918 | nan | nan | 0.9593 | 1.1105 | | T5Small | 1 | 1.0 | 0.884 | nan | nan | 0.9579 | 1.1475 | | T5ForConditionalGeneration | 4 | 0.9996 | 0.9527 | nan | nan | 0.9503 | 1.2292 | | DistilBertForMaskedLM | 64 | 0.9999 | 0.8599 | 0.3634 | nan | 0.9481 | 1.0273 | | Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8196 | 0.3532 | nan | 0.946 | 1.0791 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | nan | nan | 0.9335 | 1.0986 | | ElectraForCausalLM | 32 | 0.9974 | 0.848 | 0.3928 | nan | 0.927 | 1.0178 | | BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | nan | nan | 0.9269 | 1.0441 | | DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3468 | nan | 0.9267 | 1.0655 | | PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3979 | nan | 0.9217 | 1.0168 | | MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | nan | nan | 0.919 | 0.919 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.4377 | nan | 0.9159 | 1.0993 | | MegatronBertForCausalLM | 16 | 0.9998 | 0.8597 | 0.4044 | nan | 0.9036 | 1.0275 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0179 | | PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | nan | nan | 0.884 | 1.0295 | | BigBird | 1 | 1.0008 | 0.9547 | 0.4476 | nan | 0.8348 | 1.1039 | | XGLMForCausalLM | 8 | 0.9918 | 0.9164 | 0.4336 | nan | 0.8333 | 1.0324 | | DistillGPT2 | 1 | 0.9963 | 0.7984 | 0.4005 | 0.7468 | 0.817 | 1.0176 | | CamemBert | 1 | 0.9989 | 0.8143 | 0.416 | nan | 0.8153 | 0.931 | | YituTechConvBert | 1 | 0.9718 | 0.8648 | nan | nan | 0.7974 | 0.9279 | | M2M100ForConditionalGeneration | 8 | 1.0071 | 0.9583 | 0.4439 | nan | 0.7739 | 1.0501 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | nan | nan | 0.6698 | 0.8915 | | MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | nan | nan | 0.6085 | 0.8221 | | DebertaForMaskedLM | 4 | 0.9982 | 0.9824 | 0.3623 | nan | 0.4154 | 1.1123 | | DebertaForQuestionAnswering | 8 | 0.9754 | 1.0737 | 0.3251 | nan | 0.3071 | 1.1931 | | AllenaiLongformerBase | 1 | 
0.9977 | 0.9476 | 0.3852 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~
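
The per-model speedup numbers in this comment are easier to compare across backends when reduced to one figure per column. The following is a minimal sketch of that, not the dashboard's own aggregation. It assumes that a `0.0` (or `nan`) speedup marks a run that failed or was skipped, which the comment does not state explicitly, and it uses a geometric mean as the summary, which is my choice here. `parse_table` and `geomean_speedup` are hypothetical helper names; the sample rows are the first three rows of the huggingface "Performance speedup" table.

```python
import math

# First three rows of the huggingface "Performance speedup" table (amp),
# written one row per line. Border lines (+----+) are simply skipped.
SAMPLE = """\
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
| YituTechConvBert | 1 | 1.0239 | 0.826 | 0.0 | 0.0 | 4.614 | 1.6318 |
| MobileBertForMaskedLM | 32 | 1.0181 | 0.8297 | 0.0 | 0.0 | 4.2693 | 1.7762 |
| CamemBert | 1 | 1.0293 | 0.834 | 1.8354 | 0.0 | 4.2196 | 1.7765 |
"""


def parse_table(text):
    """Return (header, rows) from pipe-delimited lines (hypothetical helper)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    return cells[0], cells[1:]


def geomean_speedup(text):
    """Geometric mean of each backend column, ignoring 0.0 and nan entries.

    ASSUMPTION: 0.0 / nan mean "did not run"; they are excluded rather than
    counted as a zero speedup. Columns 0 and 1 are the model name and batch size.
    """
    header, rows = parse_table(text)
    result = {}
    for col, backend in enumerate(header[2:], start=2):
        values = [float(row[col]) for row in rows]
        values = [v for v in values if v > 0.0]  # nan also fails this test
        if values:
            result[backend] = math.exp(sum(math.log(v) for v in values) / len(values))
        else:
            result[backend] = float("nan")  # no successful run for this backend
    return result


if __name__ == "__main__":
    for backend, value in geomean_speedup(SAMPLE).items():
        print(f"{backend:24s} {value:.4f}")
```

Excluding failed runs makes a backend that only runs a few models look better than it is, so a summary like this should be read next to the pass/fail counts from the Accuracy table.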

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | tnt_s_patch16_224 | 128 | 1.0 | 0.9983 | 0.0 | 0.0 | 2.1303 | 2.0979 | | xcit_large_24_p8_224 | 5 | 0.9982 | 0.0 | 0.0 | 0.0 | 1.8261 | 1.7367 | | twins_pcpvt_base | 64 | 1.0096 | 0.9124 | 0.9064 | 0.0 | 1.7904 | 1.7391 | | volo_d1_224 | 64 | 0.9998 | 0.9943 | 0.0 | 0.0 | 1.5991 | 1.5622 | | dla102 | 128 | 0.9998 | 0.9962 | 0.8391 | 0.0 | 1.5814 | 1.5495 | | nfnet_l0 | 128 | 1.0001 | 0.8094 | 0.7142 | 0.9419 | 1.5544 | 1.4699 | | gmlp_s16_224 | 128 | 0.9997 | 0.9432 | 0.0 | 0.0 | 1.5463 | 1.5207 | | hrnet_w18 | 128 | 1.0 | 0.9942 | 0.8385 | 0.0 | 1.5384 | 1.4651 | | swin_base_patch4_window7_224 | 64 | 0.9997 | 0.9576 | 0.0 | 0.0 | 1.5359 | 1.4954 | | gmixer_24_224 | 128 | 1.0 | 0.8433 | 0.0 | 0.0 | 1.5332 | 1.4764 | | resnest101e | 64 | 0.9999 | 0.9925 | 0.8203 | 0.0 | 1.5305 | 1.4346 | | coat_lite_mini | 128 | 1.0 | 0.9772 | 0.8598 | 0.0 | 1.5238 | 1.5076 | | adv_inception_v3 | 128 | 1.0 | 0.996 | 0.8544 | 0.0 | 1.5079 | 1.4722 | | gluon_inception_v3 | 128 | 1.0 | 0.9934 | 0.8524 | 0.0 | 1.5072 | 1.4722 | | inception_v3 | 128 | 0.9999 | 0.9936 | 0.8543 | 0.0 | 1.5028 | 1.4688 | | dm_nfnet_f0 | 128 | 0.9991 | 0.9987 | 0.0 | 1.0972 | 1.4972 | 1.4304 | | cait_m36_384 | 4 | 1.0004 | 1.011 | 0.0 | 0.0 | 1.4682 | 1.4122 | | res2net50_14w_8s | 128 | 1.0 | 0.9946 | 0.8128 | 0.0 | 1.4668 | 1.4081 | | crossvit_9_240 | 128 | 1.0001 | 0.9931 | 0.8387 | 0.0 | 1.4505 | 1.4154 | | selecsls42b | 128 | 0.9996 | 0.9954 | 0.8427 | 0.0 | 1.4452 | 1.4121 | | convnext_base | 64 | 0.9997 | 0.9968 | 0.0 | 0.0 | 1.4415 | 1.4358 | | res2next50 | 128 | 0.9992 | 0.9962 | 0.8328 | 0.0 | 1.4124 | 
1.3459 | | jx_nest_base | 32 | 1.0 | 0.9938 | 0.0 | 0.0 | 1.3923 | 1.3565 | | convit_base | 64 | 1.0 | 0.9972 | 0.0 | 0.0 | 1.3397 | 1.3557 | | poolformer_m36 | 64 | 0.9998 | 0.9986 | 0.8073 | 0.0 | 1.3295 | 1.2969 | | res2net101_26w_4s | 64 | 1.0039 | 1.0055 | 0.7844 | 0.0 | 1.3222 | 1.2143 | | pit_b_224 | 64 | 0.9996 | 0.9959 | 0.8211 | 0.0 | 1.3203 | 1.3186 | | beit_base_patch16_224 | 64 | 1.0001 | 0.977 | 0.0 | 0.0 | 1.2822 | 1.2708 | | deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9921 | 0.7972 | 0.0 | 1.2788 | 1.2642 | | mixer_b16_224 | 128 | 1.0001 | 0.9594 | 0.7773 | 0.0 | 1.2781 | 1.2671 | | visformer_small | 128 | 1.0001 | 1.0023 | 0.8438 | 0.0 | 1.2429 | 1.1843 | | vit_base_patch16_224 | 64 | 0.9998 | 0.9926 | 0.8338 | 0.0 | 1.1955 | 1.1838 | | gluon_xception65 | 32 | 0.9992 | 0.9881 | 0.7557 | 0.0 | 1.1579 | 1.126 | | swsl_resnext101_32x16d | 32 | 0.9994 | 0.9862 | 0.8196 | 0.0 | 1.1512 | 1.0705 | | resmlp_12_224 | 128 | 1.0003 | 1.008 | 0.7898 | 0.0 | 1.0995 | 1.1031 | | convmixer_768_32 | 32 | 0.9999 | 0.998 | 0.9234 | 0.0 | 1.0558 | 1.0507 | | repvgg_a2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilenetv3_large_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilevit_s | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | pnasnet5large | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | regnety_002 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | sebotnet33ts_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | rexnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mnasnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | spnasnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tf_efficientnet_b0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tf_mixnet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mobilenetv2_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | lcnet_050 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | mixnet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | ghostnet_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | gernet_l | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | 
fbnetv3_b | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | fbnetc_100 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | ese_vovnet19b_dw | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | eca_halonext26ts | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | eca_botnext26ts_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | dpn107 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | cspdarknet53 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | botnet26t_256 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tinynet_a | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | pit_b_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | pnasnet5large | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass | | rexnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | 
beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | gmlp_s16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | res2next50 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | fail_to_run | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | 
pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | dla102 | 2 | pass | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | fbnetv3_b | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | gluon_xception65 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | twins_pcpvt_base | 64 | 2.8559 | 20.3902 | 33.5922 | nan | 149.4559 | 139.0497 | | hrnet_w18 | 128 | 6.5598 | 43.8169 | 73.3327 | nan | 130.5434 | 125.0465 | | swin_base_patch4_window7_224 | 64 | 2.9322 | 17.5498 | nan | nan | 98.6184 | 97.0174 | | xcit_large_24_p8_224 | 5 | 3.2573 | nan | nan | nan | 96.348 | 95.1547 | | convnext_base | 64 | 1.6522 | 9.9178 | nan | nan | 90.4142 | 89.6411 | | resnest101e | 64 | 3.4547 | 23.5134 | 35.895 | nan | 88.3368 | 82.2843 | | cait_m36_384 | 4 | 3.3986 | 26.458 | nan | nan | 84.6596 | 82.2956 | | jx_nest_base | 32 | 1.9293 | 12.7258 | nan | nan | 75.043 | 73.881 | | coat_lite_mini | 128 | 1.2265 | 7.4261 | 10.7706 | nan | 73.5471 | 73.3636 | | res2net101_26w_4s | 64 | 3.5389 | 23.5726 | 36.523 | nan | 70.5675 | 65.4051 | | res2net50_14w_8s | 128 | 2.8681 | 20.6828 | 31.9374 | nan | 63.6169 | 58.5497 | | gmlp_s16_224 | 128 | 1.2857 | 10.1588 | nan | nan | 53.1451 | 51.0354 | | poolformer_m36 | 64 | 1.9694 | 12.068 | 18.3023 | nan | 51.8117 | 49.3716 | | crossvit_9_240 | 128 | 1.6409 | 11.949 | 17.4197 | nan | 51.5512 | 49.5685 | | gluon_xception65 | 32 | 2.1018 | 15.352 | 22.9649 | nan | 47.9625 | 45.2323 | | volo_d1_224 | 64 | 1.3501 | 10.6092 | nan | nan | 47.8303 | 46.0939 | | tnt_s_patch16_224 | 128 | 1.8499 | 15.0264 | nan | nan | 46.306 | 43.4351 | | gmixer_24_224 | 128 | 1.4691 | 11.2179 | nan | nan | 42.036 | 40.4228 | | swsl_resnext101_32x16d | 32 | 1.8747 | 13.3092 | 19.3859 | nan | 41.0078 | 39.2914 | | adv_inception_v3 | 128 | 1.6592 | 12.2665 | 17.9679 | nan | 40.9793 | 38.4847 | | inception_v3 | 128 | 1.6901 | 12.5495 | 17.8476 | nan | 40.9639 | 38.8102 | | gluon_inception_v3 | 128 | 1.6708 | 12.5597 | 17.9522 | nan | 40.8232 | 38.6719 | | dla102 | 128 | 1.8331 | 13.5096 | 20.1957 | nan | 39.7957 | 37.3081 | | res2next50 | 128 | 1.6913 | 11.7683 | 16.8519 | nan | 35.3057 | 
33.6372 | | convit_base | 64 | 1.2609 | 8.3201 | nan | nan | 34.9706 | 33.8347 | | dm_nfnet_f0 | 128 | 2.0788 | 9.6942 | nan | 160.3779 | 34.6106 | 32.8156 | | convmixer_768_32 | 32 | 1.2546 | 8.939 | 12.8912 | nan | 30.1678 | 28.021 | | visformer_small | 128 | 0.8498 | 5.3469 | 7.7323 | nan | 28.5288 | 27.8333 | | deit_base_distilled_patch16_224 | 64 | 0.9993 | 6.4668 | 9.1851 | nan | 28.2649 | 26.8574 | | mixer_b16_224 | 128 | 0.8319 | 5.1491 | 8.846 | nan | 28.217 | 27.7096 | | resmlp_12_224 | 128 | 0.7316 | 4.1849 | 8.1069 | nan | 27.6133 | 26.6144 | | vit_base_patch16_224 | 64 | 0.9837 | 6.9662 | 9.6727 | nan | 27.419 | 25.8685 | | nfnet_l0 | 128 | 1.7713 | 9.6841 | 13.5357 | 150.4424 | 25.937 | 24.5997 | | beit_base_patch16_224 | 64 | 1.3117 | 7.521 | nan | nan | 25.7513 | 24.4593 | | pit_b_224 | 64 | 1.0485 | 7.253 | 10.6045 | nan | 23.773 | 22.3992 | | selecsls42b | 128 | 0.8781 | 5.4673 | 7.6968 | nan | 20.0212 | 18.3774 | | botnet26t_256 | 0 | nan | nan | nan | nan | nan | nan | | cspdarknet53 | 0 | nan | nan | nan | nan | nan | nan | | dpn107 | 0 | nan | nan | nan | nan | nan | nan | | eca_botnext26ts_256 | 0 | nan | nan | nan | nan | nan | nan | | eca_halonext26ts | 0 | nan | nan | nan | nan | nan | nan | | ese_vovnet19b_dw | 0 | nan | nan | nan | nan | nan | nan | | fbnetc_100 | 0 | nan | nan | nan | nan | nan | nan | | fbnetv3_b | 0 | nan | nan | nan | nan | nan | nan | | gernet_l | 0 | nan | nan | nan | nan | nan | nan | | ghostnet_100 | 0 | nan | nan | nan | nan | nan | nan | | lcnet_050 | 0 | nan | nan | nan | nan | nan | nan | | mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | mnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv2_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv3_large_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilevit_s | 0 | nan | nan | nan | nan | nan | nan | | pnasnet5large | 0 | nan | nan | nan | nan | nan | nan | | regnety_002 | 0 | nan | nan | nan | nan | nan | nan | | repvgg_a2 | 0 | nan | 
nan | nan | nan | nan | nan | | rexnet_100 | 0 | nan | nan | nan | nan | nan | nan | | sebotnet33ts_256 | 0 | nan | nan | nan | nan | nan | nan | | spnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | tf_efficientnet_b0 | 0 | nan | nan | nan | nan | nan | nan | | tf_mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | tinynet_a | 0 | nan | nan | nan | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9926 | 0.9248 | nan | nan | 1.3102 | 1.3732 | | gmlp_s16_224 | 128 | 0.9938 | 0.9495 | nan | nan | 1.2842 | 1.2997 | | poolformer_m36 | 64 | 0.9983 | 0.9433 | 0.3413 | nan | 1.1017 | 1.1171 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0829 | 1.1492 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3474 | nan | 1.0595 | 1.1461 | | convit_base | 64 | 0.9966 | 0.8516 | nan | nan | 1.0529 | 1.1534 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | nan | nan | 1.0378 | 1.1389 | | dm_nfnet_f0 | 128 | 0.9692 | 0.8981 | nan | 0.7871 | 1.0336 | 1.1292 | | nfnet_l0 | 128 | 0.9887 | 0.8167 | 0.2677 | 0.5241 | 1.0318 | 1.1803 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | nan | 0.9907 | 1.2281 | | convmixer_768_32 | 32 | 0.9972 | 0.9785 | 0.3446 | nan | 0.9759 | 0.9792 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9749 | 1.0803 | | visformer_small | 128 | 0.9897 | 0.9255 | 0.3467 | nan | 0.9613 | 1.0514 | | dla102 | 128 | 0.9684 | 0.9114 | 0.3365 | nan | 
0.9554 | 1.0311 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9932 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.929 | 0.9775 | | swsl_resnext101_32x16d | 32 | 0.9988 | 0.8771 | 0.3668 | nan | 0.9094 | 0.9794 | | mixer_b16_224 | 128 | 0.992 | 0.9362 | 0.3444 | nan | 0.9073 | 0.9799 | | res2net101_26w_4s | 64 | 0.994 | 0.9149 | 0.3339 | nan | 0.8973 | 0.9734 | | inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | nan | 0.8968 | 1.0255 | | adv_inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | nan | 0.8968 | 1.0255 | | gluon_inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | nan | 0.8968 | 1.0257 | | gluon_xception65 | 32 | 0.9955 | 0.8848 | 0.3345 | nan | 0.8967 | 0.9753 | | hrnet_w18 | 128 | 0.9914 | 0.9175 | 0.3347 | nan | 0.8966 | 1.0033 | | selecsls42b | 128 | 0.9796 | 0.8773 | 0.3532 | nan | 0.892 | 0.9903 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | nan | 0.8916 | 0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | nan | 0.8911 | 0.8962 | | res2net50_14w_8s | 128 | 0.9907 | 0.907 | 0.3231 | nan | 0.8765 | 0.9737 | | convnext_base | 64 | 1.003 | 0.9263 | nan | nan | 0.8762 | 0.9864 | | res2next50 | 128 | 0.991 | 0.9094 | 0.32 | nan | 0.871 | 0.9666 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | nan | 0.8174 | 1.0976 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | nan | 0.8033 | 1.0353 | | resmlp_12_224 | 128 | 0.9827 | 0.687 | 0.2373 | nan | 0.7876 | 0.8011 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 | | jx_nest_base | 32 | 0.9983 | 0.8927 | nan | nan | 0.6707 | 0.8618 | | botnet26t_256 | 0 | nan | nan | nan | nan | nan | nan | | cspdarknet53 | 0 | nan | nan | nan | nan | nan | nan | | dpn107 | 0 | nan | nan | nan | nan | nan | nan | | eca_botnext26ts_256 | 0 | nan | nan | nan | nan | nan | nan | | eca_halonext26ts | 0 | nan | nan | nan | nan | nan | nan | | ese_vovnet19b_dw | 0 | nan | nan | nan | nan | nan | nan | | fbnetc_100 | 0 | nan | nan | nan | 
nan | nan | nan | | fbnetv3_b | 0 | nan | nan | nan | nan | nan | nan | | gernet_l | 0 | nan | nan | nan | nan | nan | nan | | ghostnet_100 | 0 | nan | nan | nan | nan | nan | nan | | lcnet_050 | 0 | nan | nan | nan | nan | nan | nan | | mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | mnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv2_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilenetv3_large_100 | 0 | nan | nan | nan | nan | nan | nan | | mobilevit_s | 0 | nan | nan | nan | nan | nan | nan | | pnasnet5large | 0 | nan | nan | nan | nan | nan | nan | | regnety_002 | 0 | nan | nan | nan | nan | nan | nan | | repvgg_a2 | 0 | nan | nan | nan | nan | nan | nan | | rexnet_100 | 0 | nan | nan | nan | nan | nan | nan | | sebotnet33ts_256 | 0 | nan | nan | nan | nan | nan | nan | | spnasnet_100 | 0 | nan | nan | nan | nan | nan | nan | | tf_efficientnet_b0 | 0 | nan | nan | nan | nan | nan | nan | | tf_mixnet_l | 0 | nan | nan | nan | nan | nan | nan | | tinynet_a | 0 | nan | nan | nan | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/BeXBGO3.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/PV4NSdr.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/kSgUz5V.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint compression ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

Models that fail the accuracy checks are removed before we compute speedup, compilation latency, and memory footprint numbers.
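As a rough illustration of that filtering step, here is a minimal sketch. The helper name `drop_failing` and the dict layout are hypothetical, not taken from the benchmark scripts, and treating any status that starts with `pass` (including `pass_due_to_skip`) as passing is an assumption. The sample values are the torchbench inductor entries from the tables in this comment.

```python
def drop_failing(metrics, accuracy):
    """Keep only models whose accuracy status starts with 'pass'.

    metrics:  {model_name: metric_value}   (hypothetical layout)
    accuracy: {model_name: status_string}  e.g. 'pass', 'fail_accuracy'
    Models with no recorded status are dropped as well.
    """
    return {
        name: value
        for name, value in metrics.items()
        if accuracy.get(name, "").startswith("pass")
    }


# Inductor speedups and accuracy statuses as reported in the torchbench tables.
inductor_speedup = {"densenet121": 4.1785, "resnet50_quantized_qat": 1.3832}
inductor_accuracy = {"densenet121": "pass", "resnet50_quantized_qat": "fail_accuracy"}

kept = drop_failing(inductor_speedup, inductor_accuracy)
print(kept)
```

With these inputs only `densenet121` survives, so a model that is fast but numerically wrong does not inflate the summary numbers.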

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 53/55 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 93%, 51/55 | 100%, 43/43 | 97%, 59/61  |
|     aot_cudagraphs     | 75%, 41/55 | 49%, 21/43  | 38%, 23/61  |
|    nvprims_nvfuser     | 73%, 40/55 |  16%, 7/43  | 48%, 29/61  |
|        inductor        | 85%, 47/55 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 93%, 51/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
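
The `percent, passed/total` cells above can be reproduced from a column of accuracy statuses. This is only a sketch: `pass_rate` is a hypothetical helper, counting `pass_due_to_skip` as a pass is an assumption, and the rounding to a whole percent is inferred from the table rather than confirmed by the source.

```python
def pass_rate(statuses):
    """Return (percent, passed, total) for a list of accuracy statuses.

    Assumption: any status starting with 'pass' counts as passing.
    """
    passed = sum(1 for s in statuses if s.startswith("pass"))
    total = len(statuses)
    percent = round(100 * passed / total) if total else 0
    return percent, passed, total


def format_cell(statuses):
    """Format a cell the way the passrate table does, e.g. '96%, 53/55'."""
    percent, passed, total = pass_rate(statuses)
    return f"{percent}%, {passed}/{total}"


# Synthetic status list with the same counts as the eager/torchbench cell.
example = ["pass"] * 53 + ["fail_to_run"] * 2
print(format_cell(example))
```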

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.06x    |    1.02x    |    1.00x    |
|    nvprims_nvfuser     |   1.03x    |    1.00x    |    1.15x    |
|        inductor        |   1.42x    |    1.31x    |    1.25x    |
| inductor_no_cudagraphs |   1.25x    |    1.23x    |    1.24x    |
+------------------------+------------+-------------+-------------+
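
For reference, a geometric mean over per-model speedups can be computed as below. This is a sketch under assumptions: `geomean_speedup` is a hypothetical helper, and skipping entries that are `0.0` or `nan` reflects how failed runs appear in the per-model tables, not a confirmed detail of the dashboard script.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive entries in `speedups`.

    Entries that are 0.0 or NaN (assumed to mark failed runs) are skipped.
    Returns NaN when nothing valid remains.
    """
    valid = [s for s in speedups if s > 0]  # NaN > 0 is False, so NaN is dropped
    if not valid:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow from a raw product
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


result = geomean_speedup([2.0, 8.0, 0.0])
print(f"{result:.2f}x")
```

The geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 0.5x slowdown cancel to 1x instead of averaging to 1.25x.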

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.18    |    2.29     |    1.91     |
|       aot_eager        |    5.85    |    7.55     |    7.04     |
|     aot_cudagraphs     |    7.64    |    16.05    |    13.24    |
|    nvprims_nvfuser     |   77.58    |   133.12    |   149.39    |
|        inductor        |   33.95    |    32.13    |    38.53    |
| inductor_no_cudagraphs |   33.13    |    27.28    |    37.01    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.32x    |
|    nvprims_nvfuser     |   0.81x    |    0.83x    |    0.81x    |
|        inductor        |   0.84x    |    0.72x    |    0.98x    |
| inductor_no_cudagraphs |   0.99x    |    0.97x    |    1.09x    |
+------------------------+------------+-------------+-------------+
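
The source does not spell out how the compression ratio is defined. A reading consistent with "higher is better" is the baseline (native PyTorch) peak memory divided by the peak memory under the compiled backend. The sketch below assumes that definition, and `compression_ratio` is a hypothetical helper.

```python
def compression_ratio(baseline_peak_bytes, compiled_peak_bytes):
    """Assumed definition: baseline peak memory / compiled peak memory.

    A value above 1.0 means the compiled run used less peak memory than the
    baseline; a value below 1.0 means it used more.
    """
    if compiled_peak_bytes <= 0:
        raise ValueError("compiled_peak_bytes must be positive")
    return baseline_peak_bytes / compiled_peak_bytes


# Made-up byte counts, for illustration only.
ratio = compression_ratio(8 * 1024**3, 10 * 1024**3)
print(f"{ratio:.2f}x")
```

Under this reading, the roughly 0.3x to 0.4x values for aot_cudagraphs would mean that backend's peak memory was about 2.5 to 3 times the baseline.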

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0073 | 1.0232 | 1.7277 | 0.7646 | 4.1785 | 1.4112 | | timm_efficientdet | 1 | 0.9807 | 0.8819 | 0.0 | 0.0 | 3.6985 | 1.6512 | | timm_vision_transformer | 8 | 1.0022 | 0.9303 | 1.5222 | 0.6867 | 2.6024 | 1.3998 | | functorch_dp_cifar10 | 64 | 0.9999 | 1.0313 | 1.5063 | 0.0 | 2.5867 | 1.337 | | BERT_pytorch | 16 | 1.0113 | 0.894 | 0.0 | 1.0072 | 2.1101 | 2.1434 | | drq | 1 | 1.0027 | 0.8861 | 1.3124 | 0.6954 | 1.9886 | 1.0698 | | mobilenet_v3_large | 32 | 1.009 | 1.1235 | 0.8887 | 0.8628 | 1.8272 | 1.419 | | pytorch_struct | 200 | 0.9888 | 0.7486 | 0.8787 | 0.8171 | 1.8073 | 1.1551 | | lennard_jones | 1000 | 0.9618 | 0.8462 | 1.0296 | 0.6983 | 1.7829 | 0.9396 | | hf_T5_large | 2 | 1.0256 | 0.9173 | 0.0 | 0.0 | 1.7014 | 1.6628 | | hf_Albert | 8 | 1.001 | 0.9974 | 0.7514 | 0.0 | 1.6464 | 1.6434 | | resnext50_32x4d | 8 | 0.9968 | 1.1611 | 0.9215 | 0.7651 | 1.6286 | 1.3441 | | speech_transformer | 32 | 1.0054 | 0.9141 | 1.5196 | 0.0 | 1.5384 | 1.5495 | | timm_resnest | 32 | 1.0 | 1.0013 | 0.8056 | 1.1789 | 1.5183 | 1.4527 | | hf_GPT2 | 4 | 1.0099 | 0.9789 | 0.7365 | 0.407 | 1.5031 | 1.5025 | | timm_nfnet | 128 | 0.9996 | 1.0002 | 0.0 | 1.2382 | 1.4784 | 1.4236 | | shufflenet_v2_x1_0 | 128 | 0.9997 | 1.0006 | 0.7661 | 0.877 | 1.4528 | 1.3684 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9992 | 1.076 | 1.0846 | 0.0 | 1.4342 | 1.3487 | | mobilenet_v2_quantized_qat | 96 | 1.0019 | 0.9789 | 0.0 | 1.4488 | 1.431 | 1.4337 | | mobilenet_v2 | 96 | 0.9999 | 0.9998 | 0.7241 | 1.2795 | 1.4307 | 1.4119 | | soft_actor_critic | 256 | 0.9865 | 0.7812 | 1.0656 | 0.6877 | 
1.4303 | 0.9423 | | fastNLP_Bert | 6 | 0.9994 | 0.9766 | 0.7519 | 0.8363 | 1.4185 | 1.3933 | | resnet18 | 16 | 1.0038 | 1.1123 | 0.9299 | 0.8793 | 1.398 | 1.2446 | | resnet50_quantized_qat | 32 | 1.0012 | 0.9725 | 0.0 | 1.2075 | 1.3832 | 1.3818 | | mnasnet1_0 | 32 | 1.001 | 1.0528 | 0.8068 | 0.9717 | 1.3744 | 1.2855 | | squeezenet1_1 | 32 | 0.992 | 1.0182 | 0.8271 | 0.8195 | 1.3629 | 1.3261 | | dcgan | 32 | 0.9224 | 1.0147 | 1.0036 | 0.7891 | 1.2791 | 1.038 | | pytorch_stargan | 16 | 0.9994 | 1.083 | 0.9262 | 0.0 | 1.2665 | 1.246 | | hf_Bert | 4 | 1.0306 | 0.9942 | 0.7776 | 0.753 | 1.2126 | 1.1932 | | LearningToPaint | 96 | 0.9991 | 1.0008 | 0.8083 | 0.989 | 1.2125 | 1.2041 | | resnet50 | 32 | 0.999 | 0.9867 | 0.7582 | 1.0986 | 1.2044 | 1.1701 | | hf_Bart | 4 | 1.0117 | 0.9693 | 0.0 | 0.7779 | 1.2025 | 1.2042 | | pytorch_unet | 1 | 0.9996 | 0.998 | 0.8448 | 1.0965 | 1.1977 | 1.1869 | | Super_SloMo | 6 | 0.9999 | 0.9977 | 0.8629 | 0.0 | 1.1801 | 1.1658 | | hf_DistilBert | 8 | 1.0005 | 0.9558 | 0.6883 | 0.4355 | 1.1741 | 1.1824 | | timm_efficientnet | 32 | 0.9508 | 0.8023 | 0.6184 | 0.8531 | 1.1739 | 1.1672 | | vgg16 | 64 | 0.9999 | 0.999 | 0.8588 | 0.9984 | 1.1731 | 1.1685 | | alexnet | 128 | 0.9987 | 0.9979 | 0.8036 | 1.0031 | 1.1632 | 1.1637 | | timm_regnet | 32 | 0.9648 | 0.9611 | 0.7814 | 1.1007 | 1.1274 | 1.0923 | | Background_Matting | 4 | 0.9997 | 1.0229 | 0.8624 | 1.0787 | 1.1182 | 1.1096 | | hf_Reformer | 4 | 0.9966 | 0.0 | 0.9186 | 0.0 | 1.1057 | 1.1296 | | yolov3 | 16 | 0.9998 | 0.9945 | 0.7891 | 1.1953 | 1.0956 | 1.0804 | | hf_BigBird | 2 | 0.9908 | 0.9395 | 0.9582 | 0.9129 | 1.0938 | 1.004 | | attention_is_all_you_need_pytorch | 256 | 1.0 | 0.9718 | 0.0 | 0.7047 | 1.0649 | 1.0504 | | timm_vision_transformer_large | 8 | 0.9995 | 0.9955 | 0.0 | 0.0 | 1.0513 | 1.0355 | | tts_angular | 64 | 0.9874 | 0.9628 | 0.9813 | 0.9696 | 1.0133 | 1.0162 | | timm_vovnet | 32 | 0.907 | 0.9037 | 0.712 | 0.9708 | 1.0032 | 1.017 | | demucs | 4 | 0.9999 | 0.9996 | 1.0005 | 
0.9997 | 0.9997 | 1.0 | | nvidia_deeprecommender | 256 | 0.9989 | 0.9627 | 0.5849 | 0.9596 | 0.9038 | 0.9641 | | dlrm | 2048 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.2258 | | hf_GPT2_large | 4 | 1.0003 | 0.9806 | 0.0 | 0.0 | 0.0 | 1.4734 | | hf_T5 | 8 | 1.0007 | 0.9559 | 0.0 | 0.9346 | 0.0 | 1.572 | | tacotron2 | 64 | 0.9723 | 0.8331 | 0.0 | 0.7327 | 0.0 | 0.8931 | | hf_Longformer | 2 | 0.9623 | 0.8889 | 0.8145 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass 
| pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | pass | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | 
pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.8489 | 7.1655 | 10.0729 | 129.6138 | 367.9163 | 359.5926 | | timm_efficientdet | 1 | 19.7716 | 32.9228 | nan | nan | 168.889 | 163.4321 | | hf_T5_large | 2 | 13.7674 | 34.0666 | nan | nan | 113.5007 | 
112.3016 | | timm_resnest | 32 | 0.5575 | 2.0447 | 3.0605 | 64.774 | 68.3593 | 64.7382 | | timm_vision_transformer_large | 8 | 2.3404 | 11.0359 | nan | nan | 62.8803 | 62.2697 | | attention_is_all_you_need_pytorch | 256 | 1.1055 | 5.5299 | nan | 159.9519 | 53.4653 | 53.7922 | | timm_vision_transformer | 8 | 0.8181 | 3.4239 | 4.9187 | 90.0021 | 50.1835 | 48.9896 | | pytorch_stargan | 16 | 0.3769 | 1.6957 | 2.4906 | nan | 46.4718 | 46.4481 | | densenet121 | 4 | 2.0713 | 9.9643 | 15.993 | 213.921 | 41.6719 | 40.7468 | | hf_BigBird | 2 | 7.5694 | 12.7372 | 25.5076 | 106.1549 | 38.5595 | 25.6967 | | pytorch_struct | 200 | 0.2378 | 0.624 | 1.1679 | 5.3912 | 35.9907 | 42.2173 | | BERT_pytorch | 16 | 1.4299 | 5.9212 | nan | 138.6767 | 32.6551 | 32.6124 | | hf_Bart | 4 | 1.4107 | 6.8443 | nan | 158.4579 | 30.4235 | 29.7468 | | resnet50_quantized_qat | 32 | 1.1201 | 7.2204 | nan | 217.9893 | 29.4197 | 29.5108 | | mobilenet_v3_large | 32 | 0.8462 | 3.8094 | 5.7607 | 108.4129 | 29.0656 | 29.0382 | | timm_nfnet | 128 | 1.9932 | 6.1896 | nan | 168.1997 | 27.715 | 27.2824 | | fastNLP_Bert | 6 | 1.4734 | 5.3109 | 8.7004 | 124.0749 | 26.9767 | 25.7277 | | hf_Reformer | 4 | 2.3859 | nan | 8.2242 | nan | 26.8341 | 21.6647 | | mobilenet_v2_quantized_qat | 96 | 1.2373 | 7.3022 | nan | 237.3884 | 26.22 | 26.7604 | | timm_regnet | 32 | 2.2424 | 6.6972 | 17.091 | 128.1813 | 25.9689 | 24.4052 | | speech_transformer | 32 | 1.588 | 6.7012 | 46.9473 | nan | 25.4979 | 24.9912 | | timm_efficientnet | 32 | 1.684 | 5.4737 | 13.6715 | 119.6127 | 25.3147 | 24.9302 | | mnasnet1_0 | 32 | 0.7746 | 3.5298 | 5.2275 | 77.9598 | 21.125 | 20.6513 | | resnet50 | 32 | 0.8144 | 3.7974 | 5.6432 | 89.28 | 20.6638 | 20.0396 | | timm_vovnet | 32 | 1.4558 | 3.7036 | 8.7855 | 61.6543 | 20.5533 | 20.48 | | resnext50_32x4d | 8 | 0.8836 | 3.6467 | 5.5029 | 77.1638 | 20.0561 | 18.9227 | | hf_Albert | 8 | 1.0379 | 4.433 | 7.3244 | nan | 18.6881 | 19.1881 | | hf_Bert | 4 | 1.3493 | 5.0893 | 7.9591 | 129.5783 | 18.5199 | 
18.0525 | | hf_GPT2 | 4 | 1.3093 | 4.7666 | 7.3543 | 89.4633 | 18.4693 | 17.8065 | | shufflenet_v2_x1_0 | 128 | 0.866 | 4.0718 | 6.2154 | 98.0874 | 17.3625 | 17.2423 | | Super_SloMo | 6 | 1.0023 | 4.0062 | 5.8778 | nan | 16.548 | 15.8267 | | functorch_dp_cifar10 | 64 | 0.3484 | 1.36 | 2.0544 | nan | 16.233 | 15.6188 | | mobilenet_v2 | 96 | 0.7404 | 3.727 | 6.0224 | 111.9783 | 16.2076 | 15.8458 | | Background_Matting | 4 | 0.878 | 3.6952 | 5.7431 | 81.9547 | 16.0097 | 15.2399 | | resnet18 | 16 | 0.3921 | 1.4773 | 2.267 | 32.4778 | 13.2177 | 13.2976 | | hf_DistilBert | 8 | 0.4453 | 2.3835 | 5.3653 | 51.8942 | 13.1569 | 12.9209 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.3777 | 1.5419 | 2.4044 | nan | 7.9731 | 7.8363 | | pytorch_unet | 1 | 0.4291 | 1.6622 | 2.4625 | 32.5506 | 7.8373 | 7.607 | | LearningToPaint | 96 | 0.4183 | 1.5464 | 2.3597 | 43.4857 | 6.9894 | 6.8494 | | squeezenet1_1 | 32 | 0.2215 | 0.6644 | 1.013 | 4.5345 | 3.8778 | 3.6721 | | drq | 1 | 0.1415 | 0.3599 | 0.6232 | 4.3376 | 3.8322 | 3.2032 | | nvidia_deeprecommender | 256 | 0.1953 | 0.386 | 0.6409 | 7.3069 | 3.4446 | 3.255 | | vgg16 | 64 | 0.1886 | 0.4731 | 0.827 | 3.1845 | 3.4207 | 3.2322 | | soft_actor_critic | 256 | 0.2012 | 0.2982 | 0.5068 | 2.638 | 3.1945 | 2.7885 | | alexnet | 128 | 0.1554 | 0.3248 | 0.562 | 3.8346 | 2.9603 | 2.6853 | | dcgan | 32 | 0.1749 | 0.3556 | 0.5726 | 4.2682 | 2.7073 | 2.4404 | | lennard_jones | 1000 | 0.1376 | 0.243 | 0.3874 | 2.2109 | 2.0429 | 1.807 | | tts_angular | 64 | 0.2081 | 0.2477 | 0.3761 | 1.046 | 2.0138 | 1.7874 | | demucs | 4 | 0.3104 | 0.3121 | 0.3044 | 0.3082 | 0.2112 | 0.2116 | | tacotron2 | 64 | 17.7027 | 28.7175 | nan | 63.1116 | nan | 67.8091 | | hf_GPT2_large | 4 | 4.9217 | 15.7161 | nan | nan | nan | 42.3709 | | hf_T5 | 8 | 2.273 | 7.6284 | nan | 95.584 | nan | 27.1749 | | dlrm | 2048 | nan | nan | nan | nan | nan | 2.9469 | | hf_Longformer | 2 | 6.1739 | 12.6633 | 55.1143 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9961 | 0.8279 | nan | 0.8271 | 1.5828 | 1.5828 | | resnet50_quantized_qat | 32 | 0.9971 | 0.9148 | nan | 0.8498 | 1.4863 | 1.4864 | | timm_efficientnet | 32 | 0.9932 | 0.7665 | 0.2636 | 0.771 | 1.3111 | 1.3958 | | Super_SloMo | 6 | 1.0024 | 0.9527 | 0.3629 | nan | 1.2027 | 1.4002 | | mobilenet_v2 | 96 | 0.9923 | 0.7624 | 0.3061 | 0.7641 | 1.1741 | 1.2826 | | timm_efficientdet | 1 | 1.0104 | 0.8221 | nan | nan | 1.1174 | 1.1439 | | squeezenet1_1 | 32 | 0.9781 | 0.8163 | 0.3372 | 0.8132 | 1.0821 | 1.1897 | | speech_transformer | 32 | 0.9974 | 0.9159 | 0.2704 | nan | 1.0396 | 1.0448 | | timm_nfnet | 128 | 0.9358 | 0.8937 | nan | 0.879 | 1.0221 | 1.096 | | demucs | 4 | 0.9884 | 0.9884 | 0.9888 | 0.9888 | 0.9888 | 0.9888 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | Background_Matting | 4 | 0.9989 | 0.9483 | 0.3594 | 0.9327 | 0.9822 | 1.0383 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8142 | 0.9812 | 1.0425 | | hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3702 | 0.8845 | 0.9703 | 1.1374 | | timm_regnet | 32 | 0.9984 | 0.8586 | 0.3317 | 0.8055 | 0.9374 | 1.0799 | | yolov3 | 16 | 0.9893 | 0.8384 | 0.3319 | 0.8042 | 0.9175 | 1.098 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9976 | 0.9106 | 0.3912 | nan | 0.9166 | 1.0148 | | pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8521 | 0.9118 | 1.105 | | pytorch_stargan | 16 | 0.9966 | 1.009 | 0.4109 | nan | 0.9015 | 
1.0694 |
| timm_resnest | 32 | 0.9926 | 0.8759 | 0.3223 | 0.7295 | 0.8947 | 0.9967 |
| hf_Albert | 8 | 0.9333 | 0.9333 | 0.2846 | nan | 0.8836 | 1.2215 |
| mobilenet_v3_large | 32 | 0.9876 | 0.856 | 0.3277 | 0.7754 | 0.8832 | 0.8974 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8737 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 |
| densenet121 | 4 | 1.0 | 0.8879 | 0.3464 | 0.8612 | 0.8616 | 1.006 |
| resnet50 | 32 | 0.9945 | 0.8704 | 0.3364 | 0.7953 | 0.8552 | 0.9335 |
| mnasnet1_0 | 32 | 0.9878 | 0.8992 | 0.3333 | 0.8256 | 0.8532 | 0.8671 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 0.8669 | 0.8354 | 1.1229 |
| hf_Bart | 4 | 0.9618 | 0.8779 | nan | 0.8322 | 0.8326 | 1.1284 |
| resnext50_32x4d | 8 | 0.9961 | 0.8679 | 0.3584 | 0.8197 | 0.8278 | 0.8346 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | nan | 0.8503 | 0.826 | 1.0815 |
| hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4299 | 0.9629 | 0.8211 | 1.0392 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.8875 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8756 | 0.7632 | 0.8778 |
| timm_vovnet | 32 | 0.9946 | 0.7591 | 0.3201 | 0.7584 | 0.7591 | 0.9501 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3308 | 0.8714 | 0.7507 | 0.8214 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9304 | 0.75 | 0.9991 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7451 | 0.743 | 0.8335 |
| hf_Bert | 4 | 0.9683 | 0.9011 | 0.3525 | 0.857 | 0.7061 | 1.0016 |
| LearningToPaint | 96 | 0.9444 | 0.6943 | 0.3408 | 0.6256 | 0.6926 | 0.9371 |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6949 | 0.6902 | 0.7049 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6638 | 0.6637 | 0.9553 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3212 | 0.8674 | 0.6595 | 0.9466 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | 0.7128 | 0.4867 | 0.6781 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.4335 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4056 | 0.4214 |
| hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.3181 | 0.9882 |
| tacotron2 | 64 | 0.9906 | 1.0302 | nan | 0.7898 | nan | 1.1621 |
| hf_T5 | 8 | 0.9527 | 0.9415 | nan | 0.8195 | nan | 1.1507 |
| hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 |
| dlrm | 2048 | nan | nan | nan | nan | nan | 0.7307 |
| hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2945 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0251 | 0.9037 | 0.0 | 0.0 | 3.1992 | 1.4229 |
| CamemBert | 1 | 1.055 | 0.9271 | 1.3287 | 0.0 | 2.6128 | 1.5143 |
| MT5ForConditionalGeneration | 8 | 1.0242 | 0.9283 | 0.0 | 1.0172 | 2.3194 | 2.0266 |
| DistillGPT2 | 1 | 1.0325 | 0.9137 | 1.1305 | 0.2911 | 2.3133 | 1.8101 |
| GoogleFnet | 1 | 0.9968 | 0.8092 | 0.9811 | 0.0 | 2.0991 | 1.1036 |
| MobileBertForMaskedLM | 32 | 1.0224 | 0.924 | 0.0 | 0.7988 | 2.0179 | 1.5411 |
| GPT2ForSequenceClassification | 4 | 1.0002 | 0.9772 | 0.0 | 0.6609 | 1.799 | 1.7824 |
| T5ForConditionalGeneration | 4 | 1.002 | 0.934 | 0.0 | 0.9176 | 1.4413 | 1.4486 |
| ElectraForQuestionAnswering | 64 | 1.0003 | 0.9842 | 0.0 | 0.6306 | 1.4273 | 1.4076 |
| ElectraForCausalLM | 32 | 1.0005 | 0.9308 | 0.0 | 0.5019 | 1.4144 | 1.4522 |
| M2M100ForConditionalGeneration | 8 | 1.0621 | 0.9669 | 0.8419 | 0.0 | 1.4134 | 1.5781 |
| MobileBertForQuestionAnswering | 64 | 1.0202 | 0.9315 | 0.0 | 0.6829 | 1.4131 | 1.3275 |
| LayoutLMForSequenceClassification | 16 | 0.9996 | 0.9889 | 0.7374 | 0.6824 | 1.3083 | 1.2919 |
| AlbertForQuestionAnswering | 4 | 1.0004 | 1.0009 | 0.0 | 0.0 | 1.2633 | 1.2584 |
| T5Small | 1 | 1.0267 | 0.9357 | 0.0 | 0.0 | 1.2629 | 1.2024 |
| AlbertForMaskedLM | 4 | 1.0004 | 0.9998 | 0.0 | 0.0 | 1.2577 | 1.257 |
| PLBartForConditionalGeneration | 16 | 1.0146 | 0.9637 | 0.0 | 0.768 | 1.2089 | 1.205 |
| MegatronBertForCausalLM | 16 | 1.0328 | 1.0061 | 0.7535 | 0.0 | 1.2084 | 1.1967 |
| LayoutLMForMaskedLM | 16 | 1.0004 | 0.9684 | 0.0 | 0.6335 | 1.2075 | 1.2145 |
| OPTForCausalLM | 32 | 1.0072 | 0.93 | 0.0 | 0.3985 | 1.196 | 1.2028 |
| XGLMForCausalLM | 8 | 1.008 | 0.9441 | 0.7354 | 0.3089 | 1.1833 | 1.1758 |
| DistilBertForQuestionAnswering | 64 | 0.9996 | 0.9848 | 0.7116 | 0.3993 | 1.1698 | 1.1508 |
| RobertaForCausalLM | 64 | 1.0001 | 0.9635 | 0.7465 | 0.551 | 1.1524 | 1.1571 |
| MegatronBertForQuestionAnswering | 16 | 1.0387 | 1.0149 | 0.7607 | 0.0 | 1.138 | 1.1297 |
| RobertaForQuestionAnswering | 128 | 1.0003 | 0.9928 | 0.0 | 0.5646 | 1.1205 | 1.1074 |
| Speech2Text2ForCausalLM | 128 | 0.9993 | 0.9256 | 0.6603 | 0.5402 | 1.1202 | 1.1515 |
| BertForQuestionAnswering | 128 | 1.0002 | 0.993 | 0.0 | 0.5636 | 1.1141 | 1.1064 |
| BartForCausalLM | 4 | 1.0006 | 0.9661 | 0.0 | 0.6744 | 1.1012 | 1.1104 |
| BartForConditionalGeneration | 2 | 1.001 | 0.9888 | 0.0 | 0.3996 | 1.0994 | 1.0887 |
| MBartForConditionalGeneration | 16 | 1.01 | 0.9918 | 0.0 | 0.0 | 1.095 | 1.0822 |
| BigBird | 1 | 0.9952 | 0.9332 | 0.9924 | 0.0 | 1.0945 | 0.9982 |
| PegasusForConditionalGeneration | 16 | 1.0134 | 0.9831 | 0.7556 | 0.0 | 1.0907 | 1.1561 |
| DebertaForMaskedLM | 4 | 0.9219 | 0.7959 | 0.7876 | 0.5935 | 1.0891 | 1.068 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0011 | 0.9385 | 0.0 | 0.6398 | 1.0658 | 1.0722 |
| BertForMaskedLM | 64 | 1.0004 | 0.9611 | 0.7298 | 0.5545 | 1.0606 | 1.0618 |
| DistilBertForMaskedLM | 64 | 0.9999 | 0.9508 | 0.713 | 0.4262 | 1.0511 | 1.0686 |
| DebertaForQuestionAnswering | 8 | 0.9962 | 0.986 | 0.6827 | 0.8028 | 1.0498 | 1.2209 |
| PLBartForCausalLM | 32 | 1.0055 | 0.9357 | 0.7154 | 0.7086 | 1.0315 | 1.0592 |
| BlenderbotSmallForCausalLM | 64 | 1.0011 | 0.9093 | 0.6789 | 0.6606 | 1.0076 | 1.0441 |
| TrOCRForCausalLM | 32 | 1.0014 | 0.9551 | 0.0 | 0.6635 | 1.0033 | 1.015 |
| MBartForCausalLM | 32 | 1.0003 | 0.9542 | 0.0 | 0.0 | 1.0005 | 1.0102 |
| PegasusForCausalLM | 32 | 0.9995 | 0.9527 | 0.7325 | 0.0 | 0.9908 | 1.0041 |
| AllenaiLongformerBase | 1 | 0.9451 | 0.854 | 0.7835 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 4.8956 | 10.337 | 34.1951 | 90.3752 | 100.163 | 35.3692 |
| DebertaForMaskedLM | 4 | 4.8244 | 10.1156 | 34.3051 | 92.7521 | 92.4589 | 33.513 |
| XGLMForCausalLM | 8 | 2.3013 | 9.7805 | 35.6756 | 254.1145 | 89.8317 | 87.4544 |
| M2M100ForConditionalGeneration | 8 | 2.7016 | 11.2534 | 21.2397 | nan | 69.0088 | 62.6878 |
| MobileBertForMaskedLM | 32 | 8.0731 | 23.1962 | nan | 380.1784 | 52.3602 | 49.8162 |
| YituTechConvBert | 1 | 2.165 | 7.9687 | nan | nan | 50.7832 | 46.8033 |
| MobileBertForQuestionAnswering | 64 | 8.2427 | 23.3046 | nan | 372.3205 | 49.3174 | 48.4199 |
| PegasusForConditionalGeneration | 16 | 2.6687 | 11.8855 | 19.9686 | nan | 43.74 | 40.2308 |
| BartForConditionalGeneration | 2 | 2.8453 | 12.2438 | nan | 322.8503 | 42.9334 | 41.5559 |
| MBartForConditionalGeneration | 16 | 2.844 | 12.1639 | nan | nan | 41.5253 | 39.9457 |
| BigBird | 1 | 7.4512 | 12.6672 | 25.5682 | nan | 37.9426 | 24.7829 |
| MT5ForConditionalGeneration | 8 | 3.5346 | 10.8536 | nan | 155.0716 | 35.5909 | 34.1696 |
| MegatronBertForCausalLM | 16 | 3.0866 | 10.3268 | 17.0033 | nan | 33.6764 | 32.4484 |
| T5Small | 1 | 2.362 | 7.341 | nan | nan | 33.3664 | 32.504 |
| LayoutLMForSequenceClassification | 16 | 1.7039 | 5.556 | 8.6544 | 126.1203 | 32.8081 | 31.8595 |
| MegatronBertForQuestionAnswering | 16 | 2.9833 | 10.3706 | 16.7229 | nan | 32.4536 | 31.3569 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.7332 | 8.0057 | nan | 219.6902 | 31.2828 | 29.6068 |
| T5ForConditionalGeneration | 4 | 2.2304 | 8.1592 | nan | 97.6869 | 31.1033 | 29.7479 |
| PLBartForConditionalGeneration | 16 | 1.4255 | 6.3146 | nan | 169.1636 | 28.133 | 27.4005 |
| ElectraForCausalLM | 32 | 1.3912 | 5.1972 | nan | 128.2425 | 27.5733 | 25.655 |
| PegasusForCausalLM | 32 | 1.0438 | 4.6344 | 7.6076 | nan | 22.5805 | 21.0309 |
| LayoutLMForMaskedLM | 16 | 1.6576 | 5.8324 | nan | 131.0558 | 21.7572 | 20.5314 |
| GoogleFnet | 1 | 0.8043 | 2.8333 | 9.0012 | nan | 21.6002 | 13.9097 |
| MBartForCausalLM | 32 | 0.9821 | 4.5271 | nan | nan | 21.5314 | 20.6762 |
| BertForMaskedLM | 64 | 1.3444 | 5.7224 | 8.0495 | 126.5555 | 20.8265 | 20.1299 |
| ElectraForQuestionAnswering | 64 | 1.3301 | 5.2264 | nan | 126.1223 | 20.5882 | 19.8514 |
| TrOCRForCausalLM | 32 | 0.9737 | 4.5942 | nan | 122.9787 | 20.1212 | 19.3007 |
| BartForCausalLM | 4 | 1.0395 | 4.9357 | nan | 121.8756 | 20.0876 | 19.0403 |
| BertForQuestionAnswering | 128 | 1.3653 | 5.7881 | nan | 127.5159 | 19.7063 | 18.6455 |
| RobertaForCausalLM | 64 | 1.36 | 5.2319 | 8.0127 | 132.7032 | 19.0196 | 18.3186 |
| CamemBert | 1 | 1.3867 | 5.4868 | 7.5134 | nan | 18.463 | 17.7782 |
| RobertaForQuestionAnswering | 128 | 1.3941 | 5.3552 | nan | 125.2909 | 18.3863 | 17.4042 |
| OPTForCausalLM | 32 | 1.1122 | 4.6357 | nan | 108.5974 | 17.6696 | 17.2356 |
| GPT2ForSequenceClassification | 4 | 1.3434 | 4.7845 | nan | 92.1192 | 17.4749 | 16.4404 |
| AlbertForMaskedLM | 4 | 1.1425 | 4.5088 | nan | nan | 16.5076 | 15.1489 |
| BlenderbotSmallForCausalLM | 64 | 0.6399 | 3.1005 | 5.0757 | 83.9392 | 16.4035 | 15.7496 |
| DistillGPT2 | 1 | 0.6538 | 2.7232 | 3.7464 | 47.7299 | 16.2464 | 15.6497 |
| AlbertForQuestionAnswering | 4 | 1.1472 | 4.8176 | nan | nan | 16.0833 | 15.05 |
| Speech2Text2ForCausalLM | 128 | 0.631 | 2.4866 | 4.0283 | 56.9112 | 15.2296 | 14.2197 |
| PLBartForCausalLM | 32 | 0.47 | 2.3911 | 3.7914 | 62.834 | 14.5267 | 13.8415 |
| DistilBertForMaskedLM | 64 | 0.4481 | 2.4676 | 5.233 | 54.417 | 12.2822 | 11.9971 |
| DistilBertForQuestionAnswering | 64 | 0.4951 | 2.5833 | 5.4864 | 52.0351 | 11.7751 | 11.1209 |
| AllenaiLongformerBase | 1 | 6.1295 | 13.0487 | 56.1539 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 0.8819 | 1.0595 | 1.1224 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | nan | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9029 | nan | nan | 0.8453 | 1.0606 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | nan | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | nan | 0.842 | 1.3737 |
| BigBird | 1 | 0.999 | 0.9542 | 0.4212 | nan | 0.8224 | 1.0095 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | 0.8577 | 0.8215 | 1.1049 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9267 | 0.3971 | 0.9267 | 0.8157 | 0.9642 |
| DistillGPT2 | 1 | 0.9984 | 0.8115 | 0.3773 | 0.7597 | 0.8063 | 0.926 |
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.7039 | 0.7929 | 0.9036 |
| YituTechConvBert | 1 | 0.9858 | 0.8581 | nan | nan | 0.7891 | 0.8729 |
| PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | nan | 0.7774 | 0.9692 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.8866 | 0.7734 | 0.9515 |
| GoogleFnet | 1 | 0.9983 | 0.9443 | 0.3714 | nan | 0.7698 | 0.9373 |
| MT5ForConditionalGeneration | 8 | 1.0034 | 0.8867 | nan | 0.8712 | 0.7627 | 0.9397 |
| M2M100ForConditionalGeneration | 8 | 0.9856 | 0.9676 | 0.3674 | nan | 0.7621 | 1.0178 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | nan | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3613 | nan | 0.7487 | 0.9186 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.8428 | 0.7381 | 0.9055 |
| PLBartForConditionalGeneration | 16 | 1.0 | 0.8957 | nan | 0.8416 | 0.724 | 0.9375 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | nan | nan | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 0.8653 | 0.7189 | 1.0294 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | nan | 0.7161 | 0.9247 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | nan | 0.8553 | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.8217 | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 0.8762 | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3177 | 0.8865 | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 0.8551 | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.7756 | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | nan | nan | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | nan | 0.8587 | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.737 | 0.6775 | 0.9179 |
| OPTForCausalLM | 32 | 0.9982 | 0.8656 | nan | 0.7894 | 0.6761 | 0.8847 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.669 | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.7194 | 0.6385 | 0.8992 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3642 | 0.7649 | 0.6375 | 0.8974 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.863 | 0.6329 | 0.8939 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.863 | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | nan | 0.6092 | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | nan | 0.6675 | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3552 | 0.8282 | 0.3862 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.063 | 0.2902 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9981 | 0.9515 | 0.3209 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
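
For anyone who wants a per-backend summary of a table like the ones above, here is a minimal sketch. It is not part of the benchmark tooling. It assumes the table text has one row per line, and it treats `0.0` and `nan` entries as missing measurements, which is my assumption rather than something stated in this thread. `parse_table`, `geomean`, and `summarize` are hypothetical helper names. The three sample rows are copied from the huggingface speedup table.

```python
import math

# Three rows copied from the huggingface "Performance speedup" table.
SAMPLE = """
+-----------------------+----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------+----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0251 | 0.9037 | 0.0 | 0.0 | 3.1992 | 1.4229 |
| CamemBert | 1 | 1.055 | 0.9271 | 1.3287 | 0.0 | 2.6128 | 1.5143 |
| AllenaiLongformerBase | 1 | 0.9451 | 0.854 | 0.7835 | 0.0 | 0.0 | 0.0 |
+-----------------------+----+--------+-----------+----------------+-----------------+----------+------------------------+
"""


def parse_table(text):
    """Parse a '+---+'-bordered, pipe-delimited table into a list of dict rows."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.strip().splitlines()
        if line.strip().startswith("|")  # skip the '+----+' border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean(values):
    """Geometric mean of the positive values; nan if none remain.

    Assumption: 0.0 and nan mean "no measurement" and are dropped
    (the comparison `v > 0` is False for nan as well as for 0.0).
    """
    vals = [v for v in values if v > 0]
    if not vals:
        return float("nan")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


def summarize(rows, skip=("name", "bs")):
    """Geometric-mean value per backend column, ignoring the label columns."""
    backends = [key for key in rows[0] if key not in skip]
    return {b: geomean(float(r[b]) for r in rows) for b in backends}


rows = parse_table(SAMPLE)
summary = summarize(rows)
for backend, value in summary.items():
    print(f"{backend}: {value:.4f}")
```

Dropping failed entries makes the mean cover only the models a backend actually ran, so it is worth reporting the number of surviving rows alongside it when comparing backends.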

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9998 | 0.9735 | 0.8186 | 0.0 | 1.8707 | 1.8309 |
| lcnet_050 | 128 | 0.9561 | 0.9454 | 0.7695 | 1.4323 | 1.6596 | 1.6927 |
| convnext_base | 64 | 1.0001 | 0.9991 | 0.0 | 1.3681 | 1.5002 | 1.481 |
| dm_nfnet_f0 | 128 | 1.0002 | 1.0009 | 0.0 | 1.2456 | 1.4698 | 1.4258 |
| hrnet_w18 | 128 | 0.9999 | 0.9993 | 0.0 | 0.0 | 1.4171 | 1.378 |
| dla102 | 128 | 0.9999 | 1.0006 | 0.0 | 0.0 | 1.3839 | 1.3703 |
| volo_d1_224 | 64 | 0.9999 | 0.9937 | 0.0 | 0.0 | 1.3817 | 1.361 |
| nfnet_l0 | 128 | 0.9998 | 0.7885 | 0.0 | 1.216 | 1.3783 | 1.3291 |
| res2net50_14w_8s | 128 | 0.9999 | 1.0 | 0.0 | 1.2418 | 1.3562 | 1.3244 |
| xcit_large_24_p8_224 | 5 | 1.0003 | 0.9722 | 0.0 | 0.0 | 1.3526 | 1.3153 |
| mobilenetv3_large_100 | 128 | 0.9658 | 0.9623 | 0.7644 | 1.2986 | 1.3374 | 1.3443 |
| mobilenetv2_100 | 128 | 0.9669 | 0.9635 | 0.7072 | 1.2317 | 1.3351 | 1.3558 |
| crossvit_9_240 | 128 | 0.9994 | 0.9989 | 0.0 | 0.0 | 1.3275 | 1.3014 |
| adv_inception_v3 | 128 | 0.9999 | 0.9992 | 0.0 | 1.1285 | 1.3271 | 1.3086 |
| gluon_inception_v3 | 128 | 1.0 | 0.9986 | 0.0 | 1.1261 | 1.3254 | 1.3075 |
| inception_v3 | 128 | 0.9999 | 0.9991 | 0.0 | 1.1292 | 1.325 | 1.3082 |
| resnest101e | 64 | 1.0002 | 1.0035 | 0.0 | 0.0 | 1.3142 | 1.2718 |
| res2next50 | 128 | 0.9999 | 1.0014 | 0.0 | 1.1722 | 1.3128 | 1.2745 |
| regnety_002 | 128 | 0.952 | 0.9522 | 0.7554 | 1.0458 | 1.3109 | 1.3175 |
| coat_lite_mini | 128 | 0.9998 | 0.9855 | 0.853 | 0.4082 | 1.2882 | 1.2684 |
| fbnetv3_b | 128 | 0.9645 | 0.9611 | 0.7616 | 0.0 | 1.281 | 1.2977 |
| jx_nest_base | 32 | 0.9998 | 0.9954 | 0.0 | 0.0 | 1.2764 | 1.2496 |
| eca_botnext26ts_256 | 128 | 0.9869 | 0.7709 | 0.0 | 0.0 | 1.2688 | 1.2541 |
| sebotnet33ts_256 | 64 | 0.9755 | 0.8035 | 0.0 | 0.0 | 1.2687 | 1.2705 |
| selecsls42b | 128 | 0.9999 | 1.0002 | 0.8136 | 1.2134 | 1.2673 | 1.253 |
| mnasnet_100 | 128 | 0.9662 | 0.9639 | 0.7856 | 1.2555 | 1.265 | 1.2823 |
| eca_halonext26ts | 128 | 0.9819 | 0.7779 | 0.0 | 0.0 | 1.2622 | 1.245 |
| tf_efficientnet_b0 | 128 | 0.9771 | 0.7836 | 0.0 | 1.2155 | 1.2581 | 1.2643 |
| botnet26t_256 | 128 | 0.9857 | 0.9819 | 0.7871 | 0.0 | 1.2551 | 1.2605 |
| fbnetc_100 | 128 | 0.9668 | 0.9633 | 0.7788 | 1.2458 | 1.2475 | 1.2651 |
| ese_vovnet19b_dw | 128 | 0.9795 | 0.978 | 0.744 | 1.1604 | 1.241 | 1.2463 |
| gmixer_24_224 | 128 | 0.9999 | 0.8102 | 0.0 | 0.0 | 1.2378 | 1.2074 |
| spnasnet_100 | 128 | 0.961 | 0.9565 | 0.7683 | 1.2252 | 1.2348 | 1.2544 |
| res2net101_26w_4s | 64 | 0.9998 | 0.9996 | 0.7725 | 0.0 | 1.2266 | 1.1892 |
| cspdarknet53 | 64 | 0.9575 | 0.9538 | 0.7355 | 1.1971 | 1.2246 | 1.2381 |
| pnasnet5large | 16 | 0.9995 | 0.9991 | 0.0 | 1.0651 | 1.2096 | 1.1932 |
| gmlp_s16_224 | 128 | 0.9999 | 0.9502 | 0.0 | 0.0 | 1.2034 | 1.19 |
| twins_pcpvt_base | 64 | 1.0 | 0.9993 | 0.7511 | 0.0 | 1.2001 | 1.1729 |
| convit_base | 64 | 0.9998 | 0.9988 | 0.0 | 0.858 | 1.1973 | 1.2427 |
| rexnet_100 | 128 | 0.9734 | 0.8161 | 0.0 | 0.0 | 1.1964 | 1.2205 |
| tinynet_a | 128 | 0.9661 | 0.7756 | 0.6207 | 0.0 | 1.1905 | 1.201 |
| dpn107 | 32 | 0.958 | 0.9496 | 0.7788 | 0.0 | 1.1873 | 1.2008 |
| pit_b_224 | 64 | 1.0 | 0.9994 | 0.0 | 0.5954 | 1.187 | 1.1768 |
| cait_m36_384 | 4 | 0.9999 | 1.0268 | 0.0 | 0.0 | 1.1825 | 1.155 |
| repvgg_a2 | 128 | 0.9651 | 0.9634 | 0.8286 | 1.123 | 1.1728 | 1.1684 |
| mobilevit_s | 64 | 0.979 | 0.7624 | 0.0 | 0.0 | 1.1719 | 1.1675 |
| tf_mixnet_l | 128 | 0.9853 | 0.8897 | 0.0 | 0.0 | 1.1679 | 1.1673 |
| poolformer_m36 | 64 | 0.9999 | 0.9993 | 0.0 | 0.0 | 1.1657 | 1.1487 |
| mixnet_l | 128 | 0.9847 | 0.8861 | 0.0 | 0.0 | 1.1504 | 1.1495 |
| swin_base_patch4_window7_224 | 64 | 1.0 | 0.9737 | 0.0 | 0.0 | 1.1435 | 1.1294 |
| beit_base_patch16_224 | 64 | 0.9997 | 0.9751 | 0.0 | 0.5264 | 1.1155 | 1.1027 |
| swsl_resnext101_32x16d | 32 | 0.9999 | 1.0011 | 0.0 | 0.0 | 1.1113 | 1.0714 |
| deit_base_distilled_patch16_224 | 64 | 1.0 | 0.9991 | 0.7671 | 0.5686 | 1.097 | 1.0831 |
| vit_base_patch16_224 | 64 | 0.9997 | 0.9989 | 0.7669 | 0.5569 | 1.0887 | 1.0762 |
| gluon_xception65 | 32 | 0.9998 | 0.997 | 0.0 | 0.0 | 1.0858 | 1.0747 |
| convmixer_768_32 | 32 | 0.9999 | 1.0 | 0.0 | 0.0 | 1.078 | 1.0745 |
| gernet_l | 128 | 0.9746 | 0.9725 | 0.8238 | 1.0954 | 1.0767 | 1.0715 |
| mixer_b16_224 | 128 | 1.0007 | 0.978 | 0.0 | 0.0 | 1.0662 | 1.0608 |
| visformer_small | 128 | 0.9997 | 1.0028 | 0.7957 | 0.0 | 1.0474 | 1.0139 |
| resmlp_12_224 | 128 | 0.9997 | 0.8553 | 0.6121 | 0.8308 | 0.8236 | 0.8177 |
| tnt_s_patch16_224 | 128 | 0.9997 | 0.9988 | 0.0 | 0.0 | 0.0 | 1.544 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| convit_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| crossvit_9_240 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| mobilevit_s | 64 | 1.5582 | 6.1603 | nan | nan | 100.5935 | 99.1834 |
| hrnet_w18 | 128 | 5.6885 | 24.2802 | nan | nan | 98.9672 | 94.1314 |
| swin_base_patch4_window7_224 | 64 | 2.5755 | 10.1773 | nan | nan | 90.1997 | 90.5076 |
| twins_pcpvt_base | 64 | 2.1042 | 11.3489 | 19.2713 | nan | 83.4405 | 80.8248 |
| xcit_large_24_p8_224 | 5 | 2.7309 | 14.0883 | nan | nan | 81.9185 | 78.8071 |
| pnasnet5large | 16 | 4.3354 | 17.9609 | nan | 496.1616 | 74.5354 | 68.9719 |
| convnext_base | 64 | 1.3008 | 5.2787 | nan | 171.8505 | 74.1506 | 71.8297 |
| jx_nest_base | 32 | 1.7181 | 7.4503 | nan | nan | 67.8265 | 65.6176 |
| coat_lite_mini | 128 | 1.0461 | 4.1405 | 6.9103 | 118.1589 | 67.6145 | 65.734 |
| cait_m36_384 | 4 | 2.7472 | 14.6491 | nan | nan | 66.5207 | 63.5776 |
| resnest101e | 64 | 3.026 | 12.9699 | nan | nan | 63.3945 | 60.8792 |
| eca_halonext26ts | 128 | 1.541 | 4.7797 | nan | nan | 61.4961 | 59.8173 |
| sebotnet33ts_256 | 64 | 1.5787 | 5.1323 | nan | nan | 59.3935 | 57.7967 |
| res2net101_26w_4s | 64 | 2.8808 | 13.5109 | 23.0816 | nan | 51.9751 | 48.4206 |
| eca_botnext26ts_256 | 128 | 1.388 | 4.5782 | nan | nan | 50.6242 | 49.4973 |
| res2net50_14w_8s | 128 | 2.59 | 12.1506 | nan | 350.7763 | 48.2841 | 44.2782 |
| gmlp_s16_224 | 128 | 0.9648 | 5.2154 | nan | nan | 45.3338 | 43.5311 |
| poolformer_m36 | 64 | 1.812 | 7.1707 | nan | nan | 44.5393 | 43.1996 |
| botnet26t_256 | 128 | 1.3205 | 3.8171 | 8.5159 | nan | 44.3599 | 42.6349 |
| crossvit_9_240 | 128 | 1.4023 | 6.4627 | nan | nan | 42.5948 | 41.9221 |
| volo_d1_224 | 64 | 1.3277 | 6.2988 | nan | nan | 39.6624 | 36.5146 |
| gluon_xception65 | 32 | 1.7603 | 8.754 | nan | nan | 37.9346 | 35.3731 |
| dpn107 | 32 | 3.8498 | 11.8286 | 35.3979 | nan | 37.1564 | 35.308 |
| fbnetv3_b | 128 | 3.0842 | 9.7118 | 24.4699 | nan | 35.163 | 33.076 |
| gmixer_24_224 | 128 | 1.0693 | 5.8601 | nan | nan | 33.7376 | 32.8115 |
| gluon_inception_v3 | 128 | 1.5088 | 6.7802 | nan | 175.7757 | 33.5223 | 30.6226 |
| inception_v3 | 128 | 1.4685 | 6.9597 | nan | 165.5476 | 32.983 | 30.5189 |
| adv_inception_v3 | 128 | 1.509 | 6.819 | nan | 174.4991 | 32.6199 | 31.0616 |
| swsl_resnext101_32x16d | 32 | 1.6529 | 7.5758 | nan | nan | 32.5476 | 30.4872 |
| ghostnet_100 | 128 | 2.6546 | 8.1326 | 12.834 | nan | 31.5067 | 29.7295 |
| tf_mixnet_l | 128 | 5.7237 | 11.2471 | nan | nan | 31.469 | 29.3197 |
| convit_base | 64 | 1.0828 | 4.6715 | nan | 158.4036 | 30.3981 | 28.5643 |
| dm_nfnet_f0 | 128 | 2.0796 | 6.2375 | nan | 168.2544 | 30.3765 | 27.6032 |
| dla102 | 128 | 1.7011 | 7.7523 | nan | nan | 30.0561 | 27.8543 |
| mixnet_l | 128 | 5.2984 | 10.8892 | nan | nan | 29.8756 | 28.1475 |
| res2next50 | 128 | 1.6025 | 6.7065 | nan | 191.9699 | 27.6307 | 25.6496 |
| rexnet_100 | 128 | 1.8091 | 6.1938 | nan | nan | 26.6233 | 24.2758 |
| resmlp_12_224 | 128 | 0.6474 | 2.4845 | 5.1631 | 52.3399 | 25.279 | 25.335 |
| visformer_small | 128 | 0.927 | 3.4496 | 5.7108 | nan | 25.2652 | 24.2331 |
| tinynet_a | 128 | 1.9814 | 6.727 | 17.3512 | nan | 25.1922 | 24.0037 |
| convmixer_768_32 | 32 | 1.1081 | 4.9545 | nan | nan | 24.6847 | 23.1114 |
| mixer_b16_224 | 128 | 0.6686 | 2.7657 | nan | nan | 24.0861 | 22.9112 |
| cspdarknet53 | 64 | 2.1807 | 6.2179 | 16.6754 | 124.3077 | 22.0543 | 20.93 |
| tf_efficientnet_b0 | 128 | 1.7469 | 5.8472 | nan | 139.9002 | 21.9476 | 21.5269 |
| deit_base_distilled_patch16_224 | 64 | 0.8333 | 3.6123 | 5.8907 | 96.0733 | 21.6979 | 20.3848 |
| nfnet_l0 | 128 | 1.777 | 6.2867 | nan | 154.6448 | 21.1439 | 19.7476 |
| vit_base_patch16_224 | 64 | 0.8346 | 3.5001 | 5.5971 | 90.0989 | 20.8843 | 19.9694 |
| fbnetc_100 | 128 | 1.9747 | 5.5098 | 15.8817 | 125.1091 | 20.8733 | 19.6173 |
| spnasnet_100 | 128 | 1.9226 | 5.4386 | 15.0492 | 123.3157 | 20.7434 | 19.4384 |
| beit_base_patch16_224 | 64 | 1.1579 | 4.3992 | nan | 119.2622 | 20.2015 | 19.4559 |
| mobilenetv3_large_100 | 128 | 1.4833 | 4.7585 | 11.9328 | 131.0168 | 20.0092 | 18.9243 |
| pit_b_224 | 64 | 0.9738 | 4.0022 | nan | 124.2408 | 18.4879 | 17.2859 |
| mobilenetv2_100 | 128 | 1.5555 | 4.6608 | 11.6278 | 111.1594 | 18.41 | 17.211 |
| repvgg_a2 | 128 | 1.9209 | 5.1692 | 13.9113 | 266.477 | 17.8939 | 16.8425 |
| mnasnet_100 | 128 | 1.5266 | 4.4056 | 11.7276 | 96.6612 | 17.6517 | 16.664 |
| gernet_l | 128 | 1.8803 | 5.1595 | 13.7656 | 102.7942 | 17.4011 | 16.3073 |
| regnety_002 | 128 | 1.5245 | 4.6129 | 12.9257 | 101.9683 | 17.3601 | 17.0888 |
| selecsls42b | 128 | 0.7604 | 2.9913 | 5.2219 | 79.3065 | 15.4345 | 14.7859 |
| lcnet_050 | 128 | 0.9956 | 3.0188 | 6.6127 | 68.2053 | 12.9909 | 12.1311 |
| ese_vovnet19b_dw | 128 | 0.9946 | 2.5881 | 5.9498 | 54.049 | 12.5282 | 11.7688 |
| tnt_s_patch16_224 | 128 | 1.6447 | 8.1253 | nan | nan | nan | 33.0095 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | nan | 1.5552 | 1.6267 |
| tinynet_a | 128 | 0.9943 | 0.7798 | 0.2618 | nan | 1.3515 | 1.5856 |
| nfnet_l0 | 128 | 0.993 | 0.8275 | nan | 0.8271 | 1.2905 | 1.4934 |
| rexnet_100 | 128 | 0.9938 | 0.7848 | nan | nan | 1.2631 | 1.4763 |
| tf_efficientnet_b0 | 128 | 0.9936 | 0.7689 | nan | 0.7729 | 1.206 | 1.3823 |
| pnasnet5large | 16 | 1.0657 | 1.0089 | nan | 0.9608 | 1.1828 | 1.3351 |
| mobilevit_s | 64 | 0.9964 | 0.7671 | nan | nan | 1.1799 | 1.3596 |
| mobilenetv2_100 | 128 | 0.9923 | 0.7619 | 0.3063 | 0.7644 | 1.1747 | 1.2829 |
| eca_botnext26ts_256 | 128 | 0.9938 | 0.7674 | nan | nan | 1.1378 | 1.3608 |
| eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1375 | 1.3403 |
| cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1185 | 1.1746 |
| poolformer_m36 | 64 | 0.9983 | 0.9509 | nan | nan | 1.0521 | 1.0698 |
| dm_nfnet_f0 | 128 | 0.9357 | 0.894 | nan | 0.8793 | 1.0221 | 1.0963 |
| beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.9253 | 1.0038 | 1.0607 |
| resnest101e | 64 | 0.9971 | 0.9519 | nan | nan | 1.0033 | 1.1036 |
| vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 0.9131 | 0.997 | 1.0835 |
| fbnetv3_b | 128 | 0.993 | 0.7828 | 0.3098 | nan | 0.9932 | 1.051 |
| deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 0.9157 | 0.9924 | 1.0805 |
| twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3132 | nan | 0.9923 | 1.0856 |
| convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9847 | 0.9968 |
| ghostnet_100 | 128 | 0.9866 | 0.8768 | 0.3271 | nan | 0.9842 | 1.1252 |
| volo_d1_224 | 64 | 0.996 | 0.9213 | nan | nan | 0.9837 | 1.0658 |
| mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | nan | 0.9827 | 1.0538 |
| tf_mixnet_l | 128 | 0.9955 | 0.8572 | nan | nan | 0.9767 | 1.1453 |
| gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | nan | 0.9766 | 0.9827 |
| xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | nan | nan | 0.9633 | 1.0578 |
| dla102 | 128 | 0.9828 | 0.9169 | nan | nan | 0.9625 | 1.0421 |
| ese_vovnet19b_dw | 128 | 0.9923 | 0.8868 | 0.3259 | 0.8551 | 0.951 | 1.0926 |
| gluon_xception65 | 32 | 0.9975 | 0.9358 | nan | nan | 0.9412 | 0.9929 |
| mobilenetv3_large_100 | 128 | 0.9874 | 0.8592 | 0.3245 | 0.7755 | 0.941 | 1.0413 |
| hrnet_w18 | 128 | 0.9955 | 0.9252 | nan | nan | 0.9382 | 1.0121 |
| spnasnet_100 | 128 | 0.9885 |
0.9103 | 0.3308 | 0.8383 | 0.9379 | 0.9927 | | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | nan | 0.9348 | 1.0604 | | mnasnet_100 | 128 | 0.9877 | 0.9022 | 0.3305 | 0.8252 | 0.9325 | 0.9921 | | res2net101_26w_4s | 64 | 0.9967 | 0.9278 | 0.3243 | nan | 0.93 | 1.0167 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7725 | 0.9154 | 0.9655 | | cspdarknet53 | 64 | 0.9954 | 0.8613 | 0.3159 | 0.8261 | 0.9148 | 1.0666 | | adv_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.914 | 1.063 | | gluon_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.9138 | 1.063 | | inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.9138 | 1.063 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.867 | 0.9127 | 0.9981 | | res2next50 | 128 | 0.9955 | 0.9149 | nan | 0.8461 | 0.9075 | 1.0161 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9069 | 1.0516 | | mixnet_l | 128 | 0.995 | 0.845 | nan | nan | 0.9069 | 1.0619 | | fbnetc_100 | 128 | 0.989 | 0.8518 | 0.3236 | 0.7416 | 0.9049 | 0.9971 | | dpn107 | 32 | 0.9986 | 0.9268 | 0.3389 | nan | 0.9047 | 0.9908 | | visformer_small | 128 | 0.9944 | 0.9374 | 0.3291 | nan | 0.9029 | 0.9934 | | selecsls42b | 128 | 0.9885 | 0.8897 | 0.337 | 0.8775 | 0.8987 | 1.0049 | | swsl_resnext101_32x16d | 32 | 0.9992 | 0.8965 | nan | nan | 0.8913 | 0.9925 | | res2net50_14w_8s | 128 | 0.995 | 0.9047 | nan | 0.8422 | 0.8821 | 1.0211 | | regnety_002 | 128 | 0.9718 | 0.8105 | 0.3284 | 0.7203 | 0.8619 | 1.0399 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9624 | | pit_b_224 | 64 | 0.9968 | 0.7946 | nan | 0.7502 | 0.8563 | 1.0753 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.841 | 0.9709 | | coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | 0.7251 | 0.821 | 1.0246 | | gernet_l | 128 | 0.9884 | 0.7891 | 0.32 | 0.7965 | 0.7928 | 0.9932 | | resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | 0.6276 | 0.7899 | 0.7979 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6552 | 0.7684 | 0.9903 | | convit_base | 64 
| 0.9977 | 0.8838 | nan | 0.8573 | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8656 | nan | nan | 0.6584 | 0.8853 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8623 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/2pfbweB.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/Btw6DTN.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/FJlB4r8.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint compression ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When computing the speedup, compilation latency and memory footprint numbers, we exclude the models that fail the accuracy checks.
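
As a rough illustration of the accuracy check described above, here is a minimal sketch in plain Python. It is not the benchmark harness code. The helper names, the flat-list representation of outputs and the tolerances (`rtol`, `atol`) are all assumptions made for the example. Only the status labels (`pass`, `fail_accuracy`, `fail_to_run`) come from the tables in this thread.

```python
import math


def outputs_match(reference, result, rtol=1e-3, atol=1e-3):
    """Return True if every element of result is close to reference.

    rtol and atol are illustrative values, not the dashboard's tolerances.
    """
    if len(reference) != len(result):
        return False
    return all(
        math.isclose(r, x, rel_tol=rtol, abs_tol=atol)
        for r, x in zip(reference, result)
    )


def accuracy_status(reference, run_backend):
    """Classify one backend run using the status labels from the tables.

    reference   -- outputs (or gradients) from the native run, as a flat list
    run_backend -- hypothetical callable that runs the compiled model
    """
    try:
        result = run_backend()
    except Exception:
        # Any crash during compilation or execution counts as fail_to_run.
        return "fail_to_run"
    return "pass" if outputs_match(reference, result) else "fail_accuracy"
```

A model whose status is not a pass would then be dropped before the speedup, latency and memory numbers are aggregated.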

Pass rate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 52/53 | 100%, 42/42 | 100%, 61/61 |
|       aot_eager        | 98%, 52/53 | 100%, 42/42 | 95%, 58/61  |
|     aot_cudagraphs     | 77%, 41/53 | 60%, 25/42  | 79%, 48/61  |
|    nvprims_nvfuser     | 49%, 26/53 |  12%, 5/42  | 33%, 20/61  |
|        inductor        | 87%, 46/53 | 93%, 39/42  | 93%, 57/61  |
| inductor_no_cudagraphs | 91%, 48/53 | 93%, 39/42  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
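
The cells in the table above follow a `percent, passed/total` pattern. A small sketch of how such a cell could be produced from per-model statuses is shown below. The function name is made up for the example, and counting `pass_due_to_skip` as a pass is an assumption, not something this thread confirms.

```python
def pass_rate(statuses):
    """Format a pass-rate cell such as '87%, 46/53' from a list of statuses."""
    total = len(statuses)
    if total == 0:
        raise ValueError("no models to summarize")
    # Assumption: skipped accuracy checks are counted as passing.
    passed = sum(1 for s in statuses if s in ("pass", "pass_due_to_skip"))
    return f"{100 * passed / total:.0f}%, {passed}/{total}"


# Example with the inductor/torchbench counts from the table: 46 of 53 models.
print(pass_rate(["pass"] * 46 + ["fail_to_run"] * 7))
```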

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.12x    |    1.05x    |    1.00x    |
|    nvprims_nvfuser     |   1.03x    |    1.00x    |    1.10x    |
|        inductor        |   1.69x    |    1.76x    |    1.40x    |
| inductor_no_cudagraphs |   1.39x    |    1.54x    |    1.37x    |
+------------------------+------------+-------------+-------------+
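
For readers who want to reproduce the aggregation, here is a minimal sketch of a geometric mean over per-model speedups. It assumes speedup is eager time divided by backend time, and that the `0.0` entries in the per-model tables mark models that failed and should be skipped. Both are assumptions read off the tables, not confirmed harness behavior.

```python
import math


def speedup(eager_seconds, backend_seconds):
    """Speedup of a backend normalized against native (eager) PyTorch."""
    return eager_seconds / backend_seconds


def geomean_speedup(speedups):
    """Geometric mean of the positive speedups.

    Assumption: entries that are 0.0 or nan belong to failed models
    and are excluded from the mean.
    """
    valid = [s for s in speedups if s > 0 and not math.isnan(s)]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Two working models at 2x and 8x plus one failed model give a 4x geomean.
print(geomean_speedup([2.0, 8.0, 0.0]))
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown average out to 1x, which an arithmetic mean would not give.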

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.35    |    2.71     |    2.12     |
|       aot_eager        |    6.89    |    10.10    |    8.51     |
|     aot_cudagraphs     |   10.32    |    17.98    |    16.82    |
|    nvprims_nvfuser     |   65.64    |   126.51    |   163.37    |
|        inductor        |   34.12    |    36.94    |    44.51    |
| inductor_no_cudagraphs |   33.99    |    32.16    |    42.70    |
+------------------------+------------+-------------+-------------+
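
The per-model latency tables contain `nan` for models that did not run on a given backend. A sketch of a mean that skips those entries follows. Whether the dashboard handles `nan` exactly this way is an assumption, and the function name is hypothetical.

```python
import math


def mean_compile_time(latencies):
    """Arithmetic mean of compilation latencies in seconds, skipping nan."""
    valid = [t for t in latencies if not math.isnan(t)]
    if not valid:
        return float("nan")
    return sum(valid) / len(valid)


# One model failed to compile (nan); the mean covers the remaining two.
print(mean_compile_time([2.0, float("nan"), 4.0]))
```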

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.41x    |    0.39x    |    0.32x    |
|    nvprims_nvfuser     |   0.75x    |    0.79x    |    0.69x    |
|        inductor        |   0.84x    |    0.88x    |    0.95x    |
| inductor_no_cudagraphs |   0.97x    |    1.05x    |    1.06x    |
+------------------------+------------+-------------+-------------+
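
To make "higher is better" concrete, the sketch below assumes the compression ratio is eager peak memory divided by backend peak memory. That definition is an assumption consistent with the label, not something stated in this thread. Under it, a value above 1.0 means the backend used less peak memory than eager, and a value below 1.0 means it used more.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Peak memory compression ratio (assumed: eager peak / backend peak)."""
    return eager_peak_bytes / backend_peak_bytes


# Backend peaks at 8 GB where eager peaks at 10 GB: ratio above 1, a saving.
print(compression_ratio(10.0, 8.0))
# Backend peaks at 10 GB where eager peaks at 8 GB: ratio below 1, a regression.
print(compression_ratio(8.0, 10.0))
```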

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0033 | 0.9015 | 1.879 | 0.658 | 5.1575 | 1.4628 | | timm_efficientdet | 1 | 0.9829 | 0.784 | 0.0 | 0.0 | 4.2627 | 1.7504 | | functorch_dp_cifar10 | 64 | 1.0029 | 0.9424 | 1.714 | 0.0 | 3.5578 | 1.5599 | | BERT_pytorch | 16 | 1.0073 | 0.838 | 0.0 | 0.0 | 3.5 | 2.3651 | | timm_vision_transformer | 8 | 1.008 | 0.8477 | 1.7402 | 0.6264 | 3.4415 | 1.5219 | | hf_T5_large | 2 | 1.0234 | 0.8644 | 0.0 | 0.0 | 2.5838 | 2.1275 | | drq | 1 | 1.0155 | 0.7852 | 1.6041 | 0.574 | 2.4461 | 1.1928 | | mobilenet_v3_large | 32 | 1.0042 | 1.0022 | 1.1931 | 0.7416 | 2.4114 | 1.5635 | | hf_Albert | 8 | 1.0004 | 0.9556 | 0.774 | 0.0 | 2.3867 | 2.3231 | | resnext50_32x4d | 8 | 1.0009 | 0.9483 | 1.3076 | 0.6529 | 2.3817 | 1.3673 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9981 | 0.9734 | 1.4081 | 0.0 | 2.2824 | 1.8116 | | mnasnet1_0 | 32 | 1.0022 | 1.0188 | 1.0143 | 0.0 | 2.1057 | 1.5323 | | pytorch_struct | 200 | 0.9837 | 0.7397 | 1.0036 | 0.6332 | 2.0479 | 1.2826 | | lennard_jones | 1000 | 0.9643 | 0.775 | 1.2532 | 0.4807 | 2.0329 | 1.0328 | | hf_Bert | 4 | 1.0342 | 0.8487 | 0.9376 | 0.0 | 2.0194 | 1.8348 | | resnet18 | 16 | 1.0022 | 1.0255 | 1.1343 | 0.723 | 1.9614 | 1.4095 | | hf_GPT2 | 4 | 1.0226 | 0.9814 | 0.0 | 0.302 | 1.9376 | 1.9022 | | hf_T5 | 8 | 1.0006 | 0.927 | 0.0 | 1.1448 | 1.8844 | 1.8904 | | timm_resnest | 32 | 0.9984 | 0.987 | 0.8117 | 0.0 | 1.8604 | 1.7152 | | squeezenet1_1 | 32 | 0.9951 | 0.9614 | 1.0944 | 0.6845 | 1.8428 | 1.4747 | | hf_Bart | 4 | 1.0097 | 0.8298 | 0.0 | 0.0 | 1.7867 | 1.6802 | | dcgan | 32 | 0.9708 | 0.9014 | 1.1358 | 0.6097 | 1.7437 | 
1.0543 | | speech_transformer | 32 | 1.0056 | 0.8371 | 1.9283 | 0.0 | 1.7184 | 1.6755 | | soft_actor_critic | 256 | 0.9769 | 0.7513 | 1.3061 | 0.5475 | 1.6958 | 1.0265 | | timm_efficientnet | 32 | 0.9532 | 0.7745 | 0.8631 | 0.0 | 1.6907 | 1.3531 | | mobilenet_v2 | 96 | 1.0002 | 0.9897 | 0.7617 | 1.2066 | 1.5597 | 1.5169 | | attention_is_all_you_need_pytorch | 256 | 1.0104 | 0.9086 | 0.0 | 0.6226 | 1.5568 | 1.496 | | fastNLP_Bert | 6 | 0.9984 | 0.897 | 0.7655 | 0.0 | 1.5295 | 1.4735 | | shufflenet_v2_x1_0 | 128 | 1.0004 | 1.0426 | 0.8739 | 0.8683 | 1.5153 | 1.3971 | | hf_DistilBert | 8 | 1.0013 | 0.973 | 0.7425 | 0.2927 | 1.515 | 1.4858 | | timm_nfnet | 128 | 0.9995 | 0.9989 | 0.0 | 1.0795 | 1.506 | 1.432 | | LearningToPaint | 96 | 1.0041 | 1.0196 | 0.9995 | 0.8082 | 1.4982 | 1.3456 | | pytorch_stargan | 16 | 0.9993 | 1.1356 | 0.9659 | 0.0 | 1.3615 | 1.3032 | | resnet50 | 32 | 1.0033 | 1.0528 | 0.8149 | 0.7972 | 1.3547 | 1.2955 | | pytorch_unet | 1 | 0.9993 | 0.9925 | 0.8623 | 1.0723 | 1.3392 | 1.315 | | Super_SloMo | 6 | 0.9993 | 0.9952 | 0.885 | 0.0 | 1.2876 | 1.2582 | | vgg16 | 64 | 0.9997 | 0.9976 | 0.8576 | 0.974 | 1.2688 | 1.2597 | | Background_Matting | 4 | 0.9999 | 1.0189 | 0.8911 | 1.0652 | 1.2363 | 1.2197 | | alexnet | 128 | 0.9988 | 0.9969 | 0.8151 | 0.9335 | 1.2092 | 1.2082 | | hf_Reformer | 4 | 0.995 | 0.9912 | 0.9377 | 0.0 | 1.1598 | 1.1514 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9902 | 0.0 | 0.0 | 1.1532 | 1.1378 | | hf_BigBird | 2 | 0.9907 | 0.9096 | 1.054 | 0.8327 | 1.1478 | 1.0279 | | timm_regnet | 32 | 0.9411 | 0.9291 | 0.7688 | 0.0 | 1.1262 | 1.0698 | | yolov3 | 16 | 0.9996 | 0.9897 | 0.803 | 0.0 | 1.0905 | 1.0668 | | timm_vovnet | 32 | 0.9043 | 0.8799 | 0.7146 | 0.7829 | 1.0803 | 1.1085 | | tts_angular | 64 | 0.9731 | 0.9756 | 0.9951 | 0.9672 | 1.0128 | 1.0277 | | demucs | 4 | 0.9999 | 0.9994 | 1.0013 | 0.9998 | 0.9986 | 0.9994 | | nvidia_deeprecommender | 256 | 0.9988 | 0.9958 | 0.6966 | 1.0189 | 0.9892 | 1.0307 | | dlrm | 2048 | 
1.0222 | 1.0711 | 0.0 | 1.191 | 0.9342 | 1.1762 | | hf_GPT2_large | 4 | 1.0001 | 0.9904 | 0.0 | 0.0 | 0.0 | 1.8631 | | tacotron2 | 64 | 0.9753 | 0.7389 | 0.9491 | 0.5731 | 0.0 | 0.8558 | | hf_Longformer | 2 | 0.955 | 0.8718 | 0.8877 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | pass | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | 
fail_to_run | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | pass | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 
| 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_accuracy | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.9713 | 8.3674 | 11.7191 | nan | 409.6973 | 406.6369 | | timm_efficientdet | 1 | 20.024 | 37.0942 | nan | nan | 179.7111 | 175.6608 | | hf_T5_large | 2 | 14.2026 | 39.4361 | nan | nan | 126.4199 | 124.3456 | | timm_vision_transformer_large | 8 | 2.8889 | 15.0787 | nan | nan | 79.604 | 78.1195 | | timm_resnest | 32 | 0.6084 | 2.5009 | 3.8258 | nan | 64.0562 | 62.7377 | | timm_vision_transformer | 8 | 0.9471 | 4.7127 | 6.6947 | 90.06 | 49.2749 | 50.6852 | | pytorch_stargan | 16 | 
0.4097 | 1.9637 | 2.9185 | nan | 48.4463 | 49.8046 | | densenet121 | 4 | 2.2703 | 11.8411 | 18.9168 | 376.4671 | 47.8256 | 46.8549 | | hf_BigBird | 2 | 8.4465 | 15.2041 | 30.5236 | 118.1466 | 44.8799 | 30.4957 | | pytorch_struct | 200 | 0.2811 | 0.8578 | 1.4773 | 6.9564 | 44.4045 | 33.9485 | | attention_is_all_you_need_pytorch | 256 | 1.3585 | 7.1032 | nan | 167.6995 | 43.8107 | 43.4258 | | BERT_pytorch | 16 | 1.7048 | 7.5884 | nan | nan | 39.184 | 38.7295 | | hf_Bart | 4 | 1.8035 | 8.6912 | nan | nan | 34.402 | 33.0668 | | timm_nfnet | 128 | 2.0936 | 7.1208 | nan | 175.6323 | 30.4647 | 30.0212 | | mobilenet_v3_large | 32 | 0.9644 | 4.8053 | 7.1774 | 148.6136 | 30.2717 | 30.8032 | | hf_T5 | 8 | 2.4517 | 8.593 | nan | 100.186 | 30.2513 | 29.173 | | timm_regnet | 32 | 2.3968 | 7.9476 | 19.6104 | nan | 30.2474 | 28.8237 | | fastNLP_Bert | 6 | 1.7453 | 7.1187 | 11.4851 | nan | 29.3975 | 27.5882 | | timm_efficientnet | 32 | 1.8668 | 6.8061 | 15.8491 | nan | 28.8516 | 28.3279 | | hf_Reformer | 4 | 2.539 | 4.6303 | 9.0285 | nan | 26.2671 | 20.8228 | | speech_transformer | 32 | 1.9443 | 8.854 | 67.9039 | nan | 26.244 | 25.6631 | | mnasnet1_0 | 32 | 0.8713 | 4.3485 | 6.4054 | nan | 23.0623 | 22.6232 | | resnext50_32x4d | 8 | 0.9595 | 4.6943 | 6.8338 | 111.0897 | 22.6681 | 22.1341 | | hf_Albert | 8 | 1.4044 | 6.3824 | 9.9541 | nan | 22.668 | 22.0712 | | resnet50 | 32 | 0.9199 | 4.4905 | 6.7999 | 126.5213 | 22.5891 | 21.4958 | | timm_vovnet | 32 | 1.5302 | 4.4534 | 9.8487 | 78.1553 | 22.3564 | 21.2672 | | hf_GPT2 | 4 | 1.5199 | 6.3104 | nan | 80.551 | 21.7027 | 20.6948 | | hf_Bert | 4 | 1.644 | 6.8775 | 10.1778 | nan | 20.6561 | 19.8766 | | shufflenet_v2_x1_0 | 128 | 1.0105 | 4.9662 | 7.4929 | 123.037 | 19.7592 | 19.3098 | | Super_SloMo | 6 | 1.0852 | 4.8313 | 6.6619 | nan | 18.8964 | 18.4435 | | Background_Matting | 4 | 0.9799 | 4.6047 | 6.8346 | 97.1309 | 18.7294 | 17.7211 | | mobilenet_v2 | 96 | 0.8665 | 4.7998 | 6.9049 | 149.54 | 18.6977 | 18.2693 | | functorch_dp_cifar10 
| 64 | 0.3841 | 1.6562 | 2.424 | nan | 16.9141 | 16.4272 | | hf_DistilBert | 8 | 0.6349 | 3.306 | 7.3817 | 56.1113 | 14.8904 | 14.5361 | | resnet18 | 16 | 0.4511 | 1.8 | 2.8353 | 43.4305 | 14.2506 | 13.7885 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4364 | 1.9584 | 2.8261 | nan | 8.9843 | 8.5537 | | pytorch_unet | 1 | 0.485 | 2.0033 | 3.0229 | 42.8064 | 8.9754 | 8.5128 | | LearningToPaint | 96 | 0.4625 | 1.9308 | 3.0254 | 55.0197 | 7.7748 | 7.5429 | | squeezenet1_1 | 32 | 0.2585 | 0.9448 | 1.4174 | 7.1522 | 4.796 | 4.3569 | | vgg16 | 64 | 0.198 | 0.6611 | 1.1219 | 5.8264 | 4.1012 | 3.8904 | | drq | 1 | 0.165 | 0.4871 | 0.798 | 6.4569 | 4.0162 | 3.4631 | | nvidia_deeprecommender | 256 | 0.2208 | 0.5286 | 0.8198 | 6.4682 | 3.7861 | 3.5008 | | dlrm | 2048 | 0.4628 | 0.874 | nan | 5.8681 | 3.7175 | 3.2881 | | alexnet | 128 | 0.1703 | 0.4447 | 0.7414 | 5.4751 | 3.2601 | 3.1369 | | soft_actor_critic | 256 | 0.2185 | 0.3621 | 0.5925 | 3.8228 | 3.2185 | 2.864 | | dcgan | 32 | 0.1753 | 0.4302 | 0.664 | 6.0202 | 2.8705 | 2.6155 | | lennard_jones | 1000 | 0.1577 | 0.3527 | 0.5251 | 4.0354 | 2.1371 | 1.9461 | | tts_angular | 64 | 0.2293 | 0.2873 | 0.4036 | 1.4699 | 1.9231 | 1.7 | | demucs | 4 | 0.3523 | 0.3532 | 0.3514 | 0.3469 | 0.2589 | 0.2609 | | tacotron2 | 64 | 18.1465 | 32.8969 | 51.0274 | 106.4111 | nan | 67.4216 | | hf_GPT2_large | 4 | 5.4904 | 20.0336 | nan | nan | nan | 52.4587 | | hf_Longformer | 2 | 6.4357 | 14.1161 | 58.0458 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | timm_efficientnet | 32 | 0.9892 | 0.7707 | 0.272 | nan | 1.2042 | 1.2299 | | hf_Albert | 8 | 0.9814 | 0.936 | 0.3268 | nan | 1.1576 | 1.4693 | | speech_transformer | 32 | 1.0017 | 0.9174 | 0.3316 | nan | 1.1083 | 1.1145 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.312 | 0.5997 | 1.0603 | 1.1512 | | Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3842 | nan | 1.0536 | 1.2945 | | timm_nfnet | 128 | 0.9691 | 0.8985 | nan | 0.7873 | 1.0337 | 1.1293 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | nan | 0.7853 | 1.0179 | 1.1759 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | Background_Matting | 4 | 1.0059 | 0.9548 | 0.3708 | 0.9233 | 0.9831 | 1.0338 | | timm_efficientdet | 1 | 1.0254 | 0.8401 | nan | nan | 0.9822 | 1.011 | | BERT_pytorch | 16 | 1.0003 | 0.8822 | nan | nan | 0.9743 | 1.1226 | | hf_GPT2 | 4 | 0.9706 | 0.8847 | nan | 0.8601 | 0.9649 | 1.1243 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0038 | 0.8727 | 0.424 | nan | 0.956 | 1.0082 | | timm_regnet | 32 | 0.9945 | 0.8449 | 0.35 | nan | 0.9372 | 1.0312 | | hf_T5 | 8 | 0.9678 | 0.9331 | nan | 0.9059 | 0.9309 | 1.252 | | pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3572 | 0.7273 | 0.911 | 1.0853 | | yolov3 | 16 | 0.985 | 0.8338 | 0.3518 | nan | 0.901 | 1.0402 | | timm_vision_transformer_large | 8 | 0.9974 | 0.8357 | nan | nan | 0.879 | 0.9542 | | timm_resnest | 32 | 0.9875 | 0.8721 | 0.3485 | nan | 0.876 | 0.9969 | | densenet121 | 4 | 0.9883 | 0.866 | 0.3662 | 0.7966 | 0.876 | 1.0026 | | hf_Bert | 4 | 0.9844 | 0.8753 | 0.3903 | nan | 0.8735 | 0.942 | | squeezenet1_1 | 32 | 0.9595 | 0.7951 | 0.346 | 0.5757 | 0.8731 | 1.0627 | | shufflenet_v2_x1_0 | 128 | 0.956 | 0.8419 | 0.3593 | 0.6692 | 0.8727 | 0.9966 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8657 | 1.0681 | | 
resnet50 | 32 | 0.9888 | 0.8617 | 0.3557 | 0.733 | 0.8646 | 0.8839 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 | | hf_DistilBert | 8 | 0.9505 | 0.8806 | 0.3413 | 0.8347 | 0.8384 | 0.9049 | | dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.6247 | 0.8283 | 0.9695 | | hf_Bart | 4 | 0.9102 | 0.831 | nan | nan | 0.8232 | 0.9878 | | hf_BigBird | 2 | 0.9837 | 0.9784 | 0.454 | 0.9208 | 0.8111 | 1.096 | | alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.7444 | 0.7973 | 1.0079 | | mobilenet_v3_large | 32 | 0.9776 | 0.8503 | 0.3454 | 0.6025 | 0.7902 | 0.816 | | pytorch_stargan | 16 | 0.9952 | 0.9707 | 0.4259 | nan | 0.7794 | 0.8862 | | timm_vovnet | 32 | 0.9895 | 0.7676 | 0.3403 | 0.7217 | 0.7791 | 0.8856 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.3776 | 0.6125 | 0.7633 | 1.0588 | | resnext50_32x4d | 8 | 0.9947 | 0.8545 | 0.3881 | 0.7725 | 0.7622 | 0.7746 | | mnasnet1_0 | 32 | 0.9788 | 0.8617 | 0.3406 | nan | 0.7529 | 0.7734 | | drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8295 | 0.752 | 0.9256 | | LearningToPaint | 96 | 0.9245 | 0.7242 | 0.386 | 0.5479 | 0.7376 | 0.9252 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4736 | 0.878 | 0.7295 | 1.0367 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3917 | 0.8515 | 0.7151 | 0.7249 | | dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | 0.704 | 0.7306 | | resnet18 | 16 | 0.9779 | 0.7727 | 0.3941 | 0.5571 | 0.6102 | 0.6257 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4447 | nan | 0.4478 | 0.4806 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5081 | 0.4235 | 0.4353 | | hf_Reformer | 4 | 0.3764 | 0.9847 | 0.3481 | nan | 0.3629 | 0.9878 | | hf_GPT2_large | 4 | 0.9582 | 0.8718 | nan | nan | nan | 1.1354 | | tacotron2 | 64 | 0.9866 | 0.3963 | 0.3142 | 0.3471 | nan | 0.4114 | | hf_Longformer | 2 | 0.9734 | 0.967 | 0.3492 | nan | nan | nan | | moco | 0 | nan | 
nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | YituTechConvBert | 1 | 1.0221 | 0.8199 | 0.0 | 0.0 | 4.5066 | 1.6121 | | MobileBertForMaskedLM | 32 | 1.0183 | 0.8334 | 0.0 | 0.0 | 4.3323 | 1.7903 | | MobileBertForQuestionAnswering | 64 | 1.0169 | 0.83 | 0.0 | 0.7011 | 3.8516 | 1.777 | | CamemBert | 1 | 1.0421 | 0.8354 | 1.7249 | 0.0 | 3.5127 | 1.7685 | | MT5ForConditionalGeneration | 8 | 1.0212 | 0.8677 | 0.0 | 0.0 | 3.5119 | 2.4704 | | DistillGPT2 | 1 | 1.0307 | 0.8621 | 1.2687 | 0.2455 | 2.6113 | 1.9833 | | M2M100ForConditionalGeneration | 8 | 1.0116 | 0.8214 | 1.4564 | 0.0 | 2.4789 | 1.8822 | | GPT2ForSequenceClassification | 4 | 1.0008 | 0.9757 | 0.0 | 0.5013 | 2.3159 | 2.2783 | | ElectraForQuestionAnswering | 64 | 1.0006 | 0.9777 | 0.768 | 0.0 | 2.1282 | 2.0624 | | MegatronBertForQuestionAnswering | 16 | 1.0346 | 0.8578 | 1.0423 | 0.0 | 2.024 | 1.7733 | | PLBartForConditionalGeneration | 16 | 1.0108 | 0.8306 | 0.0 | 0.0 | 1.9515 | 1.8516 | | LayoutLMForSequenceClassification | 16 | 1.0004 | 0.9816 | 0.7762 | 0.0 | 1.8633 | 1.7994 | | MegatronBertForCausalLM | 16 | 1.0322 | 0.8573 | 0.9695 | 0.0 | 1.8368 | 1.7762 | | ElectraForCausalLM | 32 | 1.0002 | 0.9404 | 0.7162 | 0.0 | 1.8274 | 1.8311 | | XGLMForCausalLM | 8 | 1.0108 | 0.8248 | 0.9277 | 0.2989 | 1.7331 | 1.5708 | | MBartForConditionalGeneration | 16 | 1.0137 | 0.8403 | 0.0 | 0.0 | 1.6952 | 1.6076 | | PegasusForConditionalGeneration | 16 | 1.0107 | 0.8194 | 0.9253 | 0.0 | 1.6819 | 1.5327 | | AlbertForQuestionAnswering | 4 | 0.9999 | 0.8857 | 0.0 | 0.0 | 1.6653 | 1.6546 | | AlbertForMaskedLM | 4 | 1.0 | 0.8852 | 0.0 | 0.0 | 1.6552 | 1.6455 | | 
T5Small | 1 | 1.0269 | 0.8661 | 0.0 | 0.0 | 1.6438 | 1.3889 | | LayoutLMForMaskedLM | 16 | 1.0003 | 0.9711 | 0.7577 | 0.0 | 1.6394 | 1.6269 | | T5ForConditionalGeneration | 4 | 1.0069 | 0.918 | 0.0 | 0.972 | 1.6131 | 1.5824 | | Speech2Text2ForCausalLM | 128 | 1.004 | 0.9481 | 0.7164 | 0.0 | 1.5728 | 1.5734 | | OPTForCausalLM | 32 | 1.0091 | 0.926 | 0.0 | 0.3036 | 1.5454 | 1.5229 | | RobertaForQuestionAnswering | 128 | 1.0 | 0.975 | 0.7806 | 0.0 | 1.5067 | 1.4822 | | DistilBertForQuestionAnswering | 64 | 1.001 | 0.9688 | 0.7318 | 0.3004 | 1.4953 | 1.4477 | | BertForQuestionAnswering | 128 | 1.0001 | 0.9835 | 0.7793 | 0.0 | 1.4948 | 1.467 | | BartForConditionalGeneration | 2 | 1.0058 | 0.969 | 0.0 | 0.3126 | 1.4559 | 1.4238 | | RobertaForCausalLM | 64 | 0.9998 | 0.9568 | 0.7537 | 0.0 | 1.447 | 1.4291 | | BartForCausalLM | 4 | 1.0005 | 0.9689 | 0.0 | 0.0 | 1.4452 | 1.4433 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0076 | 0.8837 | 0.0 | 0.0 | 1.4334 | 1.4404 | | BertForMaskedLM | 64 | 0.9996 | 0.9576 | 0.7413 | 0.0 | 1.3468 | 1.3349 | | DebertaForMaskedLM | 4 | 0.9288 | 0.7209 | 0.9227 | 0.0 | 1.3198 | 1.1647 | | PLBartForCausalLM | 32 | 1.0072 | 0.9412 | 0.8259 | 0.0 | 1.2835 | 1.2729 | | DistilBertForMaskedLM | 64 | 1.0001 | 0.9507 | 0.709 | 0.3203 | 1.2646 | 1.2673 | | BlenderbotSmallForCausalLM | 64 | 1.0016 | 0.927 | 0.7033 | 0.0 | 1.2613 | 1.2735 | | MBartForCausalLM | 32 | 1.0022 | 0.9401 | 0.0 | 0.0 | 1.1959 | 1.1955 | | TrOCRForCausalLM | 32 | 1.0019 | 0.9497 | 0.0 | 0.0 | 1.1957 | 1.1904 | | PegasusForCausalLM | 32 | 1.0016 | 0.9532 | 0.7565 | 0.0 | 1.1839 | 1.1837 | | BigBird | 1 | 0.9886 | 0.914 | 1.0424 | 0.0 | 1.1494 | 1.0292 | | DebertaForQuestionAnswering | 8 | 0.9931 | 0.877 | 0.7235 | 0.0 | 1.1433 | 1.2362 | | AllenaiLongformerBase | 1 | 0.9448 | 0.723 | 0.9251 | 0.0 | 0.0 | 0.0 | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ 
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | 
pass | pass | pass | fail_to_run | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass | | CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | 
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | XGLMForCausalLM | 8 | 2.8386 | 13.4509 | 52.7408 | 268.4097 | 103.6738 | 101.2483 | | DebertaForQuestionAnswering | 8 | 5.0106 | 11.1306 | 35.0909 | nan | 100.5812 | 39.8099 | | DebertaForMaskedLM | 4 | 5.0193 | 11.0302 | 35.6386 | nan | 98.1649 | 37.1033 | | M2M100ForConditionalGeneration | 8 | 3.6096 | 17.2024 | 21.5421 | nan | 75.6437 | 69.346 | | MobileBertForMaskedLM | 32 | 9.2447 | 32.0226 | nan | nan | 69.1529 | 66.9361 | | MobileBertForQuestionAnswering | 64 | 9.3271 | 32.3888 | nan | 478.9142 | 66.7718 | 65.2175 | | YituTechConvBert | 1 | 2.5274 | 10.8882 | nan | nan | 54.8605 | 52.5258 | | PegasusForConditionalGeneration | 16 | 3.3584 | 16.3663 | 27.4382 | nan | 53.954 | 49.8043 | | BartForConditionalGeneration | 2 | 3.476 | 16.6584 | nan | 380.351 | 50.6717 | 49.5775 | | MBartForConditionalGeneration | 16 | 3.5675 | 17.0508 | nan | nan | 50.6574 | 49.4499 | | BigBird | 1 | 8.5459 | 15.4481 | 30.1746 | nan | 45.2071 | 28.8193 | | MegatronBertForCausalLM | 16 | 3.3972 | 14.0672 | 22.167 | nan | 41.3604 | 40.496 | | MT5ForConditionalGeneration | 8 | 3.6662 | 12.7706 | nan | nan | 41.3204 | 39.6593 | | MegatronBertForQuestionAnswering | 16 | 3.591 | 14.2502 | 22.3078 | nan | 40.506 | 39.5391 | | T5Small | 1 | 2.4065 | 8.7321 | nan | nan | 37.1599 | 35.2934 | | BlenderbotSmallForConditionalGeneration | 64 | 2.1629 | 11.3975 | nan | nan | 36.1284 | 35.1947 | | LayoutLMForSequenceClassification | 16 | 
2.0479 | 7.357 | 11.1129 | nan | 33.5797 | 32.7534 | | T5ForConditionalGeneration | 4 | 2.4315 | 8.8742 | nan | 105.3033 | 32.268 | 31.0492 | | PLBartForConditionalGeneration | 16 | 1.7568 | 8.8891 | nan | nan | 31.8842 | 31.2405 | | ElectraForCausalLM | 32 | 1.7312 | 7.0514 | 10.9909 | nan | 28.6975 | 26.5282 | | PegasusForCausalLM | 32 | 1.2937 | 6.3983 | 10.1326 | nan | 26.3373 | 24.3264 | | MBartForCausalLM | 32 | 1.2474 | 6.4951 | nan | nan | 24.3524 | 23.9492 | | TrOCRForCausalLM | 32 | 1.2809 | 6.3843 | nan | nan | 23.8198 | 23.0001 | | BartForCausalLM | 4 | 1.2954 | 6.323 | nan | nan | 23.7137 | 22.7998 | | LayoutLMForMaskedLM | 16 | 2.1008 | 7.3941 | 11.5062 | nan | 23.657 | 22.6423 | | OPTForCausalLM | 32 | 1.293 | 6.4716 | nan | 110.7435 | 23.4477 | 22.2867 | | ElectraForQuestionAnswering | 64 | 1.7064 | 7.1746 | 10.7336 | nan | 22.9413 | 21.9198 | | BertForMaskedLM | 64 | 1.5728 | 6.9965 | 10.5253 | nan | 22.7582 | 22.4665 | | BertForQuestionAnswering | 128 | 1.6553 | 6.9369 | 10.6148 | nan | 21.5664 | 20.4151 | | RobertaForCausalLM | 64 | 1.5889 | 6.9391 | 10.5673 | nan | 21.4472 | 20.9208 | | CamemBert | 1 | 1.7039 | 6.9968 | 10.2171 | nan | 20.7383 | 19.9558 | | GPT2ForSequenceClassification | 4 | 1.5266 | 6.5816 | nan | 85.2969 | 20.1007 | 19.9004 | | RobertaForQuestionAnswering | 128 | 1.6248 | 7.164 | 10.7512 | nan | 19.7014 | 18.8167 | | AlbertForMaskedLM | 4 | 1.4685 | 6.5569 | nan | nan | 19.6377 | 18.1707 | | BlenderbotSmallForCausalLM | 64 | 0.8162 | 4.1574 | 6.623 | nan | 19.3261 | 19.0014 | | AlbertForQuestionAnswering | 4 | 1.4323 | 6.4788 | nan | nan | 18.7365 | 17.4479 | | DistillGPT2 | 1 | 0.7674 | 3.2257 | 4.5629 | 54.7875 | 17.7285 | 16.9032 | | Speech2Text2ForCausalLM | 128 | 0.6714 | 3.3647 | 5.1292 | nan | 17.4007 | 15.786 | | PLBartForCausalLM | 32 | 0.6607 | 3.4 | 5.1422 | nan | 16.3158 | 15.9174 | | DistilBertForMaskedLM | 64 | 0.627 | 3.3849 | 7.1756 | 57.1949 | 13.8729 | 13.6807 | | DistilBertForQuestionAnswering | 64 | 0.6121 
| 3.4406 | 7.415 | 54.9303 | 13.3326 | 12.9348 | | AllenaiLongformerBase | 1 | 6.9574 | 14.8217 | 59.272 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 | | AlbertForMaskedLM | 4 | 0.9998 | 0.7431 | nan | nan | 1.0992 | 1.5169 | | GPT2ForSequenceClassification | 4 | 0.9675 | 0.9164 | nan | 0.8823 | 1.0775 | 1.1632 | | BartForCausalLM | 4 | 1.0 | 0.8997 | nan | nan | 1.0568 | 1.1144 | | ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 | | RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 | | BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 | | LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 | | BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | 0.8978 | 0.9837 | 1.1976 | | PegasusForCausalLM | 32 | 0.9749 | 0.8906 | 0.4175 | nan | 0.9708 | 1.0363 | | T5ForConditionalGeneration | 4 | 0.9996 | 0.9527 | nan | 0.8385 | 0.9662 | 1.1856 | | BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | nan | nan | 0.9593 | 1.1105 | | T5Small | 1 | 1.0 | 0.8865 | nan | nan | 0.9567 | 1.1277 | | LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3662 | nan | 0.9481 | 0.9848 | | MBartForCausalLM | 32 | 1.0 | 0.8924 | nan | nan | 0.9417 | 1.0114 | | BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3787 | nan | 0.9293 | 0.9793 | | 
RobertaForCausalLM | 64 | 0.999 | 0.8994 | 0.3788 | nan | 0.9289 | 0.9789 | | DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3467 | 0.81 | 0.9267 | 1.0655 | | OPTForCausalLM | 32 | 0.9996 | 0.8679 | nan | 0.6772 | 0.925 | 1.0061 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | nan | nan | 0.9218 | 1.0986 | | TrOCRForCausalLM | 32 | 1.0 | 0.8921 | nan | nan | 0.921 | 0.9877 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.4377 | nan | 0.9159 | 1.0993 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0179 | | MegatronBertForCausalLM | 16 | 0.9998 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0276 | | PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | nan | nan | 0.8843 | 1.0294 | | DistilBertForMaskedLM | 64 | 1.0 | 0.86 | 0.3635 | 0.6373 | 0.8803 | 0.948 | | MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | nan | nan | 0.8751 | 0.919 | | Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8196 | 0.3532 | nan | 0.8691 | 0.9801 | | ElectraForCausalLM | 32 | 0.9974 | 0.848 | 0.3928 | nan | 0.856 | 0.9308 | | PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3979 | nan | 0.8549 | 0.9361 | | BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | 0.3687 | nan | 0.846 | 0.9426 | | BigBird | 1 | 1.0008 | 0.9533 | 0.4474 | nan | 0.8178 | 1.0873 | | CamemBert | 1 | 0.9989 | 0.8143 | 0.416 | nan | 0.8061 | 0.9309 | | XGLMForCausalLM | 8 | 0.9918 | 0.9231 | 0.4336 | 0.7755 | 0.8055 | 0.9902 | | DistillGPT2 | 1 | 0.9963 | 0.7984 | 0.4006 | 0.7468 | 0.7997 | 1.016 | | YituTechConvBert | 1 | 0.9718 | 0.8664 | nan | nan | 0.7883 | 0.9276 | | M2M100ForConditionalGeneration | 8 | 0.9818 | 0.9401 | 0.4735 | nan | 0.7694 | 1.0353 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | nan | nan | 0.6698 | 0.9454 | | MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | nan | 0.8274 | 0.6085 | 0.8221 | | DebertaForMaskedLM | 4 | 0.9982 | 0.9824 | 0.3624 | nan | 0.409 | 1.0674 | | DebertaForQuestionAnswering | 8 | 0.9754 | 1.0737 | 0.3252 | nan | 
0.3071 | 1.1931 | | AllenaiLongformerBase | 1 | 0.9977 | 0.9476 | 0.3856 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | tnt_s_patch16_224 | 128 | 0.9999 | 0.9982 | 0.0 | 0.0 | 2.1317 | 2.1007 | | ghostnet_100 | 128 | 0.9996 | 0.9766 | 0.8598 | 0.0 | 2.0096 | 1.9488 | | twins_pcpvt_base | 64 | 1.0043 | 0.9166 | 0.8586 | 0.0 | 1.9895 | 1.6939 | | lcnet_050 | 128 | 0.9479 | 0.9342 | 0.791 | 1.1332 | 1.8671 | 1.7225 | | xcit_large_24_p8_224 | 5 | 1.0007 | 0.0 | 0.0 | 0.0 | 1.8316 | 1.7405 | | regnety_002 | 128 | 0.9728 | 0.9306 | 1.0081 | 0.8601 | 1.67 | 1.5503 | | volo_d1_224 | 64 | 0.9998 | 0.9936 | 0.0 | 0.0 | 1.5939 | 1.5532 | | dla102 | 128 | 1.0001 | 0.9964 | 0.8395 | 0.0 | 1.5786 | 1.5502 | | nfnet_l0 | 128 | 0.9999 | 0.8064 | 0.7161 | 0.9301 | 1.5611 | 1.4722 | | gmlp_s16_224 | 128 | 0.9998 | 0.9436 | 0.0 | 0.5365 | 1.5563 | 1.5208 | | hrnet_w18 | 128 | 1.0001 | 0.9938 | 0.8384 | 0.0 | 1.5436 | 1.4602 | | resnest101e | 64 | 0.9997 | 0.9911 | 0.8227 | 0.0 | 1.5315 | 1.4351 | | swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9607 | 0.0 | 0.0 | 1.5307 | 1.5073 | | coat_lite_mini | 128 | 0.9998 | 0.9772 | 0.859 | 0.0 | 1.5254 | 1.5065 | | gluon_inception_v3 | 128 | 0.9995 | 0.9961 | 0.854 | 1.151 | 1.5071 | 1.472 | | adv_inception_v3 | 128 | 0.9998 | 0.9961 | 0.8548 | 1.1517 | 1.5053 | 1.4715 | | inception_v3 | 128 | 0.9999 | 0.9964 | 0.8541 | 1.1506 | 1.5021 | 1.4683 | | dm_nfnet_f0 | 128 | 0.9992 | 0.9967 | 0.0 | 1.0895 | 1.4964 | 1.4287 | | gmixer_24_224 | 128 | 1.0 | 0.8438 | 0.0 | 0.0 | 1.4844 | 1.4784 | | cait_m36_384 | 4 | 1.0005 | 1.0086 | 0.0 | 0.0 | 1.4676 | 1.415 | | res2net50_14w_8s | 128 | 0.9999 | 0.9953 | 0.8129 | 1.1553 | 1.4639 | 1.392 | | convnext_base | 64 | 1.0 | 0.9978 | 
0.0 | 0.0 | 1.4576 | 1.4244 | | mobilenetv3_large_100 | 128 | 0.9546 | 0.9386 | 0.7791 | 1.1726 | 1.4557 | 1.4707 | | crossvit_9_240 | 128 | 0.9999 | 0.9947 | 0.8389 | 0.0 | 1.4494 | 1.4173 | | selecsls42b | 128 | 0.9997 | 0.9953 | 0.8412 | 1.2933 | 1.4441 | 1.4131 | | mnasnet_100 | 128 | 0.953 | 0.9438 | 0.789 | 0.0 | 1.4365 | 1.4629 | | res2next50 | 128 | 0.9996 | 0.9959 | 0.8244 | 1.1657 | 1.4176 | 1.3452 | | mobilenetv2_100 | 128 | 0.9521 | 0.9406 | 0.7218 | 1.2531 | 1.4079 | 1.434 | | res2net101_26w_4s | 64 | 1.0028 | 1.0088 | 0.7908 | 0.0 | 1.4033 | 1.2099 | | jx_nest_base | 32 | 0.9997 | 0.9909 | 0.0 | 0.0 | 1.3937 | 1.355 | | fbnetv3_b | 128 | 0.952 | 0.9409 | 0.7748 | 0.0 | 1.3881 | 1.4053 | | ese_vovnet19b_dw | 128 | 0.9715 | 0.965 | 0.7686 | 1.1926 | 1.376 | 1.3799 | | spnasnet_100 | 128 | 0.9475 | 0.9374 | 0.7785 | 0.0 | 1.3723 | 1.3993 | | mobilevit_s | 64 | 0.9728 | 0.812 | 0.6573 | 0.0 | 1.3723 | 1.3601 | | fbnetc_100 | 128 | 0.9521 | 0.9445 | 0.7871 | 0.0 | 1.3579 | 1.3749 | | tf_efficientnet_b0 | 128 | 0.9651 | 0.8086 | 0.6673 | 0.0 | 1.3519 | 1.3542 | | convit_base | 64 | 1.0001 | 0.9964 | 0.0 | 0.0 | 1.3377 | 1.3274 | | cspdarknet53 | 64 | 0.9424 | 0.9335 | 0.7561 | 1.2645 | 1.3299 | 1.3403 | | poolformer_m36 | 64 | 0.9997 | 0.9983 | 0.8079 | 0.0 | 1.3279 | 1.2966 | | pit_b_224 | 64 | 0.9994 | 0.9954 | 0.8204 | 0.6774 | 1.3236 | 1.3172 | | botnet26t_256 | 128 | 0.9796 | 0.9711 | 0.8096 | 0.0 | 1.2916 | 1.2861 | | beit_base_patch16_224 | 64 | 1.0 | 0.9786 | 0.0 | 0.0 | 1.2853 | 1.27 | | deit_base_distilled_patch16_224 | 64 | 1.0001 | 0.9918 | 0.7969 | 0.5961 | 1.2815 | 1.2652 | | mixer_b16_224 | 128 | 1.0 | 0.9589 | 0.7781 | 0.0 | 1.2811 | 1.2668 | | eca_botnext26ts_256 | 128 | 0.9808 | 0.8099 | 0.6703 | 1.1616 | 1.279 | 1.2718 | | rexnet_100 | 128 | 0.9629 | 0.8502 | 0.6909 | 0.0 | 1.2769 | 1.2786 | | visformer_small | 128 | 0.9996 | 1.0018 | 0.8445 | 0.0 | 1.2444 | 1.1807 | | tinynet_a | 128 | 0.9488 | 0.7961 | 0.6479 | 0.0 | 1.2351 | 1.2415 | | 
pnasnet5large | 16 | 0.9991 | 0.9924 | 0.7953 | 0.0 | 1.2185 | 1.1868 | | sebotnet33ts_256 | 64 | 0.9665 | 0.8337 | 0.679 | 1.0229 | 1.2138 | 1.2081 | | vit_base_patch16_224 | 64 | 1.0 | 0.9936 | 0.8352 | 0.6362 | 1.1961 | 1.1804 | | tf_mixnet_l | 128 | 0.9806 | 0.9094 | 0.7962 | 0.0 | 1.1781 | 1.1714 | | mixnet_l | 128 | 0.9794 | 0.9055 | 0.795 | 0.0 | 1.162 | 1.1536 | | gluon_xception65 | 32 | 0.9997 | 0.9877 | 0.7503 | 0.0 | 1.1593 | 1.1261 | | swsl_resnext101_32x16d | 32 | 0.9995 | 0.9847 | 0.8157 | 0.0 | 1.1495 | 1.0704 | | repvgg_a2 | 128 | 0.9418 | 0.9339 | 0.8004 | 0.0 | 1.1442 | 1.1592 | | dpn107 | 32 | 0.9306 | 0.9137 | 0.745 | 0.0 | 1.1403 | 1.1537 | | resmlp_12_224 | 128 | 0.9999 | 1.0079 | 0.7895 | 1.0507 | 1.1202 | 1.0901 | | gernet_l | 128 | 0.9459 | 0.9375 | 0.7704 | 1.0721 | 1.0654 | 1.0779 | | convmixer_768_32 | 32 | 0.9999 | 0.9982 | 0.9224 | 0.0 | 1.0551 | 1.0508 | | eca_halonext26ts | 128 | 0.9811 | 0.8167 | 0.6794 | 0.0 | 0.0 | 0.0 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | fail_to_run | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | mixnet_l | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | 
pass | | pnasnet5large | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass | | rexnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | 
pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | fbnetv3_b | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | gluon_xception65 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | 
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | twins_pcpvt_base | 64 | 2.7612 | 15.064 | 25.8978 | nan | 142.9682 | 139.6693 | | hrnet_w18 | 128 | 6.3441 | 30.3353 | 57.3778 | nan | 117.0659 | 112.8899 | | mobilevit_s | 64 | 1.8207 | 7.9588 | 15.6754 | nan | 95.4085 | 94.2017 | | swin_base_patch4_window7_224 | 64 | 2.9369 | 12.7806 | nan | nan | 92.8252 | 90.3513 | | xcit_large_24_p8_224 | 5 | 3.2175 | nan | nan | nan | 92.7246 | 87.2524 | | convnext_base | 64 | 1.5145 | 6.8593 | nan | nan | 87.5358 | 86.7686 | | pnasnet5large | 16 | 4.8778 | 21.9423 | 41.0102 | nan | 86.4955 | 82.6063 | | resnest101e | 64 | 3.6081 | 16.6655 | 27.3074 | nan | 77.6632 | 75.6609 | | cait_m36_384 | 4 | 3.4248 | 19.57 | nan | nan | 76.5955 | 73.951 | | coat_lite_mini | 128 | 1.2085 | 5.2914 | 8.5916 | nan | 72.9266 | 72.2273 | | jx_nest_base | 32 | 1.8148 | 9.5322 | nan | nan | 71.34 | 69.6496 | | sebotnet33ts_256 | 64 | 1.6892 | 5.973 | 14.2459 | 189.3768 | 65.7529 | 63.7289 | | res2net101_26w_4s | 64 | 3.3055 | 16.3482 | 28.037 | nan | 62.15 | 57.4394 | | res2net50_14w_8s | 128 | 2.9089 | 14.5291 | 24.4436 | 433.4916 | 55.4029 | 52.616 | | eca_botnext26ts_256 | 128 | 1.4549 | 4.9297 | 10.969 | 152.4185 | 54.2548 | 54.6319 | | gmlp_s16_224 | 128 | 1.2833 | 7.3049 | nan | 209.3998 | 50.3294 | 47.9753 | | botnet26t_256 | 128 | 1.392 | 4.3479 | 9.4266 | nan | 49.909 | 48.5008 | | crossvit_9_240 | 128 | 1.7108 | 8.5134 | 13.3964 | nan | 48.2298 | 47.1319 | | poolformer_m36 | 64 | 1.9064 | 8.2316 
| 13.3653 | nan | 47.8279 | 44.6175 | | volo_d1_224 | 64 | 1.4519 | 7.5371 | nan | nan | 45.4667 | 43.1037 | | dpn107 | 32 | 4.0749 | 13.3921 | 39.6618 | nan | 42.9078 | 40.4106 | | gluon_xception65 | 32 | 2.0782 | 10.9694 | 17.998 | nan | 42.8714 | 41.208 | | tnt_s_patch16_224 | 128 | 1.8485 | 10.6038 | nan | nan | 41.6325 | 38.9881 | | fbnetv3_b | 128 | 3.3598 | 11.9645 | 27.7625 | nan | 40.8999 | 39.365 | | gmixer_24_224 | 128 | 1.4576 | 8.2437 | nan | nan | 38.6736 | 37.4894 | | adv_inception_v3 | 128 | 1.6331 | 8.3646 | 13.4435 | 208.9009 | 37.1594 | 34.6386 | | gluon_inception_v3 | 128 | 1.6779 | 8.3651 | 13.7188 | 202.8281 | 36.6878 | 34.8025 | | swsl_resnext101_32x16d | 32 | 1.8397 | 9.1481 | 14.8049 | nan | 36.5704 | 34.5287 | | dla102 | 128 | 1.8746 | 9.4671 | 15.1408 | nan | 36.4808 | 32.9892 | | ghostnet_100 | 128 | 3.0778 | 9.6029 | 14.5333 | nan | 36.3024 | 34.7502 | | inception_v3 | 128 | 1.6499 | 8.3149 | 13.4745 | 206.7449 | 36.2788 | 34.7718 | | tf_mixnet_l | 128 | 5.7936 | 12.9018 | 27.2546 | nan | 35.9804 | 33.8779 | | mixnet_l | 128 | 5.4943 | 12.7038 | 26.7263 | nan | 35.2364 | 33.1053 | | convit_base | 64 | 1.2077 | 6.0058 | nan | nan | 32.6502 | 31.4665 | | dm_nfnet_f0 | 128 | 2.2159 | 7.388 | nan | 173.3215 | 32.1364 | 30.2798 | | res2next50 | 128 | 1.7321 | 8.1543 | 13.3383 | 261.0585 | 31.3377 | 29.6415 | | rexnet_100 | 128 | 2.0263 | 7.359 | 17.1863 | nan | 29.6458 | 28.1015 | | tinynet_a | 128 | 2.2036 | 8.0667 | 19.8203 | nan | 29.143 | 27.3635 | | resmlp_12_224 | 128 | 0.7094 | 3.1748 | 6.8428 | 65.3575 | 27.9654 | 27.1245 | | convmixer_768_32 | 32 | 1.2415 | 6.173 | 10.3846 | nan | 27.5986 | 25.9857 | | mixer_b16_224 | 128 | 0.8202 | 3.9895 | 6.983 | nan | 27.13 | 26.2636 | | visformer_small | 128 | 0.9764 | 4.022 | 6.2875 | nan | 26.911 | 25.8943 | | deit_base_distilled_patch16_224 | 64 | 1.0091 | 4.675 | 7.784 | 89.4587 | 25.9308 | 24.7956 | | tf_efficientnet_b0 | 128 | 1.9042 | 7.0663 | 16.4807 | nan | 25.7278 | 24.3673 | | 
cspdarknet53 | 64 | 2.3577 | 7.3973 | 18.6343 | 171.017 | 25.676 | 24.6777 | | vit_base_patch16_224 | 64 | 0.9949 | 4.7823 | 7.254 | 97.6307 | 25.6353 | 24.6944 | | fbnetc_100 | 128 | 2.1005 | 6.6747 | 17.6512 | nan | 24.6547 | 22.8858 | | spnasnet_100 | 128 | 2.0729 | 6.5636 | 16.7528 | nan | 23.8342 | 22.5371 | | nfnet_l0 | 128 | 1.8712 | 7.6499 | 10.7899 | 162.2558 | 23.2852 | 22.1583 | | beit_base_patch16_224 | 64 | 1.2196 | 5.9036 | nan | nan | 23.2248 | 22.2672 | | mobilenetv3_large_100 | 128 | 1.6288 | 5.8408 | 13.9528 | 161.7474 | 22.87 | 21.7872 | | pit_b_224 | 64 | 1.1471 | 5.3362 | 9.0349 | 113.4609 | 21.2716 | 20.296 | | mobilenetv2_100 | 128 | 1.6471 | 5.7909 | 13.1629 | 144.2845 | 20.848 | 19.5496 | | mnasnet_100 | 128 | 1.6186 | 5.5782 | 13.2221 | nan | 20.2561 | 19.0402 | | regnety_002 | 128 | 1.6412 | 5.6166 | 13.6458 | 138.5268 | 20.1628 | 18.8518 | | gernet_l | 128 | 2.0456 | 6.0509 | 15.4004 | 140.089 | 19.8553 | 18.8126 | | repvgg_a2 | 128 | 2.0394 | 5.9911 | 15.2077 | nan | 19.4882 | 18.5417 | | selecsls42b | 128 | 0.8615 | 3.8025 | 6.2587 | 103.0517 | 18.0208 | 16.8338 | | lcnet_050 | 128 | 1.0768 | 3.414 | 7.5059 | 91.0657 | 14.7069 | 13.8626 | | ese_vovnet19b_dw | 128 | 1.1041 | 3.1054 | 6.7363 | 69.5019 | 13.966 | 13.2088 | | eca_halonext26ts | 128 | 1.4753 | 5.0804 | 11.2924 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | tinynet_a | 128 | 0.9889 | 0.7887 | 0.2766 | nan | 1.37 | 1.5056 | | gmixer_24_224 | 128 | 0.9926 | 0.9248 | 
nan | nan | 1.3102 | 1.3732 | | gmlp_s16_224 | 128 | 0.9937 | 0.9495 | nan | 0.9045 | 1.2842 | 1.2998 | | tf_efficientnet_b0 | 128 | 0.9877 | 0.7695 | 0.2666 | nan | 1.1888 | 1.3559 | | mobilevit_s | 64 | 0.993 | 0.7669 | 0.2733 | nan | 1.1832 | 1.3099 | | pnasnet5large | 16 | 1.0567 | 0.9911 | 0.3633 | nan | 1.1589 | 1.2896 | | rexnet_100 | 128 | 0.9884 | 0.7848 | 0.2849 | nan | 1.147 | 1.3177 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2669 | 0.5588 | 1.1068 | 1.2643 | | poolformer_m36 | 64 | 0.9983 | 0.9433 | 0.3413 | nan | 1.1018 | 1.1171 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0828 | 1.1492 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3474 | nan | 1.0595 | 1.1461 | | mobilenetv2_100 | 128 | 0.9859 | 0.7635 | 0.3108 | 0.5982 | 1.0588 | 1.1524 | | convit_base | 64 | 0.9966 | 0.8516 | nan | nan | 1.0528 | 1.1534 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | nan | nan | 1.0378 | 1.1389 | | dm_nfnet_f0 | 128 | 0.9692 | 0.8981 | nan | 0.7871 | 1.0336 | 1.1292 | | nfnet_l0 | 128 | 0.9887 | 0.8167 | 0.2678 | 0.524 | 1.0318 | 1.1803 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 0.7316 | 0.9907 | 1.2281 | | fbnetv3_b | 128 | 0.9872 | 0.783 | 0.3151 | nan | 0.986 | 1.043 | | convmixer_768_32 | 32 | 0.9972 | 0.9785 | 0.3447 | nan | 0.9759 | 0.9792 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9749 | 1.0803 | | visformer_small | 128 | 0.9897 | 0.9255 | 0.3467 | nan | 0.9613 | 1.0514 | | dla102 | 128 | 0.9684 | 0.9114 | 0.3365 | nan | 0.9554 | 1.0311 | | ghostnet_100 | 128 | 0.9758 | 0.8691 | 0.337 | nan | 0.9485 | 1.0703 | | tf_mixnet_l | 128 | 0.9907 | 0.8555 | 0.2876 | nan | 0.9364 | 1.0873 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9932 | | mobilenetv3_large_100 | 128 | 0.9773 | 0.8402 | 0.3302 | 0.5782 | 0.9298 | 1.0259 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.929 | 0.9775 | | ese_vovnet19b_dw | 128 | 0.9855 | 0.8559 
| 0.3271 | 0.7015 | 0.9176 | 1.0681 | | swsl_resnext101_32x16d | 32 | 0.9988 | 0.8771 | 0.3668 | nan | 0.9094 | 0.9794 | | mixer_b16_224 | 128 | 0.992 | 0.9362 | 0.3444 | nan | 0.9073 | 0.9799 | | dpn107 | 32 | 0.9981 | 0.9084 | 0.3528 | nan | 0.9061 | 0.9959 | | res2net101_26w_4s | 64 | 0.994 | 0.9149 | 0.3339 | nan | 0.8973 | 0.9734 | | gluon_xception65 | 32 | 0.9955 | 0.8848 | 0.3345 | nan | 0.8967 | 0.9753 | | gluon_inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | 0.6582 | 0.8967 | 1.0255 | | inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | 0.6582 | 0.8967 | 1.0257 | | adv_inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | 0.6582 | 0.8967 | 1.0255 | | hrnet_w18 | 128 | 0.9914 | 0.9175 | 0.3348 | nan | 0.8966 | 1.0033 | | fbnetc_100 | 128 | 0.9783 | 0.8475 | 0.33 | nan | 0.8954 | 0.9865 | | selecsls42b | 128 | 0.9796 | 0.8773 | 0.3532 | 0.7509 | 0.8919 | 0.9903 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 0.8683 | 0.8916 | 0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 0.8697 | 0.8911 | 0.8962 | | spnasnet_100 | 128 | 0.9789 | 0.8799 | 0.3346 | nan | 0.8798 | 0.9821 | | res2net50_14w_8s | 128 | 0.9907 | 0.907 | 0.3231 | 0.7788 | 0.8764 | 0.9737 | | convnext_base | 64 | 1.003 | 0.9263 | nan | nan | 0.8763 | 0.9864 | | res2next50 | 128 | 0.991 | 0.9094 | 0.3201 | 0.7727 | 0.8709 | 0.9666 | | mnasnet_100 | 128 | 0.976 | 0.8703 | 0.335 | nan | 0.8707 | 0.98 | | mixnet_l | 128 | 0.99 | 0.8442 | 0.2715 | nan | 0.8702 | 1.0089 | | gernet_l | 128 | 0.9791 | 0.85 | 0.3444 | 0.7405 | 0.8615 | 0.9861 | | cspdarknet53 | 64 | 0.9914 | 0.8402 | 0.324 | 0.6522 | 0.8603 | 1.0102 | | botnet26t_256 | 128 | 0.9849 | 0.8639 | 0.3308 | nan | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.943 | 0.7564 | 0.3359 | 0.676 | 0.8449 | 0.9433 | | regnety_002 | 128 | 0.9509 | 0.7946 | 0.3398 | 0.5703 | 0.8352 | 1.0081 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | nan | 0.8174 | 1.0976 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | nan | 0.8032 | 
1.0344 | | repvgg_a2 | 128 | 0.9769 | 0.7822 | 0.341 | nan | 0.7908 | 0.9914 | | resmlp_12_224 | 128 | 0.9827 | 0.687 | 0.2373 | 0.6208 | 0.7876 | 0.8011 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 | | sebotnet33ts_256 | 64 | 0.9929 | 0.7076 | 0.3212 | 0.577 | 0.7451 | 0.8294 | | jx_nest_base | 32 | 0.9983 | 0.8927 | nan | nan | 0.6707 | 0.8618 | | eca_halonext26ts | 128 | 0.9885 | 0.7747 | 0.2669 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/iWxChks.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/Won3Vm9.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/IjsDL8J.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.
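The speedup normalization described above can be sketched as follows. This is a minimal illustration and not the benchmark harness itself: the helper names `median_runtime` and `speedup`, the choice of a median over repeated runs, and the two toy workloads standing in for eager and compiled iterations are all assumptions made for the example.

```python
import statistics
import time
from typing import Callable


def median_runtime(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time of `fn` over `repeats` calls, in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds: float, backend_seconds: float) -> float:
    """Normalize against the baseline: values above 1.0 mean the backend is faster."""
    if backend_seconds <= 0:
        raise ValueError("backend runtime must be positive")
    return baseline_seconds / backend_seconds


# Hypothetical stand-ins for one forward+backward iteration in each mode.
def eager_step() -> int:
    return sum(i * i for i in range(20_000))


def compiled_step() -> int:
    return sum(i * i for i in range(10_000))


ratio = speedup(median_runtime(eager_step), median_runtime(compiled_step))
print(f"speedup: {ratio:.2f}x")
```

On a GPU, the timed region would additionally need a device synchronization before each clock read, since CUDA kernels launch asynchronously.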

When measuring performance, compilation latency, and memory footprint reduction, we exclude the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 53/55 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 93%, 51/55 | 100%, 43/43 | 97%, 59/61  |
|     aot_cudagraphs     | 75%, 41/55 | 49%, 21/43  | 38%, 23/61  |
|    nvprims_nvfuser     | 71%, 39/55 |  16%, 7/43  | 49%, 30/61  |
|        inductor        | 87%, 48/55 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 93%, 51/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
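
Each cell above pairs a percentage with a passed/total count. A small sketch of how such a cell could be produced from a list of accuracy statuses. The helper name `format_passrate` is hypothetical, and counting every status that starts with `pass` (including `pass_due_to_skip`) as passing is an assumption.

```python
def format_passrate(statuses):
    """Format a pass rate cell such as '96%, 53/55' from accuracy statuses."""
    if not statuses:
        raise ValueError("need at least one status")
    # Assumption: "pass" and "pass_due_to_skip" both count as passing.
    passed = sum(1 for status in statuses if status.startswith("pass"))
    total = len(statuses)
    return f"{passed / total:.0%}, {passed}/{total}"


# 53 passing models out of 55, as in the eager/torchbench cell.
example = ["pass"] * 53 + ["fail_to_run"] * 2
print(format_passrate(example))  # 96%, 53/55
```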

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.05x    |    1.02x    |    1.00x    |
|    nvprims_nvfuser     |   1.03x    |    1.00x    |    1.15x    |
|        inductor        |   1.42x    |    1.30x    |    1.25x    |
| inductor_no_cudagraphs |   1.24x    |    1.22x    |    1.24x    |
+------------------------+------------+-------------+-------------+
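
For reference, a geometric mean over per-model speedups can be sketched as follows. The helper name `geomean_speedup` is hypothetical. Dropping non-positive entries assumes that a speedup of `0.0` in the per-model tables marks a model that did not run, which this report does not state explicitly.

```python
import statistics


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups, formatted like the table cells."""
    # Assumption: non-positive values mark failed runs and are excluded.
    valid = [s for s in speedups if s > 0]
    if not valid:
        raise ValueError("no valid speedups to aggregate")
    return statistics.geometric_mean(valid)


# Illustrative values only (not taken from the dashboard).
value = geomean_speedup([1.0, 4.0, 0.0])
print(f"{value:.2f}x")  # 2.00x
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown cancel to 1x, which an arithmetic mean would not give.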

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.16    |    2.29     |    1.92     |
|       aot_eager        |    5.89    |    7.59     |    7.04     |
|     aot_cudagraphs     |    7.58    |    16.01    |    13.16    |
|    nvprims_nvfuser     |   75.24    |   135.18    |   186.40    |
|        inductor        |   33.55    |    31.52    |    38.36    |
| inductor_no_cudagraphs |   33.35    |    27.46    |    37.08    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.32x    |
|    nvprims_nvfuser     |   0.81x    |    0.83x    |    0.82x    |
|        inductor        |   0.83x    |    0.72x    |    0.98x    |
| inductor_no_cudagraphs |   0.99x    |    0.97x    |    1.09x    |
+------------------------+------------+-------------+-------------+
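
A sketch of how the ratio can be read, assuming it is defined as native PyTorch peak memory divided by the backend's peak memory. That definition is consistent with "higher is better" but is not spelled out in this report, and the function name is hypothetical.

```python
def compression_ratio(native_peak_bytes, backend_peak_bytes):
    """Peak memory compression ratio: values above 1.0 mean the backend uses less memory."""
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    # Assumed definition: baseline peak divided by backend peak.
    return native_peak_bytes / backend_peak_bytes


# Illustrative values only: a backend that needs 8 GB where native PyTorch needs 10 GB.
GB = 1024 ** 3
print(compression_ratio(10 * GB, 8 * GB))  # 1.25
```

Under this reading, a value such as 0.39x means the backend's peak footprint is roughly 2.5 times that of native PyTorch.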

Metrics over time

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/LrfGTYe.png)

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/F2AKbNT.png)

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0051 | 1.0257 | 1.762 | 0.7197 | 4.1133 | 1.373 | | timm_efficientdet | 1 | 0.9795 | 0.8843 | 0.0 | 0.0 | 3.7369 | 1.6611 | | functorch_dp_cifar10 | 64 | 1.0028 | 1.0388 | 1.454 | 0.0 | 2.9396 | 1.3458 | | timm_vision_transformer | 8 | 1.0033 | 0.9322 | 1.5285 | 0.6812 | 2.6093 | 1.3956 | | BERT_pytorch | 16 | 1.0149 | 0.8804 | 0.0 | 1.0141 | 2.1023 | 2.1419 | | drq | 1 | 0.9994 | 0.8273 | 1.2866 | 0.6647 | 1.9275 | 1.079 | | pytorch_struct | 200 | 0.9921 | 0.7611 | 0.8823 | 0.8238 | 1.8105 | 1.1601 | | lennard_jones | 1000 | 0.9653 | 0.8509 | 1.0251 | 0.7029 | 1.7648 | 0.9424 | | mobilenet_v3_large | 32 | 1.0057 | 1.1265 | 0.8837 | 0.871 | 1.7304 | 1.4176 | | hf_T5_large | 2 | 1.026 | 0.9064 | 0.0 | 0.0 | 1.7089 | 1.6733 | | hf_Albert | 8 | 1.0009 | 0.9976 | 0.753 | 0.0 | 1.6517 | 1.6404 | | resnext50_32x4d | 8 | 1.0029 | 1.1688 | 0.9174 | 0.768 | 1.5663 | 1.2995 | | speech_transformer | 32 | 1.0063 | 0.9123 | 1.5214 | 0.0 | 1.5225 | 1.5697 | | timm_resnest | 32 | 0.9994 | 1.0029 | 0.8051 | 1.1768 | 1.5161 | 1.4527 | | hf_GPT2 | 4 | 1.0065 | 0.9806 | 0.7367 | 0.4062 | 1.4996 | 1.505 | | timm_nfnet | 128 | 0.9998 | 0.9994 | 0.0 | 0.0 | 1.4756 | 1.4245 | | shufflenet_v2_x1_0 | 128 | 0.9997 | 1.0005 | 0.7743 | 0.8991 | 1.4723 | 1.4004 | | squeezenet1_1 | 32 | 0.9911 | 1.0113 | 0.8251 | 0.8143 | 1.4565 | 1.3032 | | mobilenet_v2_quantized_qat | 96 | 1.0015 | 0.9803 | 0.0 | 1.4491 | 1.4343 | 1.4285 | | mobilenet_v2 | 96 | 0.9998 | 0.9996 | 0.7301 | 1.2771 | 1.4287 | 1.4098 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0019 | 1.0824 | 1.008 | 0.0 | 1.4258 | 
1.3468 | | fastNLP_Bert | 6 | 0.9989 | 0.9766 | 0.7527 | 0.8384 | 1.4214 | 1.3945 | | resnet18 | 16 | 1.005 | 1.1074 | 0.9082 | 0.9053 | 1.4042 | 1.2547 | | resnet50_quantized_qat | 32 | 1.0012 | 0.9718 | 0.0 | 1.2311 | 1.3867 | 1.3853 | | mnasnet1_0 | 32 | 1.0006 | 1.0596 | 0.8078 | 0.9643 | 1.3813 | 1.2974 | | soft_actor_critic | 256 | 0.9794 | 0.804 | 1.0578 | 0.6722 | 1.38 | 0.9229 | | dcgan | 32 | 0.9819 | 1.0142 | 1.0054 | 0.8059 | 1.3361 | 1.0622 | | pytorch_stargan | 16 | 0.999 | 1.0833 | 0.9259 | 0.0 | 1.2686 | 1.2457 | | hf_Bert | 4 | 1.0333 | 0.998 | 0.7327 | 0.7397 | 1.2107 | 1.1948 | | hf_Bart | 4 | 1.0118 | 0.9737 | 0.0 | 0.7751 | 1.2099 | 1.2052 | | LearningToPaint | 96 | 0.9993 | 1.0016 | 0.8117 | 1.022 | 1.2093 | 1.1741 | | resnet50 | 32 | 0.9992 | 0.9929 | 0.7583 | 1.1008 | 1.2041 | 1.1682 | | pytorch_unet | 1 | 0.9996 | 0.9981 | 0.8437 | 1.0962 | 1.1987 | 1.1862 | | Super_SloMo | 6 | 0.9993 | 0.9976 | 0.8642 | 0.0 | 1.1813 | 1.1661 | | hf_DistilBert | 8 | 0.9998 | 0.9573 | 0.6873 | 0.4387 | 1.1766 | 1.1835 | | vgg16 | 64 | 0.9998 | 0.9992 | 0.8588 | 0.9981 | 1.1728 | 1.1659 | | timm_efficientnet | 32 | 0.9498 | 0.8055 | 0.6447 | 0.8589 | 1.1722 | 1.1688 | | alexnet | 128 | 0.9993 | 0.9981 | 0.8038 | 1.0026 | 1.1624 | 1.1639 | | timm_regnet | 32 | 0.9656 | 0.9605 | 0.7789 | 1.1005 | 1.1212 | 1.0917 | | Background_Matting | 4 | 1.0 | 1.0234 | 0.8632 | 1.0785 | 1.118 | 1.1082 | | hf_Reformer | 4 | 0.9961 | 0.0 | 0.9194 | 0.0 | 1.1053 | 1.1288 | | hf_BigBird | 2 | 0.9984 | 0.9376 | 0.9648 | 0.9146 | 1.0982 | 1.009 | | yolov3 | 16 | 0.9997 | 0.9948 | 0.7932 | 1.1934 | 1.0928 | 1.0788 | | attention_is_all_you_need_pytorch | 256 | 0.9999 | 0.9704 | 0.0 | 0.7032 | 1.0646 | 1.0508 | | timm_vision_transformer_large | 8 | 1.0 | 0.9921 | 0.0 | 0.0 | 1.0447 | 1.0341 | | timm_vovnet | 32 | 0.9101 | 0.9033 | 0.7125 | 0.9701 | 1.0058 | 1.0159 | | tts_angular | 64 | 0.9893 | 0.9562 | 0.9871 | 0.9685 | 1.0056 | 1.0085 | | demucs | 4 | 1.0001 | 0.9997 | 0.9998 | 
0.9998 | 1.0002 | 1.0002 | | dlrm | 2048 | 0.0 | 0.0 | 0.0 | 0.0 | 0.9235 | 1.1025 | | nvidia_deeprecommender | 256 | 0.9992 | 0.9629 | 0.585 | 0.9576 | 0.9036 | 0.9633 | | hf_GPT2_large | 4 | 1.0002 | 0.9806 | 0.0 | 0.0 | 0.0 | 1.4749 | | hf_T5 | 8 | 1.0014 | 0.9573 | 0.0 | 0.9361 | 0.0 | 1.5743 | | tacotron2 | 64 | 0.9756 | 0.8406 | 0.0 | 0.7556 | 0.0 | 0.9014 | | hf_Longformer | 2 | 0.9623 | 0.8926 | 0.8152 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass 
| pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | pass | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | 
pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.8921 | 7.009 | 9.925 | 128.2074 | 374.1903 | 372.0506 | | timm_efficientdet | 1 | 19.6712 | 33.4394 | nan | nan | 167.9644 | 166.2529 | | hf_T5_large | 2 | 13.5789 | 34.3861 | nan | nan | 
113.5835 | 111.4109 | | timm_resnest | 32 | 0.5524 | 2.0308 | 3.1178 | 64.1274 | 65.7223 | 65.6198 | | timm_vision_transformer_large | 8 | 2.2942 | 11.0553 | nan | nan | 63.9412 | 59.6169 | | attention_is_all_you_need_pytorch | 256 | 1.119 | 5.608 | nan | 171.5658 | 54.372 | 53.5025 | | timm_vision_transformer | 8 | 0.7704 | 3.3569 | 4.912 | 96.5871 | 49.8705 | 48.9407 | | pytorch_stargan | 16 | 0.3678 | 1.7238 | 2.4991 | nan | 48.7586 | 48.6911 | | densenet121 | 4 | 2.0491 | 9.863 | 16.1864 | 206.0272 | 41.6129 | 41.776 | | pytorch_struct | 200 | 0.2406 | 0.6367 | 1.1657 | 5.8414 | 41.1845 | 43.9808 | | hf_BigBird | 2 | 7.5235 | 12.8009 | 25.4229 | 108.4722 | 38.7749 | 25.3367 | | BERT_pytorch | 16 | 1.4317 | 6.0242 | nan | 138.153 | 32.5406 | 32.2526 | | hf_Bart | 4 | 1.4302 | 6.3526 | nan | 163.6186 | 30.9785 | 29.387 | | resnet50_quantized_qat | 32 | 1.1419 | 7.0872 | nan | 212.6506 | 29.4536 | 29.6701 | | mobilenet_v3_large | 32 | 0.8393 | 3.851 | 5.7433 | 106.7977 | 28.4509 | 26.2181 | | timm_nfnet | 128 | 1.9766 | 6.2005 | nan | nan | 27.6868 | 27.0122 | | hf_Reformer | 4 | 2.422 | nan | 8.2301 | nan | 26.9533 | 21.7639 | | fastNLP_Bert | 6 | 1.4318 | 5.4789 | 8.7401 | 121.9383 | 26.7562 | 25.5923 | | mobilenet_v2_quantized_qat | 96 | 1.2323 | 7.3476 | nan | 237.4632 | 26.2904 | 26.4828 | | speech_transformer | 32 | 1.5816 | 6.8124 | 46.4135 | nan | 26.058 | 24.9664 | | timm_regnet | 32 | 2.1612 | 6.5408 | 16.8039 | 131.6883 | 25.5366 | 24.7257 | | timm_efficientnet | 32 | 1.6679 | 5.6358 | 13.8672 | 117.7449 | 24.9837 | 24.8804 | | timm_vovnet | 32 | 1.4484 | 3.7756 | 8.9111 | 58.4971 | 21.2661 | 19.1387 | | mnasnet1_0 | 32 | 0.7559 | 3.5764 | 5.1921 | 81.2325 | 20.5771 | 19.2914 | | hf_Albert | 8 | 1.0336 | 4.4307 | 7.138 | nan | 20.3026 | 19.0816 | | resnext50_32x4d | 8 | 0.8423 | 3.7107 | 5.4386 | 76.5798 | 20.1778 | 19.8719 | | resnet50 | 32 | 0.8135 | 3.6407 | 5.52 | 87.2594 | 19.9074 | 19.582 | | hf_Bert | 4 | 1.3772 | 5.2642 | 7.5291 | 124.5177 | 
18.4456 | 18.2045 | | hf_GPT2 | 4 | 1.2955 | 4.7392 | 7.3774 | 91.4411 | 18.3432 | 17.5537 | | shufflenet_v2_x1_0 | 128 | 0.9004 | 4.0807 | 6.3819 | 96.6456 | 17.3377 | 16.5141 | | Super_SloMo | 6 | 1.0132 | 3.967 | 5.6015 | nan | 16.3964 | 15.8332 | | mobilenet_v2 | 96 | 0.7562 | 3.7283 | 5.9876 | 110.7722 | 16.1183 | 15.8314 | | Background_Matting | 4 | 0.8383 | 3.6888 | 5.7101 | 78.988 | 15.922 | 15.673 | | functorch_dp_cifar10 | 64 | 0.3456 | 1.3503 | 2.0159 | nan | 15.5325 | 15.7101 | | resnet18 | 16 | 0.398 | 1.4545 | 2.1588 | 31.9469 | 13.9773 | 13.68 | | hf_DistilBert | 8 | 0.4595 | 2.3674 | 5.1595 | 54.8469 | 13.0335 | 12.9556 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.362 | 1.5436 | 2.2651 | nan | 8.0512 | 7.7226 | | pytorch_unet | 1 | 0.3986 | 1.5992 | 2.4429 | 32.4526 | 7.734 | 7.5752 | | LearningToPaint | 96 | 0.4134 | 1.5376 | 2.5093 | 42.353 | 6.8483 | 6.344 | | squeezenet1_1 | 32 | 0.2183 | 0.634 | 1.0162 | 4.5223 | 3.9346 | 3.6511 | | drq | 1 | 0.1412 | 0.3431 | 0.6796 | 4.3473 | 3.6651 | 3.0802 | | nvidia_deeprecommender | 256 | 0.1941 | 0.3757 | 0.628 | 7.4875 | 3.4279 | 3.1195 | | vgg16 | 64 | 0.1839 | 0.4678 | 0.8703 | 3.8535 | 3.3769 | 3.0892 | | dlrm | 2048 | nan | nan | nan | nan | 3.2296 | 2.814 | | soft_actor_critic | 256 | 0.1961 | 0.2933 | 0.5132 | 2.6346 | 3.1517 | 2.8479 | | alexnet | 128 | 0.1478 | 0.3268 | 0.5674 | 3.8485 | 2.8473 | 2.7358 | | dcgan | 32 | 0.1718 | 0.3607 | 0.5702 | 4.2716 | 2.6186 | 2.3694 | | tts_angular | 64 | 0.2092 | 0.248 | 0.3777 | 1.038 | 1.9434 | 1.7585 | | lennard_jones | 1000 | 0.1375 | 0.243 | 0.391 | 2.1615 | 1.9293 | 1.7076 | | demucs | 4 | 0.314 | 0.3106 | 0.3074 | 0.306 | 0.2302 | 0.2188 | | tacotron2 | 64 | 17.3201 | 30.8152 | nan | 63.4967 | nan | 68.7967 | | hf_GPT2_large | 4 | 4.9307 | 15.4915 | nan | nan | nan | 42.3005 | | hf_T5 | 8 | 2.2472 | 7.5075 | nan | 95.3407 | nan | 27.6564 | | hf_Longformer | 2 | 6.0125 | 12.8019 | 54.6968 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan 
| +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9961 | 0.8279 | nan | 0.8271 | 1.5828 | 1.5828 | | resnet50_quantized_qat | 32 | 0.9971 | 0.9148 | nan | 0.8498 | 1.4863 | 1.4864 | | timm_efficientnet | 32 | 0.9932 | 0.7665 | 0.2635 | 0.771 | 1.3111 | 1.3958 | | Super_SloMo | 6 | 1.0023 | 0.9526 | 0.363 | nan | 1.2029 | 1.4002 | | mobilenet_v2 | 96 | 0.9923 | 0.7624 | 0.3061 | 0.7641 | 1.1741 | 1.2826 | | timm_efficientdet | 1 | 1.0104 | 0.8221 | nan | nan | 1.1174 | 1.1439 | | squeezenet1_1 | 32 | 0.9781 | 0.8163 | 0.3372 | 0.8132 | 1.0821 | 1.1897 | | speech_transformer | 32 | 0.9982 | 0.9159 | 0.2704 | nan | 1.0396 | 1.0448 | | timm_nfnet | 128 | 0.9358 | 0.8937 | nan | nan | 1.0221 | 1.096 | | demucs | 4 | 0.9884 | 0.9884 | 0.9884 | 0.9888 | 0.9884 | 0.9884 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | Background_Matting | 4 | 0.9989 | 0.9483 | 0.3594 | 0.9327 | 0.9822 | 1.0384 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8142 | 0.9812 | 1.0425 | | hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3701 | 0.8845 | 0.9703 | 1.1374 | | timm_regnet | 32 | 0.9984 | 0.8586 | 0.3317 | 0.8055 | 0.9375 | 1.0799 | | yolov3 | 16 | 0.9893 | 0.8384 | 0.3319 | 0.8042 | 0.9175 | 1.098 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9976 | 0.9118 | 0.3918 | nan | 0.9166 | 1.0148 | | pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8521 | 0.9118 | 1.105 | | pytorch_stargan | 16 | 0.9966 | 1.009 | 0.4109 | nan | 0.9023 | 
1.0694 | | timm_resnest | 32 | 0.9926 | 0.8759 | 0.3223 | 0.7295 | 0.8947 | 0.9967 | | hf_Albert | 8 | 0.9333 | 0.9333 | 0.2846 | nan | 0.8836 | 1.2215 | | mobilenet_v3_large | 32 | 0.9876 | 0.856 | 0.3277 | 0.7754 | 0.8832 | 0.8974 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8737 | 0.922 | | timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 | | densenet121 | 4 | 1.0 | 0.8879 | 0.3464 | 0.8612 | 0.8616 | 1.006 | | resnet50 | 32 | 0.9945 | 0.8704 | 0.3364 | 0.7953 | 0.8552 | 0.9335 | | mnasnet1_0 | 32 | 0.9878 | 0.8992 | 0.3333 | 0.8256 | 0.8532 | 0.8671 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 0.8669 | 0.8354 | 1.1229 | | hf_Bart | 4 | 0.9618 | 0.8779 | nan | 0.8322 | 0.8325 | 1.1284 | | resnext50_32x4d | 8 | 0.9961 | 0.8679 | 0.3584 | 0.8198 | 0.8278 | 0.8346 | | BERT_pytorch | 16 | 1.0 | 0.8995 | nan | 0.8503 | 0.826 | 1.0815 | | hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4302 | 0.9629 | 0.8211 | 1.0392 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.8875 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8756 | 0.7632 | 0.8778 | | timm_vovnet | 32 | 0.9946 | 0.7591 | 0.3201 | 0.7584 | 0.7591 | 0.9501 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3307 | 0.8714 | 0.7507 | 0.8214 | | soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9304 | 0.75 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7451 | 0.743 | 0.8335 | | hf_Bert | 4 | 0.9683 | 0.9011 | 0.3525 | 0.857 | 0.7061 | 1.0016 | | dlrm | 2048 | nan | nan | nan | nan | 0.7035 | 0.7307 | | LearningToPaint | 96 | 0.9455 | 0.6929 | 0.3408 | 0.627 | 0.6945 | 0.9371 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6949 | 0.6902 | 0.7049 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6638 | 0.6637 | 0.9553 | | hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3212 | 0.8674 | 0.6595 | 0.9466 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | 
attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | 0.7128 | 0.4867 | 0.6781 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.4335 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4056 | 0.4214 | | hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.3181 | 0.9882 | | tacotron2 | 64 | 0.9906 | 1.0302 | nan | 0.7898 | nan | 1.1621 | | hf_T5 | 8 | 0.9527 | 0.9415 | nan | 0.8195 | nan | 1.1507 | | hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 | | hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2946 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

huggingface suite with float32 precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | YituTechConvBert | 1 | 1.0276 | 0.9051 | 0.0 | 0.0 | 3.7664 | 1.4503 | | CamemBert | 1 | 1.0482 | 0.9426 | 1.3242 | 0.0 | 2.389 | 1.5252 | | MT5ForConditionalGeneration | 8 | 0.9992 | 0.9277 | 0.0 | 1.0209 | 2.3011 | 1.9896 | | DistillGPT2 | 1 | 1.0314 | 0.9311 | 1.1316 | 0.2898 | 2.0526 | 1.8333 | | MobileBertForMaskedLM | 32 | 1.0221 | 0.9316 | 0.0 | 0.8036 | 2.0263 | 1.548 | | GoogleFnet | 1 | 0.9923 | 0.8116 | 0.9816 | 0.0 | 1.8976 | 1.1103 | | GPT2ForSequenceClassification | 4 | 1.0002 | 0.977 | 0.0 | 0.6573 | 1.7952 | 1.782 | | M2M100ForConditionalGeneration | 8 | 1.0398 | 0.9424 | 0.8933 | 0.0 | 1.4962 | 1.3808 | | T5ForConditionalGeneration | 4 | 1.0012 | 0.9344 | 0.0 | 0.9077 | 1.4508 | 1.4457 | | ElectraForQuestionAnswering | 64 | 1.0001 | 0.9846 | 0.0 | 0.631 | 1.4272 | 1.4065 | | MobileBertForQuestionAnswering | 64 | 1.0176 | 0.9329 | 0.0 | 0.6697 | 1.4167 | 1.344 | | ElectraForCausalLM | 32 | 1.0005 | 0.9308 | 0.0 | 0.5019 | 1.4112 | 1.4521 | | LayoutLMForSequenceClassification | 16 | 0.9997 | 0.9891 | 0.7373 | 0.6842 | 1.3091 | 1.2894 | | T5Small | 1 | 1.0223 | 0.9342 | 0.0 | 0.0 | 1.301 | 1.1472 | | AlbertForQuestionAnswering | 4 | 1.0005 | 1.0014 | 0.0 | 0.0 | 1.2594 | 1.2542 | | AlbertForMaskedLM | 4 | 1.0001 | 1.0001 | 0.0 | 0.0 | 1.2533 | 1.2512 | | PLBartForConditionalGeneration | 16 | 1.0124 | 0.9696 | 0.0 | 0.7256 | 1.2176 | 1.211 | | LayoutLMForMaskedLM | 16 | 1.0003 | 0.9614 | 0.0 | 0.6355 | 1.2066 | 1.216 | | OPTForCausalLM | 32 | 1.0065 | 0.9328 | 0.0 | 0.3981 | 1.1867 | 1.2017 | | XGLMForCausalLM | 8 | 1.0138 | 0.9459 | 0.7378 | 
0.3053 | 1.1803 | 1.1931 | | MegatronBertForQuestionAnswering | 16 | 1.0392 | 1.0154 | 0.757 | 0.0 | 1.1705 | 1.1234 | | DistilBertForQuestionAnswering | 64 | 0.9999 | 0.9837 | 0.7124 | 0.3979 | 1.1703 | 1.1507 | | RobertaForCausalLM | 64 | 1.0005 | 0.9638 | 0.745 | 0.5515 | 1.1442 | 1.1534 | | MegatronBertForCausalLM | 16 | 1.036 | 1.0082 | 0.7459 | 0.0 | 1.1368 | 1.1127 | | Speech2Text2ForCausalLM | 128 | 0.998 | 0.9272 | 0.6612 | 0.5445 | 1.1345 | 1.1513 | | RobertaForQuestionAnswering | 128 | 1.0008 | 0.9928 | 0.0 | 0.5658 | 1.1196 | 1.1123 | | BertForQuestionAnswering | 128 | 1.0003 | 0.9945 | 0.0 | 0.565 | 1.1131 | 1.1067 | | MBartForConditionalGeneration | 16 | 1.0033 | 0.9853 | 0.0 | 0.0 | 1.1008 | 1.0861 | | BartForCausalLM | 4 | 1.0007 | 0.9659 | 0.0 | 0.6752 | 1.1004 | 1.1093 | | BartForConditionalGeneration | 2 | 1.0001 | 0.9888 | 0.0 | 0.3729 | 1.0983 | 1.0889 | | PegasusForConditionalGeneration | 16 | 1.0099 | 0.9803 | 0.7648 | 0.0 | 1.0927 | 1.0815 | | BigBird | 1 | 0.9943 | 0.9363 | 0.9958 | 0.0 | 1.0914 | 0.9932 | | DebertaForMaskedLM | 4 | 0.9285 | 0.8087 | 0.7342 | 0.5919 | 1.0827 | 1.073 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0006 | 0.9411 | 0.0 | 0.6393 | 1.0636 | 1.0929 | | BertForMaskedLM | 64 | 1.0005 | 0.9612 | 0.7303 | 0.5539 | 1.0576 | 1.0579 | | DistilBertForMaskedLM | 64 | 1.0 | 0.9515 | 0.7131 | 0.4254 | 1.0502 | 1.068 | | DebertaForQuestionAnswering | 8 | 0.996 | 0.9634 | 0.684 | 0.8009 | 1.0493 | 1.2204 | | PLBartForCausalLM | 32 | 1.0056 | 0.9338 | 0.7162 | 0.7087 | 1.0307 | 1.0557 | | BlenderbotSmallForCausalLM | 64 | 1.0009 | 0.9106 | 0.6832 | 0.6619 | 1.0068 | 1.0379 | | TrOCRForCausalLM | 32 | 1.0005 | 0.9556 | 0.0 | 0.663 | 1.0043 | 1.0144 | | MBartForCausalLM | 32 | 1.002 | 0.9547 | 0.0 | 0.0 | 1.0009 | 1.0094 | | PegasusForCausalLM | 32 | 0.9994 | 0.9534 | 0.7325 | 0.0 | 0.9929 | 1.0029 | | AllenaiLongformerBase | 1 | 0.9428 | 0.8539 | 0.7836 | 0.0 | 0.0 | 0.0 | 
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | 
TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass | | OPTForCausalLM | 1 | pass | pass | fail_to_run | pass | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass | | CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass | | AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | MBartForConditionalGeneration | 1 | 
pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaForQuestionAnswering | 8 | 4.6164 | 10.7286 | 33.9933 | 87.3468 | 92.1654 | 35.9062 | | XGLMForCausalLM | 8 | 2.3309 | 9.8245 | 35.6619 | 253.6322 | 90.7225 | 87.7138 | | DebertaForMaskedLM | 4 | 4.6641 | 10.0146 | 34.101 | 94.2882 | 89.69 | 33.1934 | | M2M100ForConditionalGeneration | 8 | 2.8058 | 11.5704 | 20.0576 | nan | 69.9987 | 62.9836 | | MobileBertForMaskedLM | 32 | 8.2487 | 23.2498 | nan | 362.6216 | 50.542 | 49.9534 | | MobileBertForQuestionAnswering | 64 | 8.2935 | 23.2296 | nan | 370.4161 | 49.0697 | 48.1142 | | YituTechConvBert | 1 | 2.1908 | 8.12 | nan | nan | 47.0889 | 49.1893 | | PegasusForConditionalGeneration | 16 | 2.6354 | 11.7685 | 20.0877 | nan | 43.1633 | 39.9026 | | BartForConditionalGeneration | 2 | 2.8745 | 12.5802 | nan | 332.9029 | 42.4109 | 41.3758 | | MBartForConditionalGeneration | 16 | 2.8812 | 12.2752 | nan | nan | 42.0362 | 40.3121 | | BigBird | 1 | 7.4381 | 13.0007 | 25.4388 | nan | 37.4992 | 25.6524 | | MT5ForConditionalGeneration | 8 | 3.604 | 10.9514 | nan | 147.7995 | 35.106 | 34.4865 | | MegatronBertForCausalLM | 16 | 3.1682 | 10.6627 | 17.241 | nan | 33.2493 | 32.1536 | | T5Small | 1 | 2.2578 | 7.5715 | nan | nan | 33.1038 | 32.7085 | | LayoutLMForSequenceClassification | 16 | 1.7659 | 
5.6378 | 8.6782 | 127.3332 | 33.0745 | 32.3503 | | MegatronBertForQuestionAnswering | 16 | 3.0035 | 10.7079 | 16.4308 | nan | 32.3094 | 31.3369 | | BlenderbotSmallForConditionalGeneration | 64 | 1.7378 | 8.1237 | nan | 221.5272 | 30.9988 | 30.8006 | | T5ForConditionalGeneration | 4 | 2.2056 | 7.611 | nan | 95.974 | 30.7195 | 30.1238 | | PLBartForConditionalGeneration | 16 | 1.4324 | 6.4116 | nan | 170.7059 | 28.0564 | 27.4134 | | ElectraForCausalLM | 32 | 1.3444 | 5.2427 | nan | 127.3306 | 27.5988 | 25.5252 | | PegasusForCausalLM | 32 | 1.0436 | 4.7352 | 7.6663 | nan | 22.0582 | 21.1606 | | LayoutLMForMaskedLM | 16 | 1.7436 | 5.8416 | nan | 131.3744 | 21.4164 | 20.5963 | | MBartForCausalLM | 32 | 1.0054 | 4.8696 | nan | nan | 21.2667 | 20.605 | | GoogleFnet | 1 | 0.8096 | 2.7434 | 9.2698 | nan | 20.806 | 14.3773 | | BertForMaskedLM | 64 | 1.3109 | 5.1952 | 7.9355 | 123.8701 | 20.4912 | 20.4575 | | ElectraForQuestionAnswering | 64 | 1.3276 | 5.2232 | nan | 129.1294 | 20.3727 | 19.8754 | | TrOCRForCausalLM | 32 | 0.9864 | 4.4775 | nan | 122.7422 | 20.0524 | 19.6781 | | BartForCausalLM | 4 | 1.0483 | 4.5279 | nan | 124.5731 | 19.6174 | 18.9766 | | BertForQuestionAnswering | 128 | 1.381 | 5.5547 | nan | 125.5889 | 19.3494 | 18.7848 | | RobertaForCausalLM | 64 | 1.3606 | 5.4104 | 7.9953 | 132.3781 | 18.9703 | 18.3936 | | CamemBert | 1 | 1.4309 | 5.3349 | 7.5094 | nan | 18.0966 | 17.8193 | | RobertaForQuestionAnswering | 128 | 1.379 | 5.2427 | nan | 126.3528 | 17.9519 | 17.5563 | | OPTForCausalLM | 32 | 1.0534 | 4.6435 | nan | 108.3475 | 17.6168 | 17.2022 | | GPT2ForSequenceClassification | 4 | 1.319 | 4.9408 | nan | 94.5868 | 16.8628 | 16.4419 | | AlbertForMaskedLM | 4 | 1.1525 | 4.5566 | nan | nan | 16.0738 | 15.1475 | | BlenderbotSmallForCausalLM | 64 | 0.6523 | 3.4677 | 4.9373 | 82.7236 | 16.0705 | 15.9198 | | DistillGPT2 | 1 | 0.6606 | 2.5646 | 3.7533 | 47.9127 | 15.9646 | 15.8652 | | AlbertForQuestionAnswering | 4 | 1.1221 | 4.5907 | nan | nan | 15.5865 | 14.9401 | 
| Speech2Text2ForCausalLM | 128 | 0.5733 | 2.4828 | 4.0471 | 57.5394 | 15.1313 | 14.1446 | | PLBartForCausalLM | 32 | 0.5111 | 2.5101 | 3.7663 | 66.858 | 14.411 | 13.9633 | | DistilBertForMaskedLM | 64 | 0.4636 | 2.557 | 5.254 | 55.2646 | 12.3871 | 11.9258 | | DistilBertForQuestionAnswering | 64 | 0.4833 | 2.5897 | 5.6586 | 53.6268 | 11.6215 | 11.2366 | | AllenaiLongformerBase | 1 | 6.1626 | 13.2049 | 56.7362 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 0.8819 | 1.0595 | 1.1224 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | nan | 0.8646 | 1.4039 | | T5Small | 1 | 1.0 | 0.9029 | nan | nan | 0.8453 | 1.0606 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | nan | 0.8436 | 1.0204 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | nan | 0.842 | 1.3737 | | BigBird | 1 | 0.999 | 0.9542 | 0.4215 | nan | 0.8224 | 1.0095 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | 0.8577 | 0.8215 | 1.1049 | | XGLMForCausalLM | 8 | 0.9848 | 0.9267 | 0.3971 | 0.9267 | 0.8157 | 0.9642 | | DistillGPT2 | 1 | 0.9984 | 0.8115 | 0.377 | 0.7597 | 0.8063 | 0.9258 | | ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.7039 | 0.7929 | 0.9036 | | YituTechConvBert | 1 | 0.9858 | 0.8581 | nan | nan | 0.7895 | 0.8729 | | PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | nan | 0.7774 | 0.9692 | | BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.8866 | 0.7734 
| 0.9515 | | GoogleFnet | 1 | 0.9983 | 0.9443 | 0.3714 | nan | 0.7698 | 0.9373 | | MT5ForConditionalGeneration | 8 | 1.0034 | 0.8867 | nan | 0.8712 | 0.7627 | 0.9397 | | M2M100ForConditionalGeneration | 8 | 0.9754 | 0.9382 | 0.3799 | nan | 0.7562 | 1.0178 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | nan | 0.7528 | 0.9646 | | CamemBert | 1 | 0.998 | 0.8252 | 0.3613 | nan | 0.7487 | 0.9186 | | PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.8428 | 0.7381 | 0.9055 | | PLBartForConditionalGeneration | 16 | 1.0 | 0.8957 | nan | 0.8416 | 0.724 | 0.9375 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | nan | nan | 0.7209 | 0.9059 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 0.8653 | 0.7189 | 1.0294 | | MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | nan | 0.7161 | 0.9247 | | BartForCausalLM | 4 | 1.0 | 0.9121 | nan | 0.8553 | 0.7149 | 0.9466 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.8217 | 0.7147 | 0.8647 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 0.8762 | 0.7054 | 1.0298 | | DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3177 | 0.8865 | 0.6981 | 0.9303 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 0.8551 | 0.6977 | 0.946 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.7756 | 0.695 | 0.9772 | | MBartForCausalLM | 32 | 0.9999 | 0.89 | nan | nan | 0.6836 | 0.8978 | | TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | nan | 0.8587 | 0.6827 | 0.8876 | | Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.737 | 0.6775 | 0.9179 | | OPTForCausalLM | 32 | 0.9982 | 0.8655 | nan | 0.7894 | 0.6761 | 0.8847 | | DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.669 | 0.6531 | 0.9124 | | BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.7194 | 0.6385 | 0.8992 | | RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3642 | 0.7649 | 0.6375 | 0.8974 | | RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.863 | 0.6329 | 0.8939 | | 
BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.863 | 0.6329 | 0.8939 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | nan | 0.6092 | 0.5256 | 0.7111 | | MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | nan | 0.6675 | 0.4536 | 0.5968 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | 0.8282 | 0.3862 | 1.0347 | | DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.063 | 0.2902 | 1.1588 | | AllenaiLongformerBase | 1 | 0.9981 | 0.9515 | 0.3209 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | ghostnet_100 | 128 | 0.9998 | 0.9736 | 0.8276 | 0.0 | 1.8706 | 1.8308 | | lcnet_050 | 128 | 0.9564 | 0.9494 | 0.7695 | 1.4146 | 1.6577 | 1.695 | | convnext_base | 64 | 0.9994 | 0.999 | 0.0 | 1.3699 | 1.5122 | 1.5057 | | dm_nfnet_f0 | 128 | 0.9996 | 1.0006 | 0.0 | 1.2425 | 1.4737 | 1.4209 | | hrnet_w18 | 128 | 1.0 | 0.9995 | 0.0 | 1.3149 | 1.4174 | 1.3781 | | volo_d1_224 | 64 | 0.9998 | 0.9956 | 0.0 | 0.0 | 1.3871 | 1.3664 | | dla102 | 128 | 0.9999 | 1.0009 | 0.0 | 0.0 | 1.38 | 1.3689 | | nfnet_l0 | 128 | 0.9998 | 0.7893 | 0.0 | 0.0 | 1.373 | 1.3281 | | xcit_large_24_p8_224 | 5 | 0.9995 | 0.972 | 0.0 | 0.0 | 1.3553 | 1.3154 | | res2net50_14w_8s | 128 | 0.9999 | 0.9999 | 0.0 | 1.2412 | 1.3545 | 1.3235 | | mobilenetv2_100 | 128 | 0.9669 | 0.9629 | 0.7061 | 1.2309 | 1.3384 | 1.3473 | | mobilenetv3_large_100 | 128 | 0.9665 | 0.9624 | 0.7635 | 1.2985 | 1.3346 | 1.3476 | | adv_inception_v3 | 128 | 1.0 | 0.9983 | 0.0 | 1.1276 | 1.3268 | 1.3073 | | gluon_inception_v3 | 128 | 1.0001 | 0.9989 | 0.0 | 1.1282 | 1.3267 | 1.3072 | | inception_v3 | 128 | 0.9999 | 0.9986 | 0.0 | 1.1277 | 1.3264 | 1.3057 | | crossvit_9_240 | 128 | 0.9997 | 0.9989 | 0.0 | 0.0 | 1.322 | 1.2956 | | resnest101e | 64 | 0.9998 | 1.0035 | 0.0 | 0.0 | 1.3158 | 1.2717 | | regnety_002 | 128 | 0.9488 | 0.943 | 0.7561 | 1.0459 | 1.3121 | 1.3174 | | res2next50 | 128 | 0.9998 | 1.0015 | 0.0 | 1.172 | 1.3097 | 1.2715 | | fbnetv3_b | 128 | 0.9646 | 0.9617 | 0.7603 | 0.0 | 1.2845 | 1.2969 | | jx_nest_base | 32 | 0.9998 | 0.9953 | 0.0 | 0.0 | 1.2773 | 1.2509 | | coat_lite_mini | 128 | 1.0 | 0.9858 | 0.8522 | 0.4082 | 1.277 | 
1.2668 | | eca_botnext26ts_256 | 128 | 0.9872 | 0.7719 | 0.0 | 0.0 | 1.2682 | 1.2516 | | sebotnet33ts_256 | 64 | 0.975 | 0.8041 | 0.0 | 0.0 | 1.2669 | 1.2689 | | selecsls42b | 128 | 0.9999 | 1.0003 | 0.8155 | 1.2122 | 1.2667 | 1.2524 | | mnasnet_100 | 128 | 0.9661 | 0.9626 | 0.7853 | 1.248 | 1.2651 | 1.2818 | | tf_efficientnet_b0 | 128 | 0.9768 | 0.7837 | 0.0 | 1.2167 | 1.2606 | 1.2671 | | eca_halonext26ts | 128 | 0.987 | 0.7784 | 0.0 | 0.0 | 1.2599 | 1.2444 | | botnet26t_256 | 128 | 0.9857 | 0.982 | 0.7873 | 0.0 | 1.2559 | 1.2587 | | fbnetc_100 | 128 | 0.966 | 0.9611 | 0.7896 | 0.0 | 1.2484 | 1.2644 | | ese_vovnet19b_dw | 128 | 0.9785 | 0.9779 | 0.7434 | 1.1619 | 1.2417 | 1.2476 | | gmixer_24_224 | 128 | 0.9999 | 0.81 | 0.0 | 0.0 | 1.2409 | 1.2325 | | spnasnet_100 | 128 | 0.9592 | 0.9584 | 0.7717 | 1.2205 | 1.2377 | 1.2534 | | convit_base | 64 | 0.9995 | 0.9989 | 0.0 | 0.8582 | 1.2291 | 1.1865 | | res2net101_26w_4s | 64 | 0.9998 | 0.9993 | 0.7724 | 1.1755 | 1.2243 | 1.1891 | | cspdarknet53 | 64 | 0.9573 | 0.9529 | 0.7344 | 1.1953 | 1.2208 | 1.234 | | rexnet_100 | 128 | 0.9734 | 0.8151 | 0.0 | 0.0 | 1.2135 | 1.22 | | pnasnet5large | 16 | 0.9997 | 0.9989 | 0.0 | 1.0652 | 1.2088 | 1.1929 | | gmlp_s16_224 | 128 | 0.9999 | 0.9503 | 0.0 | 0.0 | 1.2015 | 1.1909 | | twins_pcpvt_base | 64 | 1.0 | 0.9985 | 0.7538 | 0.0 | 1.2011 | 1.1713 | | tinynet_a | 128 | 0.9654 | 0.776 | 0.6202 | 0.0 | 1.1926 | 1.1997 | | dpn107 | 32 | 0.9573 | 0.9501 | 0.779 | 0.0 | 1.1897 | 1.1996 | | pit_b_224 | 64 | 1.0002 | 0.9996 | 0.0 | 0.5964 | 1.1872 | 1.1751 | | cait_m36_384 | 4 | 0.9997 | 1.0273 | 0.0 | 0.0 | 1.1816 | 1.1576 | | repvgg_a2 | 128 | 0.9641 | 0.9635 | 0.8267 | 1.1218 | 1.1721 | 1.1671 | | mobilevit_s | 64 | 0.9793 | 0.762 | 0.0 | 0.0 | 1.1697 | 1.1656 | | poolformer_m36 | 64 | 0.9999 | 0.9996 | 0.0 | 0.0 | 1.1659 | 1.1471 | | tf_mixnet_l | 128 | 0.9854 | 0.8896 | 0.0 | 1.0959 | 1.1638 | 1.1666 | | mixnet_l | 128 | 0.9846 | 0.8858 | 0.0 | 0.0 | 1.1515 | 1.1482 | | 
swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9788 | 0.0 | 0.0 | 1.141 | 1.1341 | | beit_base_patch16_224 | 64 | 0.9998 | 0.9819 | 0.0 | 0.5277 | 1.1153 | 1.1023 | | swsl_resnext101_32x16d | 32 | 1.0002 | 1.0002 | 0.0 | 0.0 | 1.1103 | 1.0709 | | deit_base_distilled_patch16_224 | 64 | 1.0002 | 0.9998 | 0.7685 | 0.5694 | 1.097 | 1.0837 | | gluon_xception65 | 32 | 1.0001 | 0.9973 | 0.0 | 0.0 | 1.0876 | 1.0739 | | vit_base_patch16_224 | 64 | 0.9994 | 0.9996 | 0.7685 | 0.5561 | 1.086 | 1.0752 | | convmixer_768_32 | 32 | 0.9998 | 0.9998 | 0.0 | 0.0 | 1.0777 | 1.0745 | | gernet_l | 128 | 0.9737 | 0.9732 | 0.8204 | 1.0958 | 1.0745 | 1.0699 | | mixer_b16_224 | 128 | 0.9991 | 0.9784 | 0.0 | 0.0 | 1.0679 | 1.0635 | | visformer_small | 128 | 0.9997 | 1.003 | 0.8002 | 0.0 | 1.0453 | 1.0129 | | resmlp_12_224 | 128 | 1.0001 | 0.8547 | 0.6119 | 0.8351 | 0.8201 | 0.8018 | | tnt_s_patch16_224 | 128 | 0.9997 | 0.9995 | 0.0 | 0.0 | 0.0 | 1.5449 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | convit_base | 2 | pass | pass | fail_to_run | 
pass | pass | pass | | convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | crossvit_9_240 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | gmlp_s16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | 
pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | resnest101e | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilevit_s | 64 | 1.5803 | 5.976 | nan | nan | 100.7134 | 99.2212 | | hrnet_w18 | 128 | 5.6065 | 24.4099 | nan | 992.3749 | 98.8856 | 93.1754 | | swin_base_patch4_window7_224 | 64 | 2.537 | 10.8877 | nan | nan | 90.5102 | 88.672 | | xcit_large_24_p8_224 | 5 | 2.7039 | 13.7787 | nan | nan | 82.6148 | 79.2271 | | twins_pcpvt_base | 64 | 2.0485 | 10.4914 | 18.7961 | nan | 82.0185 | 80.7442 | | convnext_base | 64 | 1.3572 | 5.4307 | nan | 174.9215 | 73.6456 | 71.1603 | | pnasnet5large | 16 | 4.334 | 17.7991 | nan | 504.2814 | 72.8294 | 69.6441 | | jx_nest_base | 32 | 1.669 | 7.5167 | nan | nan | 67.1457 | 65.4548 | | coat_lite_mini | 128 | 1.038 | 4.15 | 6.9253 | 119.7175 | 67.0858 | 66.0233 | | cait_m36_384 | 4 | 2.7713 | 14.4994 | nan | nan | 66.9488 | 64.1302 | | eca_halonext26ts | 128 | 1.4838 | 4.5239 | nan | nan | 62.0231 | 60.5173 | | resnest101e | 64 | 3.0562 | 13.6063 | nan | nan | 61.4086 | 60.5222 | | sebotnet33ts_256 | 64 | 1.5988 | 5.1323 | nan | nan | 58.9186 | 58.0232 | | res2net101_26w_4s | 64 | 2.9374 | 13.2399 | 22.9997 | 347.9112 | 51.6815 | 49.6329 | | eca_botnext26ts_256 | 128 | 1.3758 | 4.3601 | nan | nan | 50.5397 | 49.6467 | | res2net50_14w_8s | 128 | 2.5586 | 12.0444 | nan | 354.1437 | 47.3534 | 45.5608 | | gmlp_s16_224 | 128 | 0.9666 | 5.2165 | nan | nan | 45.4083 | 43.369 | | poolformer_m36 | 64 | 1.8521 | 7.1928 | nan | nan | 44.2069 | 42.5536 | | botnet26t_256 | 128 | 1.322 | 3.7908 | 8.5324 | nan | 44.014 | 42.9402 | | crossvit_9_240 | 128 | 1.4001 | 6.4622 | nan | nan | 43.5468 | 41.5953 | | volo_d1_224 | 64 | 1.3203 | 6.204 | nan | nan | 38.6094 | 36.9204 | | 
gluon_xception65 | 32 | 1.7798 | 8.7781 | nan | nan | 37.6714 | 35.0373 | | dpn107 | 32 | 3.9839 | 11.5267 | 35.6981 | nan | 37.0883 | 35.4228 | | fbnetv3_b | 128 | 3.073 | 9.3824 | 24.7182 | nan | 34.8733 | 32.8491 | | gmixer_24_224 | 128 | 1.067 | 6.044 | nan | nan | 33.6636 | 32.5216 | | swsl_resnext101_32x16d | 32 | 1.6696 | 7.5384 | nan | nan | 32.7836 | 30.3039 | | tf_mixnet_l | 128 | 5.7126 | 11.4465 | nan | 170.6917 | 32.6962 | 29.7682 | | gluon_inception_v3 | 128 | 1.5171 | 6.8499 | nan | 169.627 | 31.9884 | 30.6582 | | adv_inception_v3 | 128 | 1.5301 | 6.8961 | nan | 174.4463 | 31.9478 | 30.645 | | inception_v3 | 128 | 1.5073 | 6.8606 | nan | 173.3994 | 31.9319 | 30.5585 | | ghostnet_100 | 128 | 2.6921 | 8.0049 | 12.0255 | nan | 31.3898 | 29.6351 | | dla102 | 128 | 1.7023 | 7.7247 | nan | nan | 30.9593 | 27.9777 | | convit_base | 64 | 1.0857 | 4.7769 | nan | 149.6295 | 30.1584 | 28.8583 | | mixnet_l | 128 | 5.3012 | 10.8337 | nan | nan | 30.0739 | 28.3537 | | dm_nfnet_f0 | 128 | 2.0932 | 6.1946 | nan | 165.0695 | 29.3952 | 28.6965 | | res2next50 | 128 | 1.7329 | 6.834 | nan | 187.8375 | 27.5297 | 26.3327 | | resmlp_12_224 | 128 | 0.6938 | 2.402 | 5.1111 | 58.4165 | 26.1834 | 25.2927 | | rexnet_100 | 128 | 1.805 | 6.2789 | nan | nan | 25.8145 | 24.3747 | | tinynet_a | 128 | 2.0113 | 7.0141 | 17.3377 | nan | 25.1146 | 23.8997 | | visformer_small | 128 | 0.9341 | 3.566 | 5.4144 | nan | 25.1027 | 24.3265 | | convmixer_768_32 | 32 | 1.1136 | 4.9094 | nan | nan | 24.5701 | 22.7699 | | mixer_b16_224 | 128 | 0.6692 | 2.8895 | nan | nan | 24.0295 | 22.6455 | | cspdarknet53 | 64 | 2.2132 | 6.6236 | 16.9246 | 128.8333 | 22.1703 | 21.425 | | tf_efficientnet_b0 | 128 | 1.7784 | 5.6296 | nan | 138.4678 | 21.8944 | 21.2456 | | deit_base_distilled_patch16_224 | 64 | 0.8417 | 3.5778 | 5.6599 | 93.6667 | 21.1929 | 20.4483 | | vit_base_patch16_224 | 64 | 0.8121 | 3.625 | 5.7604 | 96.3198 | 21.0668 | 20.2179 | | nfnet_l0 | 128 | 1.7769 | 6.1515 | nan | nan | 21.0492 | 
19.7491 | | fbnetc_100 | 128 | 1.9692 | 5.6312 | 15.5132 | nan | 20.7786 | 19.5842 | | spnasnet_100 | 128 | 1.9331 | 5.586 | 15.4386 | 124.4346 | 20.5598 | 19.5373 | | beit_base_patch16_224 | 64 | 1.1296 | 4.6216 | nan | 118.1445 | 20.361 | 19.2449 | | mobilenetv3_large_100 | 128 | 1.5018 | 4.801 | 11.7282 | 132.432 | 19.3925 | 18.5079 | | pit_b_224 | 64 | 0.9761 | 3.876 | nan | 122.5178 | 18.3368 | 18.2255 | | mobilenetv2_100 | 128 | 1.5673 | 4.7471 | 11.8597 | 114.4144 | 18.2825 | 18.2445 | | repvgg_a2 | 128 | 1.9275 | 5.1841 | 13.9717 | 264.0671 | 17.8067 | 16.9663 | | mnasnet_100 | 128 | 1.5433 | 4.5184 | 11.7639 | 99.2619 | 17.6837 | 16.7456 | | gernet_l | 128 | 1.9336 | 5.1404 | 14.0576 | 105.6485 | 17.569 | 16.1919 | | regnety_002 | 128 | 1.5468 | 4.4861 | 11.8842 | 101.4667 | 17.1643 | 16.1694 | | selecsls42b | 128 | 0.7923 | 2.991 | 4.8693 | 79.6399 | 15.4973 | 14.6289 | | lcnet_050 | 128 | 0.9832 | 2.8907 | 6.7156 | 76.06 | 12.9161 | 11.9992 | | ese_vovnet19b_dw | 128 | 1.0399 | 2.62 | 6.0028 | 54.2528 | 12.3277 | 11.8415 | | tnt_s_patch16_224 | 128 | 1.5773 | 8.0902 | nan | nan | nan | 33.694 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | nan | 1.5552 | 1.6267 | | tinynet_a | 128 | 0.9943 | 0.7798 | 0.2617 | nan | 1.3515 | 1.5856 | | nfnet_l0 | 128 | 0.993 | 0.8275 | nan | nan | 1.2905 | 1.4934 | | rexnet_100 | 128 | 0.9938 | 0.7848 | nan | nan | 1.2631 | 1.4763 | | tf_efficientnet_b0 | 128 | 0.9936 | 0.7689 | nan | 
0.7729 | 1.206 | 1.3823 | | pnasnet5large | 16 | 1.0657 | 1.0089 | nan | 0.9609 | 1.183 | 1.3351 | | mobilevit_s | 64 | 0.9964 | 0.7671 | nan | nan | 1.1799 | 1.3596 | | mobilenetv2_100 | 128 | 0.9923 | 0.7619 | 0.3063 | 0.7644 | 1.1747 | 1.2829 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7674 | nan | nan | 1.1378 | 1.3608 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1375 | 1.3403 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1185 | 1.1746 | | poolformer_m36 | 64 | 0.9983 | 0.9509 | nan | nan | 1.0521 | 1.0698 | | dm_nfnet_f0 | 128 | 0.9357 | 0.894 | nan | 0.8793 | 1.0221 | 1.0963 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.9253 | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | nan | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 0.9131 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.993 | 0.7828 | 0.3098 | nan | 0.9932 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 0.9157 | 0.9925 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | nan | 0.9923 | 1.0857 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9847 | 0.9968 | | ghostnet_100 | 128 | 0.9866 | 0.8768 | 0.3271 | nan | 0.9842 | 1.1252 | | volo_d1_224 | 64 | 0.996 | 0.9213 | nan | nan | 0.9837 | 1.0658 | | mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | nan | 0.9827 | 1.0538 | | tf_mixnet_l | 128 | 0.9955 | 0.8572 | nan | 0.7712 | 0.9767 | 1.1453 | | gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | nan | 0.9766 | 0.9827 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | nan | nan | 0.9633 | 1.0573 | | dla102 | 128 | 0.9828 | 0.9169 | nan | nan | 0.9625 | 1.0421 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8868 | 0.3259 | 0.8551 | 0.951 | 1.0926 | | gluon_xception65 | 32 | 0.9975 | 0.9358 | nan | nan | 0.9412 | 0.9929 | | mobilenetv3_large_100 | 128 | 0.9874 | 0.8592 | 0.3245 | 0.7755 | 0.941 | 1.0413 | | hrnet_w18 | 128 | 0.9955 | 0.9252 | nan | 0.8513 | 0.9382 | 1.0121 | | spnasnet_100 | 128 | 0.9885 
| 0.9103 | 0.3308 | 0.8383 | 0.9379 | 0.9927 | | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | nan | 0.9348 | 1.0604 | | mnasnet_100 | 128 | 0.9877 | 0.9022 | 0.3305 | 0.8252 | 0.9325 | 0.9921 | | res2net101_26w_4s | 64 | 0.9967 | 0.9278 | 0.3243 | 0.8768 | 0.93 | 1.0167 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7725 | 0.9154 | 0.9655 | | cspdarknet53 | 64 | 0.9954 | 0.8613 | 0.3159 | 0.8261 | 0.9148 | 1.0666 | | gluon_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.914 | 1.063 | | adv_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.9139 | 1.063 | | inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.9139 | 1.063 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.867 | 0.9127 | 0.9981 | | res2next50 | 128 | 0.9955 | 0.9149 | nan | 0.8461 | 0.9075 | 1.0161 | | mixnet_l | 128 | 0.995 | 0.845 | nan | nan | 0.9069 | 1.0619 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9068 | 1.0515 | | fbnetc_100 | 128 | 0.989 | 0.8518 | 0.3236 | nan | 0.9049 | 0.9971 | | dpn107 | 32 | 0.9986 | 0.9268 | 0.3389 | nan | 0.9047 | 0.9908 | | visformer_small | 128 | 0.9944 | 0.9374 | 0.3291 | nan | 0.9029 | 0.9934 | | selecsls42b | 128 | 0.9885 | 0.8897 | 0.337 | 0.8775 | 0.8987 | 1.0049 | | swsl_resnext101_32x16d | 32 | 0.9992 | 0.8965 | nan | nan | 0.8911 | 0.9925 | | res2net50_14w_8s | 128 | 0.995 | 0.9047 | nan | 0.8422 | 0.8821 | 1.0211 | | regnety_002 | 128 | 0.9718 | 0.8105 | 0.3284 | 0.7203 | 0.8619 | 1.0399 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9624 | | pit_b_224 | 64 | 0.9968 | 0.7946 | nan | 0.7502 | 0.8563 | 1.0753 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.841 | 0.9709 | | coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | 0.7251 | 0.821 | 1.0246 | | gernet_l | 128 | 0.9884 | 0.7891 | 0.32 | 0.7891 | 0.7928 | 0.9932 | | resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | 0.6276 | 0.7899 | 0.7979 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6552 | 0.7684 | 0.9903 | | convit_base | 
64 | 0.9977 | 0.8838 | nan | 0.8573 | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8656 | nan | nan | 0.6584 | 0.8853 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8622 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/dwFTEnn.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/LUX4ubq.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/hchuBLT.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.
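As a rough illustration of the normalization described in the summary, the sketch below computes a speedup and a peak memory compression ratio relative to native PyTorch. The exact formulas are an assumption (eager value divided by backend value, so that higher is better in both cases); the function names and the sample measurements are hypothetical and are not taken from the benchmark scripts.

```python
def speedup(eager_seconds: float, backend_seconds: float) -> float:
    """Speedup relative to native PyTorch (eager); above 1.0 means faster.

    Assumed definition: eager latency divided by backend latency.
    """
    if backend_seconds <= 0:
        raise ValueError("backend_seconds must be positive")
    return eager_seconds / backend_seconds


def memory_compression_ratio(eager_peak_bytes: float, backend_peak_bytes: float) -> float:
    """Peak memory ratio relative to eager; above 1.0 means less memory used.

    Assumed definition: eager peak memory divided by backend peak memory.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend_peak_bytes must be positive")
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical measurements, for illustration only.
print(f"speedup: {speedup(0.30, 0.20):.2f}x")
print(f"compression: {memory_compression_ratio(8e9, 10e9):.2f}x")
```

Under this reading, a compression ratio below 1.00x means the backend used more peak memory than eager.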

To measure performance, compilation latency and memory footprint reduction, we exclude the models that fail the accuracy checks.
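A minimal sketch of that filtering step, followed by a geometric mean over the remaining speedups. The rows are copied from the inductor column of the timm_models float32 tables earlier in this thread. Treating `pass_due_to_skip` as passing and dropping `0.0` speedups (which appear to mark runs that did not complete) are assumptions, and `geomean_speedup` is a hypothetical helper, not part of the benchmark scripts.

```python
from statistics import geometric_mean

# Statuses counted as passing (assumption: skipped checks count as a pass).
PASSING = {"pass", "pass_due_to_skip"}

# (model, inductor accuracy status, inductor speedup), taken from a few rows
# of the timm_models float32 tables in this thread.
rows = [
    ("ghostnet_100", "pass", 1.8706),
    ("lcnet_050", "pass", 1.6577),
    ("resnest101e", "fail_accuracy", 1.3158),
    ("fbnetv3_b", "fail_accuracy", 1.2845),
    ("tnt_s_patch16_224", "pass", 0.0),
]


def geomean_speedup(rows):
    """Geometric mean speedup over models that pass accuracy and actually ran."""
    kept = [s for _, status, s in rows if status in PASSING and s > 0]
    return geometric_mean(kept) if kept else float("nan")


print(f"{geomean_speedup(rows):.2f}x over the kept models")
```

Only ghostnet_100 and lcnet_050 survive the filter here. A geometric mean is the usual choice for averaging ratios, and it would collapse to zero if a single `0.0` entry were kept, which is why the sketch drops those entries.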

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 52/53 | 100%, 42/42 | 100%, 61/61 |
|       aot_eager        | 98%, 52/53 | 100%, 42/42 | 95%, 58/61  |
|     aot_cudagraphs     | 77%, 41/53 | 60%, 25/42  | 79%, 48/61  |
|    nvprims_nvfuser     | 51%, 27/53 |  12%, 5/42  | 34%, 21/61  |
|        inductor        | 85%, 45/53 | 93%, 39/42  | 93%, 57/61  |
| inductor_no_cudagraphs | 91%, 48/53 | 93%, 39/42  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
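The entries in the pass rate table pair a percentage with the underlying counts. The sketch below reproduces that format from a pass count and a total. Rounding to the nearest whole percent is an assumption that happens to match the entries above, and `pass_rate` is a hypothetical helper name.

```python
def pass_rate(passed: int, total: int) -> str:
    """Format a pass rate as '<percent>%, <passed>/<total>'."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: the percentage is rounded to the nearest integer.
    return f"{round(100 * passed / total)}%, {passed}/{total}"


print(pass_rate(52, 53))  # 98%, 52/53
print(pass_rate(5, 42))   # 12%, 5/42
```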

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.12x    |    1.05x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.00x    |    1.11x    |
|        inductor        |   1.70x    |    1.81x    |    1.40x    |
| inductor_no_cudagraphs |   1.39x    |    1.54x    |    1.37x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.36    |    2.73     |    2.13     |
|       aot_eager        |    6.94    |    10.08    |    8.49     |
|     aot_cudagraphs     |   10.31    |    18.21    |    16.74    |
|    nvprims_nvfuser     |   68.70    |   131.09    |   160.24    |
|        inductor        |   34.52    |    37.14    |    44.56    |
| inductor_no_cudagraphs |   34.02    |    32.54    |    42.93    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.41x    |    0.39x    |    0.32x    |
|    nvprims_nvfuser     |   0.75x    |    0.79x    |    0.69x    |
|        inductor        |   0.84x    |    0.88x    |    0.95x    |
| inductor_no_cudagraphs |   0.97x    |    1.05x    |    1.06x    |
+------------------------+------------+-------------+-------------+

torchbench suite with amp precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| densenet121 | 4 | 1.0029 | 0.9123 | 1.8806 | 0.6682 | 5.2683 | 1.4682 |
| timm_efficientdet | 1 | 0.985 | 0.7811 | 0.0 | 0.0 | 4.2531 | 1.7445 |
| BERT_pytorch | 16 | 1.0125 | 0.8412 | 0.0 | 0.0 | 3.5248 | 2.3558 |
| functorch_dp_cifar10 | 64 | 1.0022 | 0.9345 | 1.5904 | 0.0 | 3.4954 | 1.5276 |
| timm_vision_transformer | 8 | 1.0048 | 0.8451 | 1.7758 | 0.6174 | 3.1607 | 1.5531 |
| hf_T5_large | 2 | 1.0192 | 0.8572 | 0.0 | 0.0 | 2.5888 | 2.1297 |
| drq | 1 | 0.9972 | 0.7917 | 1.6399 | 0.5614 | 2.4573 | 1.1864 |
| mobilenet_v3_large | 32 | 1.0054 | 1.0018 | 1.1984 | 0.7412 | 2.4397 | 1.5711 |
| resnext50_32x4d | 8 | 1.0008 | 0.9516 | 1.306 | 0.6831 | 2.4348 | 1.3904 |
| hf_Albert | 8 | 1.0006 | 0.9561 | 0.774 | 0.0 | 2.3789 | 2.3245 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9974 | 0.9742 | 1.5992 | 0.0 | 2.2629 | 1.8056 |
| mnasnet1_0 | 32 | 1.0001 | 1.0211 | 1.0296 | 0.734 | 2.0915 | 1.5444 |
| lennard_jones | 1000 | 0.9737 | 0.7664 | 1.2568 | 0.4467 | 2.0653 | 1.0554 |
| pytorch_struct | 200 | 1.0007 | 0.7381 | 1.0155 | 0.624 | 2.0623 | 1.2757 |
| hf_Bert | 4 | 1.0395 | 0.8499 | 0.9352 | 0.0 | 2.0049 | 1.839 |
| resnet18 | 16 | 1.0013 | 1.023 | 1.1376 | 0.7263 | 1.9881 | 1.3849 |
| hf_GPT2 | 4 | 1.023 | 0.9867 | 0.0 | 0.3022 | 1.9344 | 1.8907 |
| hf_T5 | 8 | 0.9998 | 0.9268 | 0.0 | 1.1446 | 1.8842 | 1.8866 |
| timm_resnest | 32 | 0.9978 | 0.9858 | 0.812 | 0.0 | 1.86 | 1.7146 |
| squeezenet1_1 | 32 | 0.9987 | 0.9609 | 1.1073 | 0.6723 | 1.8158 | 1.4605 |
| hf_Bart | 4 | 1.0115 | 0.8347 | 0.0 | 0.0 | 1.7908 | 1.7212 |
| dcgan | 32 | 0.9658 | 0.9034 | 1.1203 | 0.6054 | 1.7195 | 1.0932 |
| speech_transformer | 32 | 1.0057 | 0.8574 | 1.9226 | 0.0 | 1.6897 | 1.6932 |
| timm_efficientnet | 32 | 0.9541 | 0.7729 | 0.8631 | 0.0 | 1.6748 | 1.3675 |
| hf_DistilBert | 8 | 1.0009 | 0.9721 | 0.7676 | 0.298 | 1.5744 | 1.4835 |
| mobilenet_v2 | 96 | 1.0 | 0.9898 | 0.7611 | 0.0 | 1.5605 | 1.5015 |
| attention_is_all_you_need_pytorch | 256 | 1.0106 | 0.9147 | 0.0 | 0.6266 | 1.553 | 1.4983 |
| shufflenet_v2_x1_0 | 128 | 0.9978 | 1.0432 | 0.8759 | 0.8678 | 1.5275 | 1.383 |
| LearningToPaint | 96 | 1.0011 | 1.0141 | 0.916 | 0.8163 | 1.5265 | 1.3567 |
| fastNLP_Bert | 6 | 1.0 | 0.8936 | 0.7653 | 0.0 | 1.5263 | 1.473 |
| timm_nfnet | 128 | 0.9992 | 0.999 | 0.0 | 1.0799 | 1.5053 | 1.4312 |
| soft_actor_critic | 256 | 0.9614 | 0.7412 | 1.3142 | 0.5436 | 1.49 | 1.0271 |
| resnet50 | 32 | 1.0004 | 1.048 | 0.8197 | 0.806 | 1.3651 | 1.2964 |
| pytorch_stargan | 16 | 0.9972 | 1.1218 | 0.9635 | 0.0 | 1.3644 | 1.3103 |
| pytorch_unet | 1 | 0.9986 | 0.9918 | 0.8617 | 1.0847 | 1.3373 | 1.3129 |
| Super_SloMo | 6 | 0.9995 | 0.9948 | 0.8852 | 0.0 | 1.2888 | 1.2588 |
| vgg16 | 64 | 0.9998 | 0.9977 | 0.8572 | 0.9707 | 1.2739 | 1.2607 |
| Background_Matting | 4 | 0.9999 | 1.019 | 0.8909 | 1.0651 | 1.2346 | 1.2193 |
| alexnet | 128 | 0.999 | 0.9975 | 0.8153 | 0.9261 | 1.208 | 1.2089 |
| hf_Reformer | 4 | 0.9952 | 0.9906 | 0.9373 | 0.0 | 1.1596 | 1.1497 |
| timm_vision_transformer_large | 8 | 1.0 | 0.9903 | 0.0 | 0.0 | 1.1572 | 1.1376 |
| hf_BigBird | 2 | 0.9915 | 0.9144 | 1.0493 | 0.837 | 1.151 | 1.027 |
| timm_regnet | 32 | 0.9488 | 0.9279 | 0.7662 | 0.8285 | 1.1275 | 1.0685 |
| yolov3 | 16 | 0.9995 | 0.9901 | 0.8025 | 0.0 | 1.0901 | 1.066 |
| timm_vovnet | 32 | 0.9039 | 0.8769 | 0.7237 | 0.7852 | 1.0768 | 1.1245 |
| tts_angular | 64 | 0.9814 | 0.9554 | 0.9831 | 0.9511 | 1.0172 | 1.0125 |
| demucs | 4 | 0.9993 | 0.9978 | 0.9996 | 1.0004 | 1.0004 | 0.9998 |
| nvidia_deeprecommender | 256 | 0.9988 | 0.996 | 0.6964 | 1.0184 | 0.989 | 1.0302 |
| hf_GPT2_large | 4 | 1.0002 | 0.9931 | 0.0 | 0.0 | 0.0 | 1.8617 |
| dlrm | 2048 | 1.0071 | 1.07 | 0.0 | 1.1362 | 0.0 | 1.2382 |
| tacotron2 | 64 | 0.9753 | 0.7396 | 0.9445 | 0.5789 | 0.0 | 0.8532 |
| hf_Longformer | 2 | 0.9548 | 0.8658 | 0.8861 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | fail_to_run | pass | pass | pass |
| dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| hf_T5 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass |
| timm_resnest | 2 | pass | pass | pass | fail_to_run | pass | pass |
| yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| hf_Bart | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass |
| timm_nfnet | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| hf_BigBird | 2 | pass | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Bert | 2 | pass | pass | pass | pass | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass |
| hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_accuracy |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 |
| mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3 | 16 | 3.0551 | 8.2705 | 11.9694 | nan | 404.0512 | 400.5272 |
| timm_efficientdet | 1 | 20.0986 | 36.9452 | nan | nan | 181.0065 | 177.39 |
| hf_T5_large | 2 | 14.2944 | 40.4703 | nan | nan | 126.1417 | 124.5608 |
| timm_vision_transformer_large | 8 | 2.8702 | 14.9882 | nan | nan | 79.3261 | 78.2292 |
| timm_resnest | 32 | 0.6373 | 2.5166 | 3.8409 | nan | 65.1173 | 63.749 |
| timm_vision_transformer | 8 | 0.9538 | 4.5377 | 6.7128 | 89.8577 | 50.5636 | 50.7287 |
| densenet121 | 4 | 2.2659 | 12.1592 | 19.1991 | 374.1897 | 47.4262 | 46.7948 |
| pytorch_stargan | 16 | 0.4179 | 1.9947 | 2.9492 | nan | 47.0992 | 49.3488 |
| hf_BigBird | 2 | 8.3128 | 15.2347 | 30.5906 | 117.7334 | 44.6108 | 29.3758 |
| attention_is_all_you_need_pytorch | 256 | 1.322 | 7.0691 | nan | 163.9444 | 44.3986 | 43.4331 |
| BERT_pytorch | 16 | 1.7188 | 7.7794 | nan | nan | 39.8525 | 38.4993 |
| pytorch_struct | 200 | 0.2864 | 0.8537 | 1.4643 | 7.6108 | 35.0095 | 36.0667 |
| hf_Bart | 4 | 1.8242 | 8.9575 | nan | nan | 34.2189 | 32.8093 |
| mobilenet_v3_large | 32 | 0.981 | 4.7585 | 7.2409 | 148.5016 | 30.6943 | 30.3509 |
| hf_T5 | 8 | 2.4757 | 8.7652 | nan | 101.5623 | 30.6703 | 29.1808 |
| timm_nfnet | 128 | 2.1193 | 7.2178 | nan | 174.2242 | 30.4685 | 29.4305 |
| timm_regnet | 32 | 2.3529 | 7.903 | 19.8092 | 193.3708 | 29.765 | 28.6234 |
| fastNLP_Bert | 6 | 1.8376 | 7.1098 | 11.2853 | nan | 29.5312 | 27.8874 |
| timm_efficientnet | 32 | 1.8673 | 6.8084 | 15.6737 | nan | 28.7022 | 28.409 |
| speech_transformer | 32 | 1.9368 | 9.0476 | 67.4148 | nan | 26.9693 | 26.6609 |
| hf_Reformer | 4 | 2.496 | 4.6586 | 9.0728 | nan | 26.5917 | 21.0926 |
| mnasnet1_0 | 32 | 0.8924 | 4.3798 | 6.6042 | 109.7858 | 23.3259 | 22.5575 |
| resnext50_32x4d | 8 | 0.9553 | 4.5636 | 6.7393 | 112.0758 | 22.7441 | 22.1423 |
| hf_Albert | 8 | 1.404 | 6.4451 | 9.9566 | nan | 22.5505 | 22.1413 |
| resnet50 | 32 | 0.9387 | 4.5142 | 6.8416 | 124.9723 | 22.3364 | 21.9524 |
| timm_vovnet | 32 | 1.5241 | 4.3534 | 9.9437 | 77.6476 | 22.3006 | 21.555 |
| hf_GPT2 | 4 | 1.6104 | 6.3503 | nan | 80.7061 | 21.6341 | 20.8164 |
| hf_Bert | 4 | 1.6523 | 6.9093 | 10.0522 | nan | 20.5768 | 20.3234 |
| shufflenet_v2_x1_0 | 128 | 1.0829 | 5.0908 | 7.5299 | 124.5665 | 19.7158 | 19.8303 |
| Super_SloMo | 6 | 1.0861 | 4.6513 | 6.4975 | nan | 19.0008 | 18.5585 |
| mobilenet_v2 | 96 | 0.8511 | 4.5361 | 6.8953 | nan | 18.6629 | 18.9615 |
| Background_Matting | 4 | 1.001 | 4.6047 | 6.7624 | 94.2167 | 18.5161 | 17.7744 |
| functorch_dp_cifar10 | 64 | 0.3834 | 1.6597 | 2.4628 | nan | 16.7074 | 16.2944 |
| hf_DistilBert | 8 | 0.6236 | 3.4005 | 7.7697 | 54.3262 | 15.4004 | 14.5265 |
| resnet18 | 16 | 0.4448 | 1.7963 | 2.6095 | 43.3137 | 14.1331 | 14.0224 |
| pytorch_unet | 1 | 0.5016 | 2.0435 | 3.008 | 42.3848 | 9.0794 | 8.6099 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.4286 | 1.9679 | 2.914 | nan | 9.0695 | 8.6683 |
| LearningToPaint | 96 | 0.4644 | 1.9258 | 2.8681 | 54.8198 | 8.2427 | 7.6678 |
| squeezenet1_1 | 32 | 0.2614 | 0.9656 | 1.4073 | 6.6937 | 4.6883 | 4.377 |
| vgg16 | 64 | 0.1819 | 0.6575 | 1.1093 | 5.8506 | 4.1283 | 3.8035 |
| drq | 1 | 0.162 | 0.4862 | 0.8772 | 6.4039 | 4.072 | 3.5474 |
| nvidia_deeprecommender | 256 | 0.2309 | 0.5288 | 0.8091 | 7.2115 | 3.9591 | 3.5653 |
| soft_actor_critic | 256 | 0.2144 | 0.3645 | 0.5417 | 3.9129 | 3.5713 | 2.9417 |
| alexnet | 128 | 0.1738 | 0.4402 | 0.7415 | 5.4695 | 3.4132 | 3.2329 |
| dcgan | 32 | 0.1785 | 0.4279 | 0.66 | 6.0006 | 2.8483 | 2.7188 |
| lennard_jones | 1000 | 0.1602 | 0.3419 | 0.5351 | 4.0467 | 2.1477 | 1.9437 |
| tts_angular | 64 | 0.2285 | 0.2763 | 0.4028 | 1.4687 | 1.9327 | 1.7092 |
| demucs | 4 | 0.3584 | 0.3541 | 0.3506 | 0.3512 | 0.2701 | 0.2602 |
| tacotron2 | 64 | 18.1068 | 32.3032 | 50.9003 | 107.0881 | nan | 68.7307 |
| hf_GPT2_large | 4 | 5.5979 | 21.3071 | nan | nan | nan | 52.9257 |
| dlrm | 2048 | 0.4665 | 0.845 | nan | 5.9234 | nan | 3.2649 |
| hf_Longformer | 2 | 6.4499 | 14.2281 | 57.718 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| timm_efficientnet | 32 | 0.9892 | 0.7707 | 0.2719 | nan | 1.2042 | 1.2299 |
| hf_Albert | 8 | 0.9814 | 0.936 | 0.3267 | nan | 1.1576 | 1.4693 |
| speech_transformer | 32 | 1.0017 | 0.9174 | 0.3316 | nan | 1.1102 | 1.1145 |
| mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.312 | nan | 1.0603 | 1.1512 |
| Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3843 | nan | 1.0536 | 1.2945 |
| timm_nfnet | 128 | 0.9691 | 0.8985 | nan | 0.7873 | 1.0337 | 1.1293 |
| attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | nan | 0.7853 | 1.0179 | 1.1759 |
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 |
| Background_Matting | 4 | 1.0059 | 0.9548 | 0.3708 | 0.9233 | 0.9831 | 1.0338 |
| timm_efficientdet | 1 | 1.0254 | 0.8401 | nan | nan | 0.9822 | 1.011 |
| BERT_pytorch | 16 | 1.0 | 0.8822 | nan | nan | 0.9743 | 1.1226 |
| hf_GPT2 | 4 | 0.9706 | 0.8847 | nan | 0.8601 | 0.9649 | 1.1243 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0018 | 0.8727 | 0.424 | nan | 0.951 | 1.0082 |
| timm_regnet | 32 | 0.9945 | 0.8449 | 0.35 | 0.639 | 0.9372 | 1.0312 |
| hf_T5 | 8 | 0.9678 | 0.9331 | nan | 0.9059 | 0.9309 | 1.252 |
| pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3572 | 0.7273 | 0.911 | 1.0853 |
| yolov3 | 16 | 0.985 | 0.8338 | 0.3517 | nan | 0.901 | 1.0402 |
| timm_vision_transformer_large | 8 | 0.9973 | 0.8358 | nan | nan | 0.879 | 0.9541 |
| densenet121 | 4 | 0.9883 | 0.866 | 0.3662 | 0.7966 | 0.876 | 1.0026 |
| timm_resnest | 32 | 0.9875 | 0.8721 | 0.3485 | nan | 0.876 | 0.9969 |
| hf_Bert | 4 | 0.9844 | 0.8753 | 0.3902 | nan | 0.8735 | 0.942 |
| squeezenet1_1 | 32 | 0.9595 | 0.7951 | 0.346 | 0.5757 | 0.8731 | 1.0627 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8419 | 0.3593 | 0.6692 | 0.8727 | 0.9966 |
| fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8657 | 1.0681 |
| resnet50 | 32 | 0.9888 | 0.8617 | 0.3559 | 0.733 | 0.8647 | 0.8839 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 |
| hf_DistilBert | 8 | 0.9505 | 0.8806 | 0.3412 | 0.8347 | 0.8384 | 0.9049 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.6247 | 0.8283 | 0.9695 |
| hf_Bart | 4 | 0.9102 | 0.831 | nan | nan | 0.8232 | 0.9878 |
| hf_BigBird | 2 | 0.9837 | 0.9784 | 0.4543 | 0.9208 | 0.8111 | 1.096 |
| alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.7444 | 0.7973 | 1.0079 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8503 | 0.3453 | 0.6025 | 0.7902 | 0.816 |
| pytorch_stargan | 16 | 0.9952 | 0.9707 | 0.4259 | nan | 0.7794 | 0.8863 |
| timm_vovnet | 32 | 0.9895 | 0.7676 | 0.3403 | 0.7217 | 0.7791 | 0.8856 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.3776 | 0.6125 | 0.7633 | 1.0588 |
| resnext50_32x4d | 8 | 0.9947 | 0.8545 | 0.388 | 0.7725 | 0.7622 | 0.7746 |
| mnasnet1_0 | 32 | 0.9788 | 0.8617 | 0.3406 | 0.6588 | 0.7529 | 0.7734 |
| drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8295 | 0.752 | 0.9256 |
| LearningToPaint | 96 | 0.9245 | 0.7232 | 0.3844 | 0.5479 | 0.7365 | 0.9262 |
| soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4736 | 0.878 | 0.7295 | 1.0367 |
| timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3915 | 0.8515 | 0.7151 | 0.7249 |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.3949 | 0.5571 | 0.6102 | 0.6257 |
| lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | 0.4478 | 0.4806 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5081 | 0.4235 | 0.4353 |
| hf_Reformer | 4 | 0.3764 | 0.9847 | 0.3481 | nan | 0.3629 | 0.9878 |
| hf_GPT2_large | 4 | 0.9582 | 0.8718 | nan | nan | nan | 1.1354 |
| dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | nan | 0.7306 |
| tacotron2 | 64 | 0.9866 | 0.3963 | 0.3142 | 0.3471 | nan | 0.4114 |
| hf_Longformer | 2 | 0.9734 | 0.967 | 0.349 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
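
A per-backend summary of a speedup table like the one above can be computed with a short script. The sketch below is illustrative only and is not taken from the dashboard tooling: `parse_rows` and `geomean_speedup` are hypothetical helper names, the three sample rows are copied from the speedup table, and it assumes that a speedup of 0.0 marks a run that did not complete (the `nan` compilation latencies in the same cells suggest this, but the comment does not state it).

```python
import math

# Three rows copied from the speedup table above; cell padding is irrelevant.
SAMPLE = """
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
| densenet121 | 4 | 1.0029 | 0.9123 | 1.8806 | 0.6682 | 5.2683 | 1.4682 |
| timm_efficientdet | 1 | 0.985 | 0.7811 | 0.0 | 0.0 | 4.2531 | 1.7445 |
| hf_Longformer | 2 | 0.9548 | 0.8658 | 0.8861 | 0.0 | 0.0 | 0.0 |
"""


def parse_rows(text):
    """Turn pipe-delimited table lines into dicts keyed by the header row.

    Border lines (starting with '+') and fence lines are skipped because
    only lines starting with '|' are kept.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    header, body = cells[0], cells[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(rows, backend):
    """Geometric mean of the positive speedups for one backend.

    Assumption: 0.0 means the model did not run on that backend, so such
    entries are counted separately instead of being averaged in.
    """
    values = [float(r[backend]) for r in rows]
    ok = [v for v in values if v > 0.0]
    failed = len(values) - len(ok)
    if not ok:
        return float("nan"), failed
    return math.exp(sum(math.log(v) for v in ok) / len(ok)), failed


rows = parse_rows(SAMPLE)
gm, failed = geomean_speedup(rows, "inductor")
print(f"inductor: geomean={gm:.4f} over {len(rows) - failed} models, {failed} zero entries")
```

Reporting the zero count next to the mean keeps a backend that fails on many models from looking better than one that runs everything at a modest speedup.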

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0248 | 0.8096 | 0.0 | 0.0 | 5.4064 | 1.6345 |
| MobileBertForMaskedLM | 32 | 1.0169 | 0.8256 | 0.0 | 0.6972 | 4.5957 | 1.774 |
| MT5ForConditionalGeneration | 8 | 1.0205 | 0.8635 | 0.0 | 0.0 | 3.9333 | 2.4533 |
| MobileBertForQuestionAnswering | 64 | 1.0196 | 0.8244 | 0.0 | 0.0 | 3.7504 | 1.7793 |
| CamemBert | 1 | 1.035 | 0.8291 | 1.7417 | 0.0 | 3.6211 | 1.7747 |
| DistillGPT2 | 1 | 1.0288 | 0.8684 | 1.3107 | 0.2485 | 3.1566 | 1.9936 |
| MegatronBertForQuestionAnswering | 16 | 1.0319 | 0.8541 | 1.0784 | 0.0 | 2.3586 | 1.8067 |
| PLBartForConditionalGeneration | 16 | 1.0196 | 0.8262 | 0.0 | 0.0 | 2.3474 | 1.7369 |
| GPT2ForSequenceClassification | 4 | 1.0004 | 0.9762 | 0.0 | 0.5016 | 2.3273 | 2.2752 |
| M2M100ForConditionalGeneration | 8 | 1.0604 | 0.8585 | 1.3156 | 0.0 | 2.2936 | 1.824 |
| ElectraForQuestionAnswering | 64 | 1.0006 | 0.978 | 0.7676 | 0.0 | 2.1278 | 2.066 |
| T5Small | 1 | 1.0195 | 0.8765 | 0.0 | 0.0 | 1.883 | 1.4207 |
| LayoutLMForSequenceClassification | 16 | 1.0001 | 0.979 | 0.7744 | 0.0 | 1.8596 | 1.8019 |
| MegatronBertForCausalLM | 16 | 1.0339 | 0.8559 | 0.9669 | 0.0 | 1.8431 | 1.7772 |
| PegasusForConditionalGeneration | 16 | 1.0083 | 0.83 | 0.9117 | 0.0 | 1.8403 | 1.5561 |
| ElectraForCausalLM | 32 | 0.9998 | 0.9428 | 0.7184 | 0.0 | 1.8286 | 1.8318 |
| XGLMForCausalLM | 8 | 1.0103 | 0.8217 | 0.9313 | 0.2567 | 1.7822 | 1.592 |
| MBartForConditionalGeneration | 16 | 1.0091 | 0.8374 | 0.0 | 0.0 | 1.7109 | 1.5821 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.8859 | 0.0 | 0.0 | 1.6673 | 1.6564 |
| AlbertForMaskedLM | 4 | 1.0002 | 0.8855 | 0.0 | 0.0 | 1.6553 | 1.6459 |
| LayoutLMForMaskedLM | 16 | 1.0004 | 0.9711 | 0.7558 | 0.0 | 1.6522 | 1.6375 |
| T5ForConditionalGeneration | 4 | 1.0073 | 0.9249 | 0.0 | 0.9715 | 1.6158 | 1.5837 |
| OPTForCausalLM | 32 | 1.0108 | 0.9292 | 0.0 | 0.3045 | 1.5537 | 1.534 |
| Speech2Text2ForCausalLM | 128 | 1.002 | 0.9432 | 0.7172 | 0.0 | 1.5431 | 1.549 |
| RobertaForQuestionAnswering | 128 | 0.9998 | 0.9841 | 0.7804 | 0.0 | 1.5065 | 1.4773 |
| DistilBertForQuestionAnswering | 64 | 1.0009 | 0.9697 | 0.7429 | 0.2953 | 1.4991 | 1.4511 |
| DebertaForMaskedLM | 4 | 0.928 | 0.722 | 0.8054 | 0.0 | 1.4985 | 1.1526 |
| BertForQuestionAnswering | 128 | 0.9995 | 0.9751 | 0.7791 | 0.0 | 1.4959 | 1.4688 |
| BartForConditionalGeneration | 2 | 1.0041 | 0.9603 | 0.0 | 0.3114 | 1.4617 | 1.4232 |
| RobertaForCausalLM | 64 | 1.0002 | 0.959 | 0.753 | 0.0 | 1.447 | 1.4285 |
| BartForCausalLM | 4 | 1.0013 | 0.9689 | 0.0 | 0.0 | 1.4435 | 1.4435 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0076 | 0.9262 | 0.0 | 0.0 | 1.4431 | 1.4344 |
| BertForMaskedLM | 64 | 1.0 | 0.9574 | 0.7405 | 0.0 | 1.3571 | 1.3345 |
| PLBartForCausalLM | 32 | 1.0063 | 0.9405 | 0.8304 | 0.0 | 1.284 | 1.2926 |
| BlenderbotSmallForCausalLM | 64 | 1.0021 | 0.9273 | 0.7171 | 0.0 | 1.2705 | 1.2757 |
| DistilBertForMaskedLM | 64 | 1.0009 | 0.952 | 0.7094 | 0.3197 | 1.2663 | 1.2677 |
| MBartForCausalLM | 32 | 1.0017 | 0.9465 | 0.0 | 0.0 | 1.2531 | 1.1972 |
| TrOCRForCausalLM | 32 | 1.0016 | 0.9482 | 0.0 | 0.0 | 1.1932 | 1.1916 |
| PegasusForCausalLM | 32 | 0.9994 | 0.9546 | 0.7507 | 0.0 | 1.1832 | 1.1834 |
| BigBird | 1 | 0.9873 | 0.9075 | 1.046 | 0.0 | 1.1504 | 1.0244 |
| DebertaForQuestionAnswering | 8 | 0.993 | 0.8952 | 0.7235 | 0.0 | 1.1447 | 1.2392 |
| AllenaiLongformerBase | 1 | 0.9362 | 0.7344 | 0.8571 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| XGLMForCausalLM | 8 | 2.7763 | 13.2678 | 52.568 | 269.1301 | 103.8703 | 101.2533 |
| DebertaForQuestionAnswering | 8 | 5.38 | 10.9874 | 36.5472 | nan | 100.5832 | 39.878 |
| DebertaForMaskedLM | 4 | 5.2481 | 11.0669 | 36.595 | nan | 99.1277 | 37.7527 |
| M2M100ForConditionalGeneration | 8 | 3.358 | 16.2933 | 24.7226 | nan | 82.5414 | 77.5567 |
| MobileBertForMaskedLM | 32 | 9.4557 | 32.5689 | nan | 492.5966 | 68.9158 | 68.1263 |
| MobileBertForQuestionAnswering | 64 | 9.5005 | 32.6131 | nan | nan | 67.1142 | 65.3654 |
| PegasusForConditionalGeneration | 16 | 3.4193 | 16.7618 | 27.2949 | nan | 54.1491 | 50.0072 |
| YituTechConvBert | 1 | 2.4668 | 10.4659 | nan | nan | 52.6095 | 52.3874 |
| BartForConditionalGeneration | 2 | 3.4516 | 16.579 | nan | 402.7854 | 51.0516 | 49.4442 |
| MBartForConditionalGeneration | 16 | 3.5568 | 17.143 | nan | nan | 50.7604 | 49.3891 |
| BigBird | 1 | 8.4262 | 15.4657 | 30.4566 | nan | 44.1037 | 29.3955 |
| MegatronBertForCausalLM | 16 | 3.4904 | 14.3846 | 22.3266 | nan | 41.4969 | 40.3389 |
| MT5ForConditionalGeneration | 8 | 3.9097 | 13.0054 | nan | nan | 40.7219 | 39.7461 |
| MegatronBertForQuestionAnswering | 16 | 3.6465 | 14.2737 | 22.1787 | nan | 40.4733 | 39.0041 |
| T5Small | 1 | 2.4009 | 8.7819 | nan | nan | 36.9071 | 35.9476 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.229 | 10.9979 | nan | nan | 36.4872 | 34.9881 |
| LayoutLMForSequenceClassification | 16 | 2.0556 | 7.3941 | 11.2385 | nan | 32.7621 | 32.905 |
| T5ForConditionalGeneration | 4 | 2.444 | 8.5421 | nan | 101.7905 | 32.2953 | 31.6754 |
| PLBartForConditionalGeneration | 16 | 1.8132 | 8.7492 | nan | nan | 32.1922 | 31.7373 |
| ElectraForCausalLM | 32 | 1.6857 | 7.0159 | 11.1598 | nan | 29.1629 | 27.1642 |
| PegasusForCausalLM | 32 | 1.3055 | 6.3547 | 10.1303 | nan | 26.7318 | 24.9269 |
| MBartForCausalLM | 32 | 1.2362 | 6.4772 | nan | nan | 25.1285 | 23.8111 |
| LayoutLMForMaskedLM | 16 | 2.0709 | 7.4371 | 11.4446 | nan | 24.3638 | 22.892 |
| TrOCRForCausalLM | 32 | 1.2481 | 6.2227 | nan | nan | 23.9729 | 22.9863 |
| BartForCausalLM | 4 | 1.315 | 6.4863 | nan | nan | 23.9422 | 22.7904 |
| OPTForCausalLM | 32 | 1.3086 | 6.4171 | nan | 110.2934 | 23.615 | 22.8173 |
| BertForMaskedLM | 64 | 1.6024 | 6.8893 | 10.4971 | nan | 23.5486 | 22.3237 |
| ElectraForQuestionAnswering | 64 | 1.694 | 7.0575 | 10.5839 | nan | 22.9687 | 21.8313 |
| RobertaForCausalLM | 64 | 1.6046 | 7.0787 | 10.4319 | nan | 21.8595 | 21.1823 |
| BertForQuestionAnswering | 128 | 1.6165 | 7.1029 | 10.4375 | nan | 21.2673 | 20.5205 |
| GPT2ForSequenceClassification | 4 | 1.5539 | 6.6652 | nan | 84.4003 | 20.6516 | 19.9837 |
| CamemBert | 1 | 1.752 | 6.8921 | 10.1026 | nan | 20.4039 | 20.0384 |
| RobertaForQuestionAnswering | 128 | 1.628 | 6.9066 | 10.6611 | nan | 19.8156 | 18.7631 |
| AlbertForMaskedLM | 4 | 1.4651 | 6.581 | nan | nan | 19.4997 | 18.4709 |
| BlenderbotSmallForCausalLM | 64 | 0.8606 | 4.1511 | 6.5469 | nan | 19.2447 | 18.7193 |
| AlbertForQuestionAnswering | 4 | 1.4395 | 6.7452 | nan | nan | 18.9033 | 17.4617 |
| Speech2Text2ForCausalLM | 128 | 0.6924 | 3.2754 | 5.541 | nan | 17.439 | 16.2911 |
| DistillGPT2 | 1 | 0.7913 | 3.2972 | 4.6709 | 56.0197 | 16.7913 | 17.0101 |
| PLBartForCausalLM | 32 | 0.6331 | 3.2931 | 5.2445 | nan | 16.6319 | 16.5035 |
| DistilBertForMaskedLM | 64 | 0.6242 | 3.2924 | 7.294 | 57.06 | 14.0913 | 13.7519 |
| DistilBertForQuestionAnswering | 64 | 0.6424 | 3.3336 | 7.5351 | 55.2001 | 13.4076 | 12.984 |
| AllenaiLongformerBase | 1 | 6.8848 | 15.1818 | 59.0666 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 |
| AlbertForMaskedLM | 4 | 0.9998 | 0.7431 | nan | nan | 1.0992 | 1.5169 |
| GPT2ForSequenceClassification | 4 | 0.9675 | 0.9164 | nan | 0.8823 | 1.0775 | 1.1632 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | nan | nan | 1.0568 | 1.1144 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | 0.8978 | 0.9837 | 1.1976 |
| PegasusForCausalLM | 32 | 0.9749 | 0.8906 | 0.4175 | nan | 0.9708 | 1.0363 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9527 | nan | 0.8385 | 0.9662 | 1.1856 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | nan | nan | 0.9593 | 1.1105 |
| T5Small | 1 | 1.0 | 0.8865 | nan | nan | 0.9567 | 1.1277 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3662 | nan | 0.9481 | 0.9848 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | nan | nan | 0.9417 | 1.0114 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3786 | nan | 0.9293 | 0.9793 |
| RobertaForCausalLM | 64 | 0.999 | 0.8994 | 0.3788 | nan | 0.9289 | 0.9789 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3468 | 0.81 | 0.9267 | 1.0655 |
| OPTForCausalLM | 32 | 0.9996 | 0.8679 | nan | 0.6772 | 0.925 | 1.0061 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | nan | nan | 0.9218 | 1.0986 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | nan | nan | 0.921 | 0.9877 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.4377 | nan | 0.9159 | 1.0993 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0179 |
| MegatronBertForCausalLM | 16 | 0.9998 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0276 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | nan | nan | 0.8843 | 1.0294 |
| DistilBertForMaskedLM | 64 | 0.9999 | 0.8599 | 0.3635 | 0.6373 | 0.8803 | 0.948 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | nan | nan | 0.8751 | 0.919 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8196 | 0.3532 | nan | 0.8691 | 0.9801 |
| ElectraForCausalLM | 32 | 0.9974 | 0.848 | 0.3928 | nan | 0.856 | 0.9327 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3979 | nan | 0.8549 | 0.9361 |
| BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | 0.3687 | nan | 0.846 | 0.9426 |
| BigBird | 1 | 1.0008 | 0.9547 | 0.4478 | nan | 0.8178 | 1.0885 |
| CamemBert | 1 | 0.9989 | 0.8143 | 0.416 | nan | 0.8061 | 0.9309 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9231 | 0.4336 | 0.7755 | 0.8055 | 0.9902 |
| DistillGPT2 | 1 | 0.9963 | 0.7984 | 0.4007 | 0.7468 | 0.7997 | 1.016 |
| YituTechConvBert | 1 | 0.9718 | 0.8664 | nan | nan | 0.7883 | 0.9276 |
| M2M100ForConditionalGeneration | 8 | 1.0018 | 0.9606 | 0.4511 | nan | 0.752 | 1.022 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | nan | 0.6446 | 0.6698 | 0.9454 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | nan | nan | 0.6085 | 0.8221 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9824 | 0.3622 | nan | 0.409 | 1.0674 |
| DebertaForQuestionAnswering | 8 | 0.9754 | 1.0737 | 0.3252 | nan | 0.3071 | 1.1931 |
| AllenaiLongformerBase | 1 | 0.9977 | 0.9476 | 0.3854 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

timm_models suite with amp precision

Performance speedup

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| tnt_s_patch16_224 | 128 | 0.9999 | 0.9985 | 0.0 | 0.0 | 2.1216 | 2.0933 |
| ghostnet_100 | 128 | 0.9996 | 0.9766 | 0.8592 | 0.0 | 1.9898 | 1.9442 |
| xcit_large_24_p8_224 | 5 | 1.0007 | 0.0 | 0.0 | 0.0 | 1.8726 | 1.7398 |
| lcnet_050 | 128 | 0.9486 | 0.9353 | 0.7879 | 1.1287 | 1.8461 | 1.7055 |
| twins_pcpvt_base | 64 | 1.0052 | 0.9113 | 0.8602 | 0.0 | 1.8172 | 1.7162 |
| regnety_002 | 128 | 0.98 | 0.9282 | 0.949 | 0.8588 | 1.6867 | 1.5467 |
| volo_d1_224 | 64 | 0.9998 | 0.9947 | 0.0 | 0.0 | 1.5985 | 1.5652 |
| dla102 | 128 | 1.0001 | 0.9961 | 0.8389 | 0.0 | 1.5825 | 1.5495 |
| nfnet_l0 | 128 | 1.0003 | 0.8083 | 0.7156 | 0.9127 | 1.5595 | 1.46 |
| swin_base_patch4_window7_224 | 64 | 0.9997 | 0.9465 | 0.0 | 0.0 | 1.5576 | 1.4959 |
| hrnet_w18 | 128 | 0.9999 | 0.9944 | 0.8377 | 0.0 | 1.5421 | 1.4615 |
| gmixer_24_224 | 128 | 1.0 | 0.843 | 0.0 | 0.0 | 1.5374 | 1.4813 |
| coat_lite_mini | 128 | 1.0 | 0.9769 | 0.8589 | 0.0 | 1.5285 | 1.5057 |
| resnest101e | 64 | 1.0 | 0.9923 | 0.8174 | 0.0 | 1.5281 | 1.4291 |
| cait_m36_384 | 4 | 1.0003 | 0.8463 | 0.0 | 0.0 | 1.5138 | 1.4137 |
| gluon_inception_v3 | 128 | 0.9998 | 0.9962 | 0.8466 | 1.1505 | 1.5043 | 1.4702 |
| gmlp_s16_224 | 128 | 0.9998 | 0.9435 | 0.0 | 0.5352 | 1.504 | 1.5173 |
| inception_v3 | 128 | 0.9998 | 0.993 | 0.8546 | 1.1511 | 1.5028 | 1.4681 |
| adv_inception_v3 | 128 | 0.9999 | 0.9963 | 0.854 | 1.1505 | 1.5004 | 1.4738 |
| dm_nfnet_f0 | 128 | 0.9991 | 0.9982 | 0.0 | 1.0879 | 1.4992 | 1.4291 |
| convnext_base | 64 | 0.9999 | 0.998 | 0.0 | 0.0 | 1.4924 | 1.4476 |
| res2net50_14w_8s | 128 | 1.0 | 0.9943 | 0.8116 | 1.1475 | 1.4584 | 1.4025 |
| crossvit_9_240 | 128 | 0.9998 | 0.9898 | 0.8391 | 0.0 | 1.4511 | 1.4209 |
| mobilenetv3_large_100 | 128 | 0.9555 | 0.9442 | 0.7837 | 1.1588 | 1.4491 | 1.4696 |
| selecsls42b | 128 | 0.9998 | 0.9956 | 0.8411 | 1.2946 | 1.4373 | 1.4121 |
| mnasnet_100 | 128 | 0.9522 | 0.9438 | 0.7895 | 1.2882 | 1.4363 | 1.459 |
| res2next50 | 128 | 0.9993 | 0.9966 | 0.8295 | 1.1641 | 1.416 | 1.3461 |
| mobilenetv2_100 | 128 | 0.9514 | 0.9369 | 0.722 | 0.0 | 1.404 | 1.4304 |
| fbnetv3_b | 128 | 0.9508 | 0.9403 | 0.7748 | 0.0 | 1.3916 | 1.4026 |
| jx_nest_base | 32 | 0.9996 | 0.9929 | 0.0 | 0.0 | 1.3911 | 1.354 |
| ese_vovnet19b_dw | 128 | 0.971 | 0.9653 | 0.7675 | 1.1943 | 1.3797 | 1.3785 |
| convit_base | 64 | 1.0 | 0.9967 | 0.0 | 0.0 | 1.3725 | 1.33 |
| spnasnet_100 | 128 | 0.9456 | 0.9386 | 0.779 | 0.0 | 1.372 | 1.3881 |
| mobilevit_s | 64 | 0.9728 | 0.8141 | 0.6558 | 0.0 | 1.3694 | 1.3574 |
| tf_efficientnet_b0 | 128 | 0.9658 | 0.8071 | 0.6674 | 0.0 | 1.3512 | 1.3537 |
| fbnetc_100 | 128 | 0.9524 | 0.943 | 0.7932 | 0.0 | 1.3505 | 1.3743 |
| cspdarknet53 | 64 | 0.9423 | 0.9345 | 0.757 | 1.2615 | 1.328 | 1.3454 |
| poolformer_m36 | 64 | 0.9999 | 0.9982 | 0.8083 | 0.0 | 1.326 | 1.2929 |
| pit_b_224 | 64 | 0.9996 | 0.9953 | 0.8212 | 0.6783 | 1.3221 | 1.3174 |
| botnet26t_256 | 128 | 0.9798 | 0.9709 | 0.8099 | 0.0 | 1.2829 | 1.303 |
| deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9916 | 0.7974 | 0.5964 | 1.2809 | 1.2638 |
| beit_base_patch16_224 | 64 | 1.0 | 0.9769 | 0.0 | 0.0 | 1.2776 | 1.2724 |
| eca_botnext26ts_256 | 128 | 0.9808 | 0.8099 | 0.6704 | 1.1619 | 1.2774 | 1.2729 |
| res2net101_26w_4s | 64 | 1.0034 | 1.04 | 0.7879 | 0.0 | 1.2763 | 1.242 |
| rexnet_100 | 128 | 0.9635 | 0.8503 | 0.6885 | 0.0 | 1.2705 | 1.277 |
| mixer_b16_224 | 128 | 1.0 | 0.9593 | 0.7779 | 0.0 | 1.2701 | 1.2697 |
| visformer_small | 128 | 0.9996 | 1.0017 | 0.8445 | 0.0 | 1.2449 | 1.1785 |
| tinynet_a | 128 | 0.9483 | 0.796 | 0.6473 | 0.0 | 1.2348 | 1.2305 |
| pnasnet5large | 16 | 0.9993 | 0.9921 | 0.788 | 0.0 | 1.218 | 1.1852 |
| sebotnet33ts_256 | 64 | 0.9661 | 0.8329 | 0.6782 | 1.0228 | 1.2093 | 1.2133 |
| vit_base_patch16_224 | 64 | 1.0 | 0.9939 | 0.8349 | 0.633 | 1.1904 | 1.1792 |
| tf_mixnet_l | 128 | 0.9804 | 0.9071 | 0.7964 | 0.0 | 1.1778 | 1.1704 |
| mixnet_l | 128 | 0.9792 | 0.9053 | 0.7924 | 0.0 | 1.1604 | 1.1538 |
| gluon_xception65 | 32 | 0.9996 | 0.9882 | 0.7547 | 0.0 | 1.1587 | 1.125 |
| swsl_resnext101_32x16d | 32 | 0.9994 | 0.9866 | 0.8227 | 0.0 | 1.1458 | 1.0693 |
| dpn107 | 32 | 0.9296 | 0.9121 | 0.7408 | 0.0 | 1.1401 | 1.1542 |
| repvgg_a2 | 128 | 0.9441 | 0.9346 | 0.7974 | 0.0 | 1.1225 | 1.1534 |
| resmlp_12_224 | 128 | 1.0001 | 1.0075 | 0.7889 | 1.0499 | 1.0967 | 1.0932 |
| gernet_l | 128 | 0.9478 | 0.9373 | 0.7703 | 1.073 | 1.0688 | 1.0775 |
| convmixer_768_32 | 32 | 1.0001 | 0.9982 | 0.9227 | 0.0 | 1.0556 | 1.0505 |
| eca_halonext26ts | 128 | 0.9808 | 0.8164 | 0.6787 | 0.0 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dla102 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| fbnetv3_b | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| gluon_xception65 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| twins_pcpvt_base | 64 | 2.743 | 15.0441 | 25.6793 | nan | 132.2865 | 137.596 |
| hrnet_w18 | 128 | 6.3667 | 30.2263 | 56.9074 | nan | 116.7186 | 112.6062 |
| swin_base_patch4_window7_224 | 64 | 2.9136 | 13.1702 | nan | nan | 92.8969 | 91.1806 |
| mobilevit_s | 64 | 1.9532 | 7.519 | 15.6073 | nan | 92.4091 | 89.1592 |
| xcit_large_24_p8_224 | 5 | 3.2612 | nan | nan | nan | 90.1612 | 86.755 |
| convnext_base | 64 | 1.6494 | 7.2007 | nan | nan | 89.2647 | 88.7272 |
| pnasnet5large | 16 | 4.9874 | 22.3595 | 41.8793 | nan | 85.1385 | 82.4486 |
| resnest101e | 64 | 3.4626 | 16.2941 | 27.0563 | nan | 79.5648 | 75.6414 |
| cait_m36_384 | 4 | 3.4467 | 19.8671 | nan | nan | 78.1543 | 73.7075 |
| coat_lite_mini | 128 | 1.2572 | 5.3598 | 8.6951 | nan | 73.1533 | 73.0098 |
| jx_nest_base | 32 | 1.7592 | 9.958 | nan | nan | 71.7608 | 70.205 |
| sebotnet33ts_256 | 64 | 1.6816 | 6.0061 | 14.139 | 189.4906 | 65.9922 | 64.4586 |
| res2net101_26w_4s | 64 | 3.2398 | 16.4235 | 27.9273 | nan | 60.8515 | 58.5659 |
| res2net50_14w_8s | 128 | 2.9405 | 14.4334 | 24.3333 | 434.4374 | 57.3301 | 52.1967 |
| eca_botnext26ts_256 | 128 | 1.4447 | 4.865 | 10.9569 | 151.1598 | 54.8788 | 54.576 |
| gmlp_s16_224 | 128 | 1.268 | 7.3811 | nan | 210.2524 | 51.4523 | 47.9334 |
| botnet26t_256 | 128 | 1.38 | 4.2881 | 9.3865 | nan | 49.6223 | 47.9889 |
| crossvit_9_240 | 128 | 1.6665 | 8.6169 | 13.4557 | nan | 48.3751 | 46.6416 |
| poolformer_m36 | 64 | 1.9188 | 8.1894 | 13.3737 | nan | 47.8117 | 46.3945 |
| volo_d1_224 | 64 | 1.6177 | 7.6012 | nan | nan | 44.9808 | 46.0485 |
| dpn107 | 32 | 4.1081 | 13.4752 | 39.4787 | nan | 43.2118 | 40.5377 |
| gluon_xception65 | 32 | 2.0746 | 10.9494 | 17.7833 | nan | 42.9164 | 41.7301 |
| tnt_s_patch16_224 | 128 | 1.7872 | 10.6334 | nan | nan | 42.2572 | 38.8524 |
| fbnetv3_b | 128 | 3.32 | 11.5083 | 27.7532 | nan | 40.6524 | 38.6252 |
| gmixer_24_224 | 128 | 1.4609 | 8.3317 | nan | nan | 38.9343 | 37.4864 |
| adv_inception_v3 | 128 | 1.7368 | 8.3295 | 13.2479 | 200.2897 | 38.3673 | 34.3582 |
| ghostnet_100 | 128 | 3.0792 | 9.6619 | 14.4849 | nan | 37.3584 | 34.6981 |
| swsl_resnext101_32x16d | 32 | 1.861 | 9.125 | 14.7024 | nan | 36.9721 | 34.7543 |
| gluon_inception_v3 | 128 | 1.6688 | 8.3728 | 13.4305 | 200.5074 | 36.6428 | 35.1381 |
| inception_v3 | 128 | 1.6454 | 8.5254 | 13.4077 | 206.128 | 36.5409 | 34.7246 |
| tf_mixnet_l | 128 | 5.7821 | 13.0548 | 26.7004 | nan | 36.4412 | 35.0 |
| dla102 | 128 | 1.8599 | 9.4706 | 15.1627 | nan | 35.341 | 32.9023 |
| mixnet_l | 128 | 5.4491 | 12.4298 | 26.8147 | nan | 35.0254 | 32.7682 |
| convit_base | 64 | 1.2618 | 6.0081 | nan | nan | 33.5754 | 34.5791 |
| res2next50 | 128 | 1.7331 | 8.2436 | 13.2345 | 267.447 | 32.3653 | 30.7758 |
| dm_nfnet_f0 | 128 | 2.2326 | 7.3143 | nan | 175.972 | 31.8961 | 30.0449 |
| rexnet_100 | 128 | 2.0658 | 7.3432 | 17.3054 | nan | 30.9167 | 28.4911 |
| tinynet_a | 128 | 2.1676 | 8.0224 | 19.766 | nan | 29.5338 | 28.6957 |
| visformer_small | 128 | 1.0269 | 4.0371 | 6.2482 | nan | 28.9299 | 26.1134 |
| mixer_b16_224 | 128 | 0.8269 | 3.8212 | 7.302 | nan | 28.1176 | 26.3667 |
| convmixer_768_32 | 32 | 1.2523 | 6.2561 | 9.9159 | nan | 27.5396 | 28.1933 |
| resmlp_12_224 | 128 | 0.7552 | 3.1515 | 6.726 | 65.314 | 27.1429 | 25.8542 |
| deit_base_distilled_patch16_224 | 64 | 0.9952 | 4.7035 | 7.3818 | 94.444 | 25.8825 | 24.5656 |
| vit_base_patch16_224 | 64 | 0.9923 | 4.6136 | 7.2218 | 87.6886 | 25.8625 | 27.2705 |
| cspdarknet53 | 64 | 2.3698 | 7.3456 | 18.9527 | 159.721 | 25.8263 | 24.4125 |
| tf_efficientnet_b0 | 128 | 1.8933 | 6.8279 | 16.3218 | nan | 25.7619 | 24.1697 |
| fbnetc_100 | 128 | 2.0929 | 6.6994 | 17.4004 | nan | 24.7874 | 22.8128 |
| spnasnet_100 | 128 | 2.1014 | 6.8206 | 16.976 | nan | 24.229 | 23.141 |
| mobilenetv3_large_100 | 128 | 1.6034 | 5.7746 | 13.1852 | 159.4335 | 23.6703 | 22.1723 |
| beit_base_patch16_224 | 64 | 1.2806 | 5.4403 | nan | nan | 23.599 | 21.8896 |
| nfnet_l0 | 128 | 1.9029 | 7.2672 | 10.9529 | 165.4112 | 23.4337 | 22.545 |
| mobilenetv2_100 | 128 | 1.6498 | 5.6189 | 12.985 | nan | 22.1313 | 19.82 |
| pit_b_224 | 64 | 1.1465 | 5.3476 | 8.6372 | 111.2699 | 21.2334 | 20.4384 |
| mnasnet_100 | 128 | 1.6802 | 5.2866 | 13.0574 | 127.9201 | 20.0666 | 19.0961 |
| regnety_002 | 128 | 1.7286 | 5.5831 | 13.5336 | 139.2075 | 20.0438 | 19.236 |
| gernet_l | 128 | 2.008 | 6.1939 | 15.1821 | 132.7212 | 20.0118 | 18.7783 |
| repvgg_a2 | 128 | 2.0251 | 6.1043 | 15.9808 | nan | 19.7133 | 18.9425 |
| selecsls42b | 128 | 0.8755 | 3.7633 | 5.8952 | 102.7094 | 18.463 | 16.9678 |
| lcnet_050 | 128 | 1.0746 | 3.4142 | 7.6873 | 90.5216 | 15.364 | 14.2918 |
| ese_vovnet19b_dw | 128 | 1.0727 | 3.0465 | 6.8435 | 68.9987 | 13.9774 | 13.3013 |
| eca_halonext26ts | 128 | 1.6107 | 5.1786 | 11.364 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| tinynet_a | 128 | 0.9889 | 0.7887 | 0.2764 | nan | 1.37 | 1.5056 |
| gmixer_24_224 | 128 | 0.9926 | 0.9248 | nan | nan | 1.3102 | 1.3732 |
| gmlp_s16_224 | 128 | 0.9938 | 0.9495 | nan | 0.9045 | 1.2842 | 1.2997 |
| tf_efficientnet_b0 | 128 | 0.9877 | 0.7695 | 0.2666 | nan | 1.1888 | 1.3559 |
| mobilevit_s | 64 | 0.993 | 0.7669 | 0.2733 | nan | 1.1832 | 1.3099 |
| pnasnet5large | 16 | 1.0567 | 0.9911 | 0.3632 | nan | 1.159 | 1.2896 |
| rexnet_100 | 128 | 0.9884 | 0.7848 | 0.2849 | nan | 1.147 | 1.3177 |
| eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2672 | 0.5588 | 1.1068 | 1.2643 |
| poolformer_m36 | 64 | 0.9983 | 0.9433 | 0.3413 | nan | 1.1018 | 1.1173 |
| tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0828 | 1.1492 |
| resnest101e | 64 | 0.995 | 0.9889 | 0.3474 | nan | 1.0595 | 1.1461 |
| mobilenetv2_100 | 128 | 0.9859 | 0.7635 | 0.3108 | nan | 1.0588 | 1.1524 |
| convit_base | 64 | 0.9966 | 0.8516 | nan | nan | 1.0528 | 1.1534 |
| volo_d1_224 | 64 | 0.9965 | 0.9475 | nan | nan | 1.0378 | 1.1389 |
| dm_nfnet_f0 | 128 | 0.9692 | 0.8981 | nan | 0.7871 | 1.0336 | 1.1292 |
| nfnet_l0 | 128 | 0.9887 | 0.8167 | 0.2678 | 0.524 | 1.0319 | 1.1803 |
| beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 |
| pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 0.7316 | 0.9907 | 1.2281 |
| fbnetv3_b | 128 | 0.9872 | 0.783 | 0.3151 | nan | 0.986 | 1.043 |
| convmixer_768_32 | 32 | 0.9972 | 0.9785 | 0.3447 | nan | 0.9759 | 0.9792 |
| twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9749 | 1.0803 |
| visformer_small | 128 | 0.9897 | 0.9255 | 0.3467 | nan | 0.9613 | 1.0514 |
| dla102 | 128 | 0.9684 | 0.9114 | 0.3366 | nan | 0.9555 | 1.0311 |
| ghostnet_100 | 128 | 0.9758 | 0.8691 | 0.337 | nan | 0.9485 | 1.0703 |
| tf_mixnet_l | 128 | 0.9907 | 0.8555 | 0.2874 | nan | 0.9364 | 1.0873 |
| xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9932 |
| mobilenetv3_large_100 | 128 | 0.9773 | 0.8402 | 0.3302 | 0.5782 | 0.9298 | 1.0259 |
| cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.929 | 0.9775 |
| ese_vovnet19b_dw | 128 | 0.9855 | 0.8559 | 0.3271 | 0.7015 | 0.9176 | 1.0681 |
| swsl_resnext101_32x16d | 32 | 0.9988 | 0.8771 | 0.3667 | nan | 0.9094 | 0.9794 |
| mixer_b16_224 | 128 | 0.992 | 0.9362 | 0.3444 | nan | 0.9073 | 0.9799 |
| dpn107 | 32 | 0.9981 | 0.9084 | 0.3529 | nan | 0.9063 | 0.996 |
| res2net101_26w_4s | 64 | 0.994 | 0.9149 | 0.3339 | nan | 0.8973 | 0.9734 |
| gluon_xception65 | 32 | 0.9955 | 0.8848 | 0.3346 | nan | 0.8967 | 0.9753 |
| inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | 0.6582 | 0.8967 | 1.0255 |
| adv_inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | 0.6582 | 0.8967 | 1.0257 |
| gluon_inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | 0.6582 | 0.8967 | 1.0255 |
| hrnet_w18 | 128 | 0.9914 | 0.9175 | 0.3348 | nan | 0.8966 | 1.0033 |
| fbnetc_100 | 128 | 0.9783 | 0.8475 | 0.33 | nan | 0.8954 | 0.9865 |
| selecsls42b | 128 | 0.9796 | 0.8773 | 0.3532 | 0.7509 | 0.8919 | 0.9903 |
| vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 0.8683 | 0.8916 | 0.8968 |
| deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 0.8697 | 0.8911 | 0.8962 |
| spnasnet_100 | 128 | 0.9789 | 0.8799 | 0.3346 | nan | 0.8798 | 0.9821 |
| res2net50_14w_8s | 128 | 0.9907 | 0.907 | 0.3231 | 0.7788 | 0.8764 | 0.9737 |
| convnext_base | 64 | 1.003 | 0.9263 | nan | nan | 0.8763 | 0.9864 |
| res2next50 | 128 | 0.991 | 0.9094 | 0.3201 | 0.7727 | 0.8709 | 0.9666 |
| mnasnet_100 | 128 | 0.976 | 0.8703 | 0.335 | 0.654 | 0.8707 | 0.98 |
| mixnet_l | 128 | 0.99 | 0.8442 | 0.2717 | nan | 0.8702 | 1.0089 |
| gernet_l | 128 | 0.9791 | 0.85 | 0.3444 | 0.7405 | 0.8616 | 0.9861 |
| cspdarknet53 | 64 | 0.9914 | 0.8402 | 0.324 | 0.6522 | 0.8604 | 1.0102 |
| botnet26t_256 | 128 | 0.9849 | 0.8639 | 0.3308 | nan | 0.8503 | 0.9434 |
| lcnet_050 | 128 | 0.943 | 0.7564 | 0.3359 | 0.676 | 0.8449 | 0.9433 |
| regnety_002 | 128 | 0.9509 | 0.7946 | 0.3398 | 0.5703 | 0.8352 | 1.0081 |
| crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | nan | 0.8174 | 1.0976 |
| coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | nan | 0.8032 | 1.0344 |
| repvgg_a2 | 128 | 0.9769 | 0.7822 | 0.3409 | nan | 0.7908 | 0.9914 |
| resmlp_12_224 | 128 | 0.9827 | 0.687 | 0.2373 | 0.6208 | 0.7876 | 0.8011 |
| swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 |
| sebotnet33ts_256 | 64 | 0.9929 | 0.7076 | 0.3212 | 0.577 | 0.7451 | 0.8294 |
| jx_nest_base | 32 | 0.9983 | 0.8927 | nan | nan | 0.6707 | 0.8618 |
| eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.2669 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/nCPH3TV.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/Y7as1MA.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/CYTq3H5.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
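The filtering step described above could look roughly like the sketch below. The helper names (`passing_models`, `filter_metrics`) are hypothetical and are not taken from the benchmark scripts. Treating `pass_due_to_skip` as a pass is an assumption; the status strings and sample numbers are copied from the inductor column of the float32 torchbench tables.

```python
# Minimal sketch (assumption: not the dashboard's actual code).
# Statuses counted as a pass; including "pass_due_to_skip" is an assumption.
PASSING = ("pass", "pass_due_to_skip")


def passing_models(accuracy):
    """Return the set of model names whose accuracy status counts as a pass."""
    return {name for name, status in accuracy.items() if status in PASSING}


def filter_metrics(metrics, accuracy):
    """Keep only the metric entries for models that passed the accuracy check."""
    keep = passing_models(accuracy)
    return {name: value for name, value in metrics.items() if name in keep}


# Sample inductor entries from the float32 torchbench tables.
accuracy = {
    "densenet121": "pass",
    "mobilenet_v2": "pass",
    "resnet50_quantized_qat": "fail_accuracy",
}
speedup = {
    "densenet121": 4.1047,
    "mobilenet_v2": 1.4234,
    "resnet50_quantized_qat": 1.3849,
}

filtered = filter_metrics(speedup, accuracy)
```

Here `resnet50_quantized_qat` is dropped before any aggregate is computed, even though it has a measured speedup.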

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 54/55 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 95%, 52/55 | 100%, 43/43 | 97%, 59/61  |
|     aot_cudagraphs     | 75%, 41/55 | 49%, 21/43  | 38%, 23/61  |
|    nvprims_nvfuser     | 71%, 39/55 |  16%, 7/43  | 48%, 29/61  |
|        inductor        | 87%, 48/55 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 93%, 51/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
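As a rough illustration of how a passrate cell such as `98%, 54/55` can be derived from raw counts, here is a small sketch. The function name `format_passrate` is hypothetical, and rounding to the nearest whole percent is an assumption that happens to match the values in the table above.

```python
def format_passrate(passed, total):
    """Format a pass count as 'NN%, passed/total', rounded to a whole percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{passed / total:.0%}, {passed}/{total}"


# Counts taken from the passrate table (torchbench eager, timm_models aot_cudagraphs).
eager_torchbench = format_passrate(54, 55)
aot_cudagraphs_timm = format_passrate(23, 61)
```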

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.05x    |    1.02x    |    1.00x    |
|    nvprims_nvfuser     |   1.03x    |    1.00x    |    1.14x    |
|        inductor        |   1.41x    |    1.30x    |    1.25x    |
| inductor_no_cudagraphs |   1.25x    |    1.23x    |    1.24x    |
+------------------------+------------+-------------+-------------+
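For readers who want to recompute the aggregate, the sketch below shows one way to normalize a timing against the native PyTorch baseline and take a geometric mean. The function names are hypothetical. Skipping non-positive entries (the `0.0` placeholders that failed runs show in the per-model tables) is an assumption, not something the dashboard states.

```python
import math


def speedup(baseline_seconds, backend_seconds):
    """Speedup of a backend relative to the baseline (values above 1 are faster)."""
    return baseline_seconds / backend_seconds


def geomean(values):
    """Geometric mean of the positive entries; NaN if there are none.

    Assumption: non-positive values mark failed runs and are skipped.
    """
    positive = [v for v in values if v > 0]
    if not positive:
        return float("nan")
    return math.exp(sum(math.log(v) for v in positive) / len(positive))


# A backend that halves the runtime on one model and matches it on another.
example = geomean([speedup(2.0, 1.0), speedup(3.0, 3.0)])
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown cancel out to 1x, which an arithmetic mean would not give.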

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.10    |    2.28     |    1.90     |
|       aot_eager        |    5.80    |    7.42     |    6.95     |
|     aot_cudagraphs     |    7.56    |    15.92    |    13.13    |
|    nvprims_nvfuser     |   75.23    |   133.60    |   150.37    |
|        inductor        |   28.79    |    29.22    |    34.02    |
| inductor_no_cudagraphs |   28.59    |    24.81    |    32.58    |
+------------------------+------------+-------------+-------------+
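The per-model latency tables report `nan` for runs that did not complete, so a mean over them has to decide what to do with those entries. The sketch below simply skips them; that choice, and the name `mean_ignoring_nan`, are assumptions rather than the dashboard's documented behavior.

```python
import math


def mean_ignoring_nan(values):
    """Arithmetic mean of the non-NaN entries; NaN if every entry is NaN."""
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        return float("nan")
    return sum(finite) / len(finite)


# Two completed runs and one failed run (hypothetical latencies in seconds).
latency = mean_ignoring_nan([1.0, float("nan"), 3.0])
```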

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.32x    |
|    nvprims_nvfuser     |   0.81x    |    0.83x    |    0.81x    |
|        inductor        |   0.84x    |    0.72x    |    0.98x    |
| inductor_no_cudagraphs |   0.99x    |    0.96x    |    1.09x    |
+------------------------+------------+-------------+-------------+
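The table does not spell out the formula, so the following is only one plausible reading of "compression ratio (higher is better)": native PyTorch peak memory divided by the backend's peak memory. Both the definition and the function name are assumptions.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Assumed definition: baseline peak memory divided by backend peak memory.

    Values above 1 mean the backend used less memory than the baseline.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend_peak_bytes must be positive")
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical peaks: the backend needs 8 units where the baseline needs 10.
saves_memory = compression_ratio(10, 8)
# Hypothetical peaks: the backend needs twice the baseline's memory.
uses_more = compression_ratio(4, 8)
```

Under this reading, a ratio such as 0.39x would mean the backend's peak footprint is roughly 2.5 times that of native PyTorch.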

Metrics over time

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/nhGH716.png)

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/gOFhjEF.png)

torchbench suite with float32 precision

Performance speedup

~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| densenet121 | 4 | 1.0069 | 0.9961 | 1.7393 | 0.721 | 4.1047 | 1.414 |
| timm_efficientdet | 1 | 0.9813 | 0.8827 | 0.0 | 0.0 | 3.7007 | 1.6748 |
| timm_vision_transformer | 8 | 1.0011 | 0.9713 | 1.6259 | 0.6747 | 2.5579 | 1.4174 |
| functorch_dp_cifar10 | 64 | 0.9965 | 1.0315 | 1.3256 | 0.0 | 2.5138 | 1.347 |
| BERT_pytorch | 16 | 1.0092 | 0.8989 | 0.0 | 1.0067 | 2.0983 | 2.0823 |
| drq | 1 | 1.0097 | 0.8454 | 1.2763 | 0.7192 | 1.8673 | 1.0853 |
| pytorch_struct | 200 | 0.9956 | 0.7529 | 0.8818 | 0.8195 | 1.8293 | 1.156 |
| lennard_jones | 1000 | 0.9553 | 0.8588 | 1.0444 | 0.6993 | 1.7753 | 0.9588 |
| mobilenet_v3_large | 32 | 1.0053 | 1.1245 | 0.8854 | 0.8676 | 1.7162 | 1.417 |
| hf_Albert | 8 | 1.0009 | 0.998 | 0.7529 | 0.0 | 1.6493 | 1.6401 |
| hf_T5_large | 2 | 1.0271 | 0.9079 | 0.0 | 0.0 | 1.6263 | 1.6883 |
| speech_transformer | 32 | 1.0032 | 0.9074 | 1.5096 | 0.0 | 1.5543 | 1.559 |
| timm_resnest | 32 | 0.9997 | 1.0022 | 0.8043 | 1.1743 | 1.5187 | 1.4535 |
| resnext50_32x4d | 8 | 0.9988 | 1.1658 | 0.9157 | 0.7667 | 1.5121 | 1.3468 |
| hf_GPT2 | 4 | 1.0059 | 0.9753 | 0.7387 | 0.4086 | 1.4953 | 1.4984 |
| timm_nfnet | 128 | 0.9996 | 1.0004 | 0.0 | 1.2384 | 1.4757 | 1.4232 |
| shufflenet_v2_x1_0 | 128 | 0.9995 | 1.0025 | 0.7663 | 0.9255 | 1.4551 | 1.4001 |
| mobilenet_v2_quantized_qat | 96 | 1.0005 | 0.9763 | 0.0 | 1.4423 | 1.4346 | 1.4367 |
| mobilenet_v2 | 96 | 1.0 | 0.9996 | 0.7312 | 1.2763 | 1.4234 | 1.4099 |
| fastNLP_Bert | 6 | 0.9984 | 0.9767 | 0.7526 | 0.8355 | 1.4206 | 1.3918 |
| soft_actor_critic | 256 | 1.0007 | 0.8069 | 1.0562 | 0.6881 | 1.4124 | 0.9293 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.997 | 1.0832 | 1.0023 | 0.0 | 1.412 | 1.332 |
| resnet50_quantized_qat | 32 | 1.0012 | 0.9749 | 0.0 | 1.2532 | 1.3849 | 1.3881 |
| resnet18 | 16 | 1.0038 | 1.1071 | 0.8926 | 0.9149 | 1.379 | 1.248 |
| mnasnet1_0 | 32 | 1.0009 | 1.0541 | 0.8032 | 0.9785 | 1.3653 | 1.2868 |
| squeezenet1_1 | 32 | 1.0034 | 1.0186 | 0.8088 | 0.8355 | 1.3296 | 1.292 |
| hf_Bart | 4 | 1.0126 | 0.9812 | 0.0 | 0.7755 | 1.2837 | 1.1976 |
| dcgan | 32 | 0.9819 | 1.0158 | 0.991 | 0.8046 | 1.2797 | 1.0614 |
| pytorch_stargan | 16 | 0.9989 | 1.0842 | 0.9294 | 0.0 | 1.2714 | 1.2498 |
| LearningToPaint | 96 | 0.9995 | 1.0009 | 0.8046 | 1.0092 | 1.2116 | 1.1754 |
| hf_Bert | 4 | 1.0263 | 0.9925 | 0.732 | 0.7394 | 1.2109 | 1.1789 |
| resnet50 | 32 | 0.9991 | 0.9931 | 0.7601 | 1.1132 | 1.2056 | 1.1693 |
| pytorch_unet | 1 | 0.9998 | 0.9977 | 0.8451 | 1.0964 | 1.1985 | 1.1864 |
| timm_efficientnet | 32 | 0.9513 | 0.8 | 0.6235 | 0.8682 | 1.1805 | 1.1733 |
| Super_SloMo | 6 | 0.9999 | 0.9978 | 0.8659 | 0.0 | 1.1803 | 1.1647 |
| hf_DistilBert | 8 | 1.0001 | 0.9567 | 0.687 | 0.4406 | 1.1775 | 1.1817 |
| vgg16 | 64 | 0.9999 | 0.9988 | 0.8583 | 0.9976 | 1.1728 | 1.1681 |
| alexnet | 128 | 0.9987 | 0.9992 | 0.8033 | 1.0022 | 1.1603 | 1.1634 |
| hf_Reformer | 4 | 0.998 | 1.0019 | 0.9881 | 0.0 | 1.131 | 1.1404 |
| timm_regnet | 32 | 0.9651 | 0.962 | 0.7846 | 1.0998 | 1.1285 | 1.0921 |
| Background_Matting | 4 | 1.0 | 1.0218 | 0.8617 | 1.0768 | 1.1178 | 1.1098 |
| yolov3 | 16 | 0.9994 | 0.9951 | 0.7903 | 0.0 | 1.0944 | 1.0801 |
| hf_BigBird | 2 | 0.9905 | 0.9437 | 0.9566 | 0.9137 | 1.0903 | 0.9983 |
| attention_is_all_you_need_pytorch | 256 | 0.9998 | 0.967 | 0.0 | 0.7122 | 1.0643 | 1.0498 |
| timm_vision_transformer_large | 8 | 0.999 | 0.9941 | 0.0 | 0.0 | 1.0495 | 1.0364 |
| tts_angular | 64 | 0.9847 | 0.9712 | 0.9975 | 0.9785 | 1.0071 | 1.0082 |
| demucs | 4 | 1.0 | 1.0002 | 0.9997 | 1.0001 | 0.9997 | 1.0003 |
| timm_vovnet | 32 | 0.9088 | 0.9039 | 0.7139 | 0.972 | 0.9865 | 1.0181 |
| dlrm | 2048 | 0.8411 | 0.0 | 0.0 | 0.0 | 0.9532 | 1.3069 |
| nvidia_deeprecommender | 256 | 0.9987 | 0.9632 | 0.5853 | 0.9599 | 0.9042 | 0.9635 |
| hf_GPT2_large | 4 | 1.0002 | 0.9818 | 0.0 | 0.0 | 0.0 | 1.4739 |
| hf_T5 | 8 | 1.0012 | 0.9578 | 0.0 | 0.9343 | 0.0 | 1.5681 |
| tacotron2 | 64 | 0.9734 | 0.8333 | 0.0 | 0.7296 | 0.0 | 0.8948 |
| hf_Longformer | 2 | 0.9631 | 0.8881 | 0.8166 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| yolov3 | 2 | pass | pass | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | fail_to_run | pass | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | pass | pass | pass |
| hf_Bart | 2 | pass | pass | fail_to_run | pass | pass | pass |
| speech_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass |
| hf_T5 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| hf_BigBird | 2 | pass | pass | pass | pass | pass | pass |
| hf_Bert | 2 | pass | pass | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass |
| hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 |
| resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy |
| mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3 | 16 | 2.7688 | 7.0092 | 9.8922 | nan | 367.6646 | 363.1789 |
| timm_efficientdet | 1 | 19.2967 | 33.4615 | nan | nan | 122.7199 | 121.1384 |
| hf_T5_large | 2 | 13.9951 | 34.3239 | nan | nan | 102.0016 | 99.4799 |
| timm_vision_transformer_large | 8 | 2.2722 | 11.1426 | nan | nan | 50.548 | 49.2194 |
| attention_is_all_you_need_pytorch | 256 | 1.109 | 5.6432 | nan | 158.9743 | 45.4254 | 44.5026 |
| densenet121 | 4 | 2.0438 | 9.6353 | 15.7762 | 193.9415 | 41.5248 | 40.6806 |
| timm_resnest | 32 | 0.5638 | 1.9977 | 2.9977 | 62.8601 | 39.6262 | 38.201 |
| hf_BigBird | 2 | 7.4013 | 12.5991 | 25.1969 | 102.4214 | 38.8514 | 24.7667 |
| timm_vision_transformer | 8 | 0.7614 | 3.4671 | 5.0543 | 87.9386 | 31.8094 | 30.7745 |
| hf_Bart | 4 | 1.3997 | 6.3184 | nan | 158.194 | 27.9396 | 26.7652 |
| timm_nfnet | 128 | 1.9876 | 6.1866 | nan | 165.9635 | 27.1415 | 26.6161 |
| BERT_pytorch | 16 | 1.453 | 5.8575 | nan | 148.1899 | 27.0238 | 26.085 |
| pytorch_stargan | 16 | 0.3654 | 1.6552 | 2.4823 | nan | 26.793 | 26.7937 |
| resnet50_quantized_qat | 32 | 1.0575 | 7.1773 | nan | 210.4359 | 26.1201 | 26.0372 |
| mobilenet_v2_quantized_qat | 96 | 1.2288 | 7.4929 | nan | 233.6475 | 25.8484 | 25.6991 |
| fastNLP_Bert | 6 | 1.4622 | 5.2565 | 8.6938 | 123.687 | 25.3516 | 24.2577 |
| speech_transformer | 32 | 1.5557 | 6.4963 | 45.7448 | nan | 25.1691 | 25.3366 |
| pytorch_struct | 200 | 0.2435 | 0.6241 | 1.1545 | 5.6949 | 23.0261 | 18.3601 |
| mobilenet_v3_large | 32 | 0.8265 | 3.9578 | 5.7024 | 106.5177 | 22.6389 | 21.8756 |
| timm_regnet | 32 | 2.1807 | 6.51 | 17.0661 | 129.1582 | 22.5659 | 21.9106 |
| timm_efficientnet | 32 | 1.6236 | 5.5284 | 13.6763 | 117.4943 | 22.5146 | 21.8356 |
| hf_Reformer | 4 | 1.5104 | 2.7198 | 5.4445 | nan | 19.0158 | 15.7253 |
| hf_Bert | 4 | 1.3222 | 5.1044 | 7.5147 | 129.1963 | 18.2255 | 17.5697 |
| mnasnet1_0 | 32 | 0.7852 | 3.6422 | 5.1815 | 77.894 | 17.5981 | 17.3646 |
| timm_vovnet | 32 | 1.4374 | 3.7248 | 8.8474 | 59.6136 | 17.5309 | 16.6725 |
| shufflenet_v2_x1_0 | 128 | 0.8755 | 4.0782 | 6.2147 | 100.4346 | 17.2823 | 16.4651 |
| resnet50 | 32 | 0.7981 | 3.7551 | 5.5494 | 90.0269 | 17.1302 | 16.8021 |
| hf_Albert | 8 | 1.0331 | 4.395 | 7.1887 | nan | 16.9665 | 16.1732 |
| resnext50_32x4d | 8 | 0.8239 | 3.6424 | 5.4722 | 76.0101 | 16.5327 | 16.3087 |
| mobilenet_v2 | 96 | 0.7444 | 3.6617 | 5.8714 | 110.946 | 16.4444 | 15.268 |
| hf_GPT2 | 4 | 1.2836 | 4.7852 | 7.5164 | 96.7642 | 16.3791 | 15.8793 |
| Super_SloMo | 6 | 0.9741 | 3.9079 | 5.6974 | nan | 16.1459 | 15.5762 |
| Background_Matting | 4 | 0.8414 | 3.7372 | 5.7298 | 71.1029 | 16.0439 | 15.1237 |
| functorch_dp_cifar10 | 64 | 0.3611 | 1.3536 | 2.0198 | nan | 12.2809 | 12.0318 |
| hf_DistilBert | 8 | 0.4479 | 2.4474 | 5.2389 | 53.1654 | 11.5151 | 11.1782 |
| resnet18 | 16 | 0.3464 | 1.5215 | 2.1233 | 31.821 | 10.4291 | 10.3554 |
| pytorch_unet | 1 | 0.4618 | 1.6115 | 2.42 | 31.8839 | 7.9606 | 7.4347 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3467 | 1.5391 | 2.2993 | nan | 7.8215 | 7.5372 |
| LearningToPaint | 96 | 0.4139 | 1.5133 | 2.3788 | 42.7004 | 6.6881 | 6.2962 |
| squeezenet1_1 | 32 | 0.217 | 0.6991 | 1.0399 | 4.4853 | 3.8145 | 3.5455 |
| drq | 1 | 0.1438 | 0.3449 | 0.5927 | 4.3999 | 3.5277 | 3.0508 |
| vgg16 | 64 | 0.1861 | 0.4641 | 0.8212 | 3.2043 | 3.3134 | 3.0875 |
| soft_actor_critic | 256 | 0.1978 | 0.295 | 0.4837 | 2.6168 | 3.22 | 2.5865 |
| dlrm | 2048 | 0.4396 | nan | nan | nan | 3.2067 | 2.7934 |
| nvidia_deeprecommender | 256 | 0.1899 | 0.3767 | 0.6305 | 7.0968 | 3.1709 | 2.8958 |
| alexnet | 128 | 0.1608 | 0.3305 | 0.5539 | 3.7211 | 2.8652 | 2.5603 |
| dcgan | 32 | 0.171 | 0.3611 | 0.5997 | 4.2351 | 2.5812 | 2.3373 |
| lennard_jones | 1000 | 0.1359 | 0.2434 | 0.3902 | 2.131 | 1.9026 | 1.6783 |
| tts_angular | 64 | 0.2069 | 0.25 | 0.3821 | 1.0185 | 1.8521 | 1.6359 |
| demucs | 4 | 0.3151 | 0.3064 | 0.3171 | 0.2991 | 0.2075 | 0.2136 |
| tacotron2 | 64 | 17.2755 | 28.9241 | nan | 63.0547 | nan | 63.0375 |
| hf_GPT2_large | 4 | 4.9969 | 16.4034 | nan | nan | nan | 40.8929 |
| hf_T5 | 8 | 2.2481 | 7.556 | nan | 94.6353 | nan | 26.2859 |
| hf_Longformer | 2 | 6.0028 | 12.9459 | 57.8984 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan |
nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9961 | 0.8279 | nan | 0.8271 | 1.5828 | 1.5828 | | resnet50_quantized_qat | 32 | 0.9971 | 0.9148 | nan | 0.8498 | 1.4864 | 1.4864 | | timm_efficientnet | 32 | 0.9932 | 0.7665 | 0.2636 | 0.771 | 1.3111 | 1.3958 | | Super_SloMo | 6 | 1.0024 | 0.9527 | 0.363 | nan | 1.2029 | 1.4002 | | mobilenet_v2 | 96 | 0.9923 | 0.7624 | 0.3061 | 0.7641 | 1.1741 | 1.2826 | | timm_efficientdet | 1 | 1.0104 | 0.8221 | nan | nan | 1.1174 | 1.1439 | | squeezenet1_1 | 32 | 0.9781 | 0.8163 | 0.3372 | 0.8132 | 1.0821 | 1.1897 | | speech_transformer | 32 | 0.9982 | 0.9159 | 0.2706 | nan | 1.0384 | 1.0419 | | timm_nfnet | 128 | 0.9358 | 0.8937 | nan | 0.879 | 1.0221 | 1.096 | | demucs | 4 | 0.9884 | 0.9884 | 0.9887 | 0.9884 | 0.9888 | 0.9884 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | Background_Matting | 4 | 0.9989 | 0.9482 | 0.3594 | 0.9327 | 0.9822 | 1.0383 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8142 | 0.9812 | 1.0425 | | hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3702 | 0.8845 | 0.9703 | 1.1374 | | timm_regnet | 32 | 0.9984 | 0.8586 | 0.3317 | 0.8055 | 0.9375 | 1.0799 | | yolov3 | 16 | 0.9893 | 0.8384 | 0.3319 | nan | 0.9175 | 1.098 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9976 | 0.9106 | 0.393 | nan | 0.9166 | 1.0148 | | pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8521 | 0.9118 | 1.105 | | pytorch_stargan | 16 | 0.9966 | 1.009 | 0.4109 | nan | 
0.9015 | 1.0694 | | timm_resnest | 32 | 0.9926 | 0.8759 | 0.3223 | 0.7295 | 0.8947 | 0.9967 | | hf_Albert | 8 | 0.9333 | 0.9333 | 0.2846 | nan | 0.8836 | 1.2215 | | mobilenet_v3_large | 32 | 0.9876 | 0.856 | 0.3277 | 0.7754 | 0.8832 | 0.8974 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8737 | 0.922 | | densenet121 | 4 | 1.0 | 0.8879 | 0.3462 | 0.8612 | 0.8624 | 1.006 | | timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 | | resnet50 | 32 | 0.9945 | 0.8704 | 0.3364 | 0.7952 | 0.8552 | 0.9335 | | mnasnet1_0 | 32 | 0.9878 | 0.8992 | 0.3333 | 0.8256 | 0.8532 | 0.8671 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 0.8669 | 0.8354 | 1.1229 | | hf_Bart | 4 | 0.9617 | 0.8777 | nan | 0.8325 | 0.8326 | 1.1284 | | resnext50_32x4d | 8 | 0.9961 | 0.8679 | 0.3584 | 0.8198 | 0.8278 | 0.8346 | | BERT_pytorch | 16 | 1.0 | 0.8995 | nan | 0.8503 | 0.826 | 1.0815 | | hf_BigBird | 2 | 0.9604 | 0.9604 | 0.43 | 0.9629 | 0.8211 | 1.0392 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.8875 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8756 | 0.7632 | 0.8778 | | timm_vovnet | 32 | 0.9946 | 0.7591 | 0.3201 | 0.7584 | 0.7591 | 0.9501 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3308 | 0.8714 | 0.7507 | 0.8214 | | soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9304 | 0.75 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7451 | 0.743 | 0.8335 | | hf_Bert | 4 | 0.9683 | 0.9011 | 0.3526 | 0.857 | 0.7061 | 1.0016 | | dlrm | 2048 | 0.7302 | nan | nan | nan | 0.7035 | 0.7307 | | LearningToPaint | 96 | 0.9455 | 0.6929 | 0.3399 | 0.627 | 0.6945 | 0.9371 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6949 | 0.6902 | 0.7049 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6638 | 0.6637 | 0.9553 | | hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3212 | 0.8674 | 0.6595 | 0.9466 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 
0.5598 | | hf_Reformer | 4 | 0.9872 | 0.9865 | 0.5793 | nan | 0.5232 | 0.9892 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | 0.7128 | 0.4867 | 0.6781 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.4335 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4056 | 0.4214 | | tacotron2 | 64 | 0.9906 | 1.0302 | nan | 0.7898 | nan | 1.1621 | | hf_T5 | 8 | 0.9527 | 0.9415 | nan | 0.8195 | nan | 1.1507 | | hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 | | hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2946 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~
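For anyone post-processing these tables, here is a minimal sketch of reading one `+---+`-bordered table into a list of dicts. The helper names `parse_table` and `to_float` are hypothetical and are not part of the benchmark scripts. The sample uses three rows and two backend columns copied from the compilation latency table above, and `nan` cells and non-numeric cells both map to NaN.

```python
import math


def parse_table(text):
    """Parse a '+---+' bordered ASCII table into a list of dicts keyed by header.

    Hypothetical helper for illustration; assumes the first '|' row is the header.
    """
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines, fences, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def to_float(cell):
    """Convert a cell to float; 'nan' and non-numeric cells become NaN."""
    try:
        return float(cell)
    except ValueError:
        return math.nan


# Subset of the torchbench compilation latency table (seconds).
sample = """
+--------+----+--------+----------+
| name | bs | eager | inductor |
+--------+----+--------+----------+
| yolov3 | 16 | 2.7688 | 367.6646 |
| demucs | 4 | 0.3151 | 0.2075 |
| moco | 0 | nan | nan |
+--------+----+--------+----------+
"""

for row in parse_table(sample):
    eager = to_float(row["eager"])
    inductor = to_float(row["inductor"])
    # Ratio of inductor to eager compile latency; NaN when either run is missing.
    print(row["name"], inductor / eager)
```

Keeping cells as strings until `to_float` is called lets the same parser handle the accuracy tables, where cells hold `pass`, `fail_to_run`, or `fail_accuracy`.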

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0273 | 0.9186 | 0.0 | 0.0 | 3.1616 | 1.4262 |
| MT5ForConditionalGeneration | 8 | 1.0235 | 0.9258 | 0.0 | 1.0147 | 2.6483 | 2.2481 |
| CamemBert | 1 | 1.0464 | 0.9433 | 1.2843 | 0.0 | 2.3854 | 1.4341 |
| DistillGPT2 | 1 | 1.0392 | 0.9328 | 1.1877 | 0.3103 | 2.1039 | 1.7709 |
| MobileBertForMaskedLM | 32 | 1.0232 | 0.9254 | 0.0 | 0.8091 | 2.0086 | 1.4843 |
| GoogleFnet | 1 | 0.9937 | 0.8142 | 0.9827 | 0.0 | 1.9074 | 1.1742 |
| GPT2ForSequenceClassification | 4 | 0.9998 | 0.9773 | 0.0 | 0.6259 | 1.7976 | 1.7816 |
| T5ForConditionalGeneration | 4 | 1.0041 | 0.9334 | 0.0 | 0.908 | 1.4576 | 1.4423 |
| ElectraForQuestionAnswering | 64 | 1.0004 | 0.9751 | 0.0 | 0.6302 | 1.4274 | 1.4067 |
| ElectraForCausalLM | 32 | 1.0013 | 0.9329 | 0.0 | 0.5016 | 1.414 | 1.4522 |
| MobileBertForQuestionAnswering | 64 | 1.0166 | 0.9419 | 0.0 | 0.6836 | 1.4073 | 1.3074 |
| M2M100ForConditionalGeneration | 8 | 1.0953 | 0.9403 | 0.8906 | 0.0 | 1.3749 | 1.3418 |
| LayoutLMForSequenceClassification | 16 | 1.0004 | 0.9899 | 0.7381 | 0.682 | 1.3106 | 1.2899 |
| T5Small | 1 | 1.0242 | 0.9324 | 0.0 | 0.0 | 1.2735 | 1.2424 |
| AlbertForQuestionAnswering | 4 | 0.9997 | 1.0008 | 0.0 | 0.0 | 1.2625 | 1.2598 |
| AlbertForMaskedLM | 4 | 0.9995 | 0.9994 | 0.0 | 0.0 | 1.2558 | 1.2544 |
| LayoutLMForMaskedLM | 16 | 1.0002 | 0.9712 | 0.0 | 0.6349 | 1.2038 | 1.2157 |
| PLBartForConditionalGeneration | 16 | 1.011 | 0.9727 | 0.0 | 0.7228 | 1.2035 | 1.205 |
| OPTForCausalLM | 32 | 1.003 | 0.935 | 0.0 | 0.3974 | 1.1842 | 1.2128 |
| XGLMForCausalLM | 8 | 1.012 | 0.9422 | 0.7419 | 0.3209 | 1.1705 | 1.1668 |
| DistilBertForQuestionAnswering | 64 | 0.9998 | 0.9858 | 0.7122 | 0.3918 | 1.1703 | 1.1516 |
| DebertaForMaskedLM | 4 | 0.9293 | 0.8121 | 0.7315 | 0.5959 | 1.1571 | 1.0988 |
| RobertaForCausalLM | 64 | 0.9993 | 0.964 | 0.7389 | 0.5506 | 1.1541 | 1.1602 |
| MegatronBertForQuestionAnswering | 16 | 1.0379 | 0.932 | 0.7563 | 0.0 | 1.1428 | 1.1559 |
| MegatronBertForCausalLM | 16 | 1.0339 | 1.0061 | 0.7457 | 0.0 | 1.1269 | 1.1243 |
| BertForQuestionAnswering | 128 | 1.0 | 0.9862 | 0.0 | 0.5642 | 1.1153 | 1.1074 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9937 | 0.0 | 0.5654 | 1.1141 | 1.1176 |
| Speech2Text2ForCausalLM | 128 | 0.9986 | 0.9268 | 0.6603 | 0.5442 | 1.1129 | 1.1484 |
| BartForCausalLM | 4 | 1.0008 | 0.9664 | 0.0 | 0.6745 | 1.0995 | 1.1111 |
| BigBird | 1 | 0.9925 | 0.9298 | 1.0009 | 0.0 | 1.0994 | 1.0086 |
| BartForConditionalGeneration | 2 | 1.0007 | 0.9886 | 0.0 | 0.3991 | 1.0961 | 1.0892 |
| MBartForConditionalGeneration | 16 | 1.01 | 0.9846 | 0.0 | 0.0 | 1.0949 | 1.0856 |
| PegasusForConditionalGeneration | 16 | 1.0111 | 0.9797 | 0.754 | 0.0 | 1.0883 | 1.0954 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0009 | 0.9396 | 0.0 | 0.6394 | 1.0646 | 1.0711 |
| BertForMaskedLM | 64 | 1.0001 | 0.9623 | 0.7312 | 0.5534 | 1.0606 | 1.0624 |
| DistilBertForMaskedLM | 64 | 0.9999 | 0.9524 | 0.7129 | 0.4263 | 1.0508 | 1.0695 |
| DebertaForQuestionAnswering | 8 | 0.9962 | 0.9934 | 0.6829 | 0.7999 | 1.0466 | 1.2334 |
| PLBartForCausalLM | 32 | 1.0057 | 0.9356 | 0.7288 | 0.7088 | 1.0306 | 1.0554 |
| TrOCRForCausalLM | 32 | 1.0008 | 0.9557 | 0.0 | 0.6626 | 1.0042 | 1.0154 |
| BlenderbotSmallForCausalLM | 64 | 1.0015 | 0.9101 | 0.6832 | 0.6603 | 1.0011 | 1.0369 |
| MBartForCausalLM | 32 | 1.0013 | 0.9553 | 0.0 | 0.0 | 0.9996 | 1.0101 |
| PegasusForCausalLM | 32 | 0.9993 | 0.953 | 0.7331 | 0.0 | 0.9909 | 1.003 |
| AllenaiLongformerBase | 1 | 0.9458 | 0.8589 | 0.7872 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 4.7086 | 10.0333 | 34.2455 | 87.5999 | 92.1527 | 33.6062 |
| DebertaForMaskedLM | 4 | 4.849 | 10.002 | 33.3758 | 91.7736 | 91.6206 | 33.2123 |
| XGLMForCausalLM | 8 | 2.3788 | 9.6357 | 35.7346 | 256.1148 | 66.8286 | 64.0204 |
| M2M100ForConditionalGeneration | 8 | 2.6368 | 11.8275 | 19.8845 | nan | 56.5971 | 51.7786 |
| MobileBertForMaskedLM | 32 | 8.0753 | 22.6523 | nan | 371.4544 | 48.4572 | 47.475 |
| MobileBertForQuestionAnswering | 64 | 8.1067 | 22.9882 | nan | 371.2092 | 48.0667 | 46.4089 |
| PegasusForConditionalGeneration | 16 | 2.6741 | 11.6737 | 20.0578 | nan | 42.0617 | 39.005 |
| BartForConditionalGeneration | 2 | 2.8944 | 12.3157 | nan | 323.9879 | 41.8569 | 40.5013 |
| MBartForConditionalGeneration | 16 | 2.8228 | 12.2121 | nan | nan | 40.9542 | 39.8883 |
| YituTechConvBert | 1 | 2.13 | 7.8814 | nan | nan | 38.2986 | 36.602 |
| BigBird | 1 | 7.3766 | 12.4279 | 26.2856 | nan | 37.178 | 24.285 |
| MegatronBertForCausalLM | 16 | 3.3266 | 10.1649 | 16.6766 | nan | 32.2663 | 31.1311 |
| MT5ForConditionalGeneration | 8 | 3.705 | 10.7058 | nan | 148.6673 | 31.3215 | 29.9935 |
| MegatronBertForQuestionAnswering | 16 | 2.9827 | 10.7253 | 16.7351 | nan | 31.2309 | 30.0711 |
| T5ForConditionalGeneration | 4 | 2.1895 | 7.2353 | nan | 97.6447 | 29.3612 | 27.8059 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.7111 | 8.1034 | nan | 209.4689 | 28.7124 | 27.5592 |
| T5Small | 1 | 2.2617 | 7.2523 | nan | nan | 27.9764 | 27.4661 |
| LayoutLMForSequenceClassification | 16 | 1.807 | 5.4418 | 8.6669 | 124.1254 | 26.4315 | 26.0938 |
| ElectraForCausalLM | 32 | 1.3611 | 5.247 | nan | 133.2186 | 25.6214 | 23.6269 |
| PLBartForConditionalGeneration | 16 | 1.4205 | 6.2564 | nan | 164.6091 | 25.4317 | 24.7709 |
| PegasusForCausalLM | 32 | 1.0146 | 4.593 | 7.6896 | nan | 20.9683 | 19.6471 |
| LayoutLMForMaskedLM | 16 | 1.6524 | 5.4995 | nan | 128.8886 | 20.2951 | 19.3529 |
| MBartForCausalLM | 32 | 1.0922 | 4.4676 | nan | nan | 20.2644 | 19.6143 |
| GoogleFnet | 1 | 0.7909 | 2.7041 | 8.8904 | nan | 20.2557 | 13.383 |
| BertForMaskedLM | 64 | 1.3189 | 5.1658 | 7.8498 | 128.6829 | 19.6608 | 18.9434 |
| ElectraForQuestionAnswering | 64 | 1.3447 | 5.1855 | nan | 128.5838 | 19.3358 | 18.5138 |
| BertForQuestionAnswering | 128 | 1.313 | 5.4231 | nan | 121.2353 | 18.9729 | 18.1248 |
| TrOCRForCausalLM | 32 | 0.9723 | 4.4424 | nan | 121.3514 | 18.8691 | 18.4661 |
| RobertaForCausalLM | 64 | 1.3548 | 5.1821 | 8.0622 | 126.9866 | 18.8105 | 18.9716 |
| BartForCausalLM | 4 | 0.9844 | 4.4942 | nan | 120.2172 | 18.5904 | 18.3724 |
| CamemBert | 1 | 1.3744 | 5.1649 | 7.6819 | nan | 18.255 | 17.3049 |
| RobertaForQuestionAnswering | 128 | 1.3839 | 5.4661 | nan | 128.2646 | 18.0367 | 18.0055 |
| OPTForCausalLM | 32 | 1.0473 | 4.608 | nan | 107.5595 | 16.6883 | 16.3425 |
| AlbertForQuestionAnswering | 4 | 1.1102 | 4.5023 | nan | nan | 16.2123 | 14.8481 |
| AlbertForMaskedLM | 4 | 1.1526 | 4.5502 | nan | nan | 16.0563 | 14.8712 |
| GPT2ForSequenceClassification | 4 | 1.3696 | 4.8441 | nan | 90.2676 | 15.9983 | 15.5484 |
| BlenderbotSmallForCausalLM | 64 | 0.6235 | 3.1879 | 4.8904 | 86.9734 | 14.2201 | 13.6717 |
| Speech2Text2ForCausalLM | 128 | 0.5824 | 2.3669 | 4.3407 | 58.5153 | 14.0809 | 12.6021 |
| PLBartForCausalLM | 32 | 0.5193 | 2.4175 | 3.8512 | 63.566 | 12.9574 | 12.8385 |
| DistillGPT2 | 1 | 0.6683 | 2.5025 | 3.6332 | 47.9 | 12.2063 | 11.7182 |
| DistilBertForMaskedLM | 64 | 0.4818 | 2.3956 | 5.3341 | 54.6281 | 11.3092 | 10.6517 |
| DistilBertForQuestionAnswering | 64 | 0.4748 | 2.4152 | 5.3039 | 54.7537 | 10.5179 | 10.0411 |
| AllenaiLongformerBase | 1 | 6.0492 | 12.7833 | 55.1035 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 0.8819 | 1.0595 | 1.1224 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | nan | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9029 | nan | nan | 0.8453 | 1.0606 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | nan | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | nan | 0.842 | 1.3737 |
| BigBird | 1 | 0.999 | 0.9542 | 0.4209 | nan | 0.8224 | 1.0095 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | 0.8577 | 0.8215 | 1.1049 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9267 | 0.3971 | 0.9267 | 0.8157 | 0.9642 |
| DistillGPT2 | 1 | 0.9984 | 0.8115 | 0.3771 | 0.7597 | 0.8063 | 0.926 |
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.7039 | 0.7929 | 0.9036 |
| YituTechConvBert | 1 | 0.9858 | 0.8581 | nan | nan | 0.7897 | 0.8729 |
| PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | nan | 0.7774 | 0.9692 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.8866 | 0.7734 | 0.9515 |
| GoogleFnet | 1 | 0.9983 | 0.9443 | 0.3715 | nan | 0.7698 | 0.9373 |
| MT5ForConditionalGeneration | 8 | 1.0034 | 0.8867 | nan | 0.8712 | 0.7627 | 0.9397 |
| M2M100ForConditionalGeneration | 8 | 1.0187 | 0.9611 | 0.383 | nan | 0.7616 | 1.0146 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | nan | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3612 | nan | 0.7487 | 0.9186 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.8428 | 0.7381 | 0.9055 |
| PLBartForConditionalGeneration | 16 | 1.0 | 0.8957 | nan | 0.8416 | 0.724 | 0.9375 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | nan | nan | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 0.8653 | 0.7189 | 1.0294 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | nan | 0.7161 | 0.9247 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | nan | 0.8553 | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.8217 | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 0.8762 | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3178 | 0.8865 | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 0.8551 | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.7756 | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | nan | nan | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | nan | 0.8587 | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.737 | 0.6775 | 0.9179 |
| OPTForCausalLM | 32 | 0.9982 | 0.8656 | nan | 0.7894 | 0.6761 | 0.8847 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.669 | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.7194 | 0.6385 | 0.8992 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3642 | 0.7649 | 0.6375 | 0.8974 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.863 | 0.6329 | 0.8939 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.863 | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | nan | 0.6092 | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | nan | 0.6675 | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3552 | 0.8282 | 0.3862 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.063 | 0.2902 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9981 | 0.9515 | 0.321 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
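To summarize a speedup column in a single number, one option is a geometric mean over the models that actually ran. The sketch below is not the dashboard's own aggregation. It assumes that a `0.0` speedup marks a model that failed to run on that backend, which matches the `fail_to_run` entries in the accuracy table but is not stated explicitly, and the function name `geomean_speedup` is hypothetical.

```python
import math


def geomean_speedup(values):
    """Geometric mean of the positive speedups in `values`.

    Assumption: 0.0 (and NaN) entries mark failed runs and are skipped,
    so the result describes only the models that ran on the backend.
    """
    valid = [v for v in values if v > 0]  # NaN > 0 is False, so NaN is dropped too
    if not valid:
        return math.nan
    return math.exp(sum(math.log(v) for v in valid) / len(valid))


# A few inductor speedups copied from the huggingface table above;
# the trailing 0.0 is AllenaiLongformerBase, which failed to run.
inductor = [3.1616, 2.6483, 2.3854, 0.9909, 0.0]
print(f"geomean over passing models: {geomean_speedup(inductor):.4f}")
```

Skipping failures makes a backend with many `0.0` rows look better than it is, so any such summary is best reported alongside the pass count.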

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9994 | 0.9741 | 0.8282 | 0.0 | 1.8674 | 1.8301 |
| lcnet_050 | 128 | 0.9563 | 0.9441 | 0.7694 | 1.4303 | 1.6653 | 1.695 |
| dm_nfnet_f0 | 128 | 0.9997 | 1.0006 | 0.0 | 0.0 | 1.4717 | 1.4248 |
| convnext_base | 64 | 1.0 | 0.9986 | 0.0 | 1.3661 | 1.4689 | 1.4695 |
| hrnet_w18 | 128 | 1.0 | 0.9997 | 0.0 | 0.0 | 1.4171 | 1.3781 |
| volo_d1_224 | 64 | 1.0 | 0.9935 | 0.0 | 0.0 | 1.3861 | 1.3651 |
| dla102 | 128 | 1.0 | 1.0009 | 0.0 | 0.0 | 1.3825 | 1.3693 |
| nfnet_l0 | 128 | 0.9997 | 0.7889 | 0.0 | 0.0 | 1.3767 | 1.3289 |
| res2net50_14w_8s | 128 | 0.9998 | 0.9999 | 0.0 | 1.2417 | 1.3566 | 1.3242 |
| xcit_large_24_p8_224 | 5 | 1.0002 | 0.9715 | 0.0 | 0.0 | 1.3518 | 1.3174 |
| mobilenetv3_large_100 | 128 | 0.9661 | 0.9612 | 0.7654 | 1.2977 | 1.3357 | 1.3494 |
| mobilenetv2_100 | 128 | 0.9663 | 0.9614 | 0.7075 | 1.2317 | 1.3283 | 1.3485 |
| inception_v3 | 128 | 0.9999 | 0.9978 | 0.0 | 1.1278 | 1.3277 | 1.3064 |
| gluon_inception_v3 | 128 | 1.0 | 0.9991 | 0.0 | 1.1281 | 1.3275 | 1.3073 |
| crossvit_9_240 | 128 | 0.9996 | 0.9987 | 0.0 | 0.0 | 1.3263 | 1.3027 |
| adv_inception_v3 | 128 | 1.0002 | 0.9978 | 0.0 | 1.125 | 1.3243 | 1.3088 |
| resnest101e | 64 | 0.9998 | 1.0038 | 0.0 | 0.0 | 1.3176 | 1.2721 |
| res2next50 | 128 | 1.0 | 1.0008 | 0.0 | 1.172 | 1.3111 | 1.2743 |
| regnety_002 | 128 | 0.9534 | 0.9423 | 0.7586 | 1.06 | 1.3034 | 1.3328 |
| fbnetv3_b | 128 | 0.9648 | 0.9614 | 0.7567 | 0.0 | 1.2845 | 1.2954 |
| coat_lite_mini | 128 | 1.0001 | 0.9856 | 0.8526 | 0.4087 | 1.2773 | 1.2789 |
| jx_nest_base | 32 | 0.9994 | 0.992 | 0.0 | 0.0 | 1.2758 | 1.2516 |
| eca_botnext26ts_256 | 128 | 0.9873 | 0.7719 | 0.0 | 0.0 | 1.2681 | 1.2539 |
| mnasnet_100 | 128 | 0.9651 | 0.9603 | 0.7856 | 1.2527 | 1.2655 | 1.2824 |
| sebotnet33ts_256 | 64 | 0.9755 | 0.8018 | 0.0 | 0.0 | 1.2653 | 1.2692 |
| selecsls42b | 128 | 0.9999 | 1.0002 | 0.8152 | 1.2129 | 1.2653 | 1.2521 |
| tf_efficientnet_b0 | 128 | 0.9758 | 0.7842 | 0.0 | 1.2159 | 1.2622 | 1.2681 |
| eca_halonext26ts | 128 | 0.9876 | 0.7786 | 0.0 | 0.0 | 1.2593 | 1.2444 |
| gmixer_24_224 | 128 | 0.9999 | 0.8105 | 0.0 | 0.0 | 1.2577 | 1.2298 |
| botnet26t_256 | 128 | 0.986 | 0.9813 | 0.788 | 0.0 | 1.2534 | 1.2532 |
| fbnetc_100 | 128 | 0.9665 | 0.963 | 0.7788 | 1.2452 | 1.2484 | 1.266 |
| ese_vovnet19b_dw | 128 | 0.9796 | 0.9775 | 0.745 | 1.1607 | 1.2428 | 1.2452 |
| res2net101_26w_4s | 64 | 0.9999 | 0.9999 | 0.7742 | 0.0 | 1.2292 | 1.189 |
| cspdarknet53 | 64 | 0.9572 | 0.953 | 0.7355 | 1.196 | 1.2268 | 1.2375 |
| spnasnet_100 | 128 | 0.9622 | 0.9578 | 0.7749 | 1.2164 | 1.2169 | 1.2554 |
| rexnet_100 | 128 | 0.973 | 0.8146 | 0.0 | 1.1852 | 1.2133 | 1.2203 |
| pnasnet5large | 16 | 0.9997 | 0.9989 | 0.0 | 1.0649 | 1.2108 | 1.193 |
| convit_base | 64 | 0.9998 | 0.999 | 0.0 | 0.8567 | 1.2047 | 1.2295 |
| twins_pcpvt_base | 64 | 0.9999 | 0.9988 | 0.7529 | 0.0 | 1.203 | 1.17 |
| gmlp_s16_224 | 128 | 0.9999 | 0.9505 | 0.0 | 0.0 | 1.2029 | 1.1867 |
| tinynet_a | 128 | 0.9662 | 0.7763 | 0.6213 | 1.2103 | 1.1903 | 1.2 |
| dpn107 | 32 | 0.957 | 0.9502 | 0.7811 | 0.0 | 1.1889 | 1.2008 |
| pit_b_224 | 64 | 0.9998 | 0.9996 | 0.0 | 0.5952 | 1.1864 | 1.1784 |
| cait_m36_384 | 4 | 1.0 | 1.0269 | 0.0 | 0.0 | 1.1805 | 1.1577 |
| mobilevit_s | 64 | 0.9789 | 0.7624 | 0.0 | 0.0 | 1.1707 | 1.1673 |
| tf_mixnet_l | 128 | 0.9855 | 0.8899 | 0.0 | 0.0 | 1.1699 | 1.1669 |
| repvgg_a2 | 128 | 0.9656 | 0.9623 | 0.8278 | 1.1219 | 1.1697 | 1.168 |
| poolformer_m36 | 64 | 1.0 | 0.9998 | 0.0 | 0.0 | 1.1668 | 1.1485 |
| mixnet_l | 128 | 0.9849 | 0.8864 | 0.0 | 0.0 | 1.151 | 1.1484 |
| swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9781 | 0.0 | 0.0 | 1.146 | 1.1253 |
| beit_base_patch16_224 | 64 | 0.9998 | 0.9812 | 0.0 | 0.5267 | 1.1131 | 1.1025 |
| swsl_resnext101_32x16d | 32 | 0.9999 | 1.0008 | 0.0 | 0.0 | 1.1098 | 1.0714 |
| deit_base_distilled_patch16_224 | 64 | 1.0001 | 0.9991 | 0.7674 | 0.5693 | 1.0966 | 1.0835 |
| vit_base_patch16_224 | 64 | 0.9999 | 0.9992 | 0.7673 | 0.5559 | 1.0888 | 1.0747 |
| gluon_xception65 | 32 | 1.0 | 0.9978 | 0.0 | 0.0 | 1.0867 | 1.076 |
| convmixer_768_32 | 32 | 0.9999 | 1.0001 | 0.0 | 0.0 | 1.0777 | 1.0746 |
| gernet_l | 128 | 0.9747 | 0.972 | 0.8212 | 1.0946 | 1.0768 | 1.0705 |
| mixer_b16_224 | 128 | 1.0004 | 0.9788 | 0.0 | 0.0 | 1.0671 | 1.0605 |
| visformer_small | 128 | 0.9996 | 1.003 | 0.7984 | 0.0 | 1.0477 | 1.0136 |
| resmlp_12_224 | 128 | 0.9998 | 0.8556 | 0.6127 | 0.8393 | 0.8129 | 0.8045 |
| tnt_s_patch16_224 | 128 | 0.9999 | 0.9996 | 0.0 | 0.0 | 0.0 | 1.5441 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| convit_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass |
| gluon_xception65 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| crossvit_9_240 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
~~~
Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 5.5888 | 24.2437 | nan | nan | 97.9463 | 93.2078 |
| swin_base_patch4_window7_224 | 64 | 2.5169 | 10.3488 | nan | nan | 74.6863 | 72.1701 |
| mobilevit_s | 64 | 1.6383 | 5.9437 | nan | nan | 71.8587 | 70.1179 |
| xcit_large_24_p8_224 | 5 | 2.6873 | 13.5625 | nan | nan | 70.9424 | 67.5898 |
| pnasnet5large | 16 | 4.3753 | 17.593 | nan | 505.4688 | 70.5716 | 66.4458 |
| twins_pcpvt_base | 64 | 2.1192 | 10.1987 | 19.2279 | nan | 61.8783 | 59.7553 |
| cait_m36_384 | 4 | 2.8008 | 14.232 | nan | nan | 59.399 | 56.5587 |
| convnext_base | 64 | 1.2461 | 5.1566 | nan | 174.7346 | 59.2713 | 56.7413 |
| resnest101e | 64 | 2.9184 | 12.6038 | nan | nan | 55.344 | 53.1662 |
| jx_nest_base | 32 | 1.6276 | 7.7104 | nan | nan | 52.6264 | 50.5839 |
| res2net101_26w_4s | 64 | 2.881 | 13.2532 | 22.8044 | nan | 51.1051 | 47.3424 |
| coat_lite_mini | 128 | 1.0791 | 4.21 | 6.7933 | 115.9784 | 47.7771 | 46.7015 |
| res2net50_14w_8s | 128 | 2.6048 | 11.9366 | nan | 359.7148 | 46.7425 | 43.3897 |
| sebotnet33ts_256 | 64 | 1.5807 | 5.258 | nan | nan | 46.6735 | 44.7158 |
| eca_halonext26ts | 128 | 1.439 | 4.5113 | nan | nan | 46.6293 | 45.2509 |
| poolformer_m36 | 64 | 1.8069 | 7.064 | nan | nan | 43.8297 | 42.6002 |
| eca_botnext26ts_256 | 128 | 1.3717 | 4.3087 | nan | nan | 38.4344 | 36.8811 |
| gmlp_s16_224 | 128 | 0.9565 | 5.4137 | nan | nan | 38.0285 | 36.8012 |
| dpn107 | 32 | 3.8383 | 11.4405 | 35.6197 | nan | 36.5966 | 35.2177 |
| crossvit_9_240 | 128 | 1.3786 | 6.3785 | nan | nan | 35.7154 | 33.7752 |
| botnet26t_256 | 128 | 1.3378 | 3.7267 | 8.3618 | nan | 35.358 | 35.0375 |
| fbnetv3_b | 128 |
3.0262 | 9.477 | 25.2709 | nan | 34.418 | 32.5671 | | gluon_xception65 | 32 | 1.7744 | 8.9125 | nan | nan | 34.063 | 31.8574 | | volo_d1_224 | 64 | 1.2919 | 6.3509 | nan | nan | 34.0205 | 32.5635 | | adv_inception_v3 | 128 | 1.5122 | 6.75 | nan | 173.9284 | 32.6247 | 30.0091 | | gluon_inception_v3 | 128 | 1.4979 | 6.8521 | nan | 166.8407 | 31.603 | 30.0093 | | inception_v3 | 128 | 1.4834 | 6.7508 | nan | 172.6174 | 31.569 | 29.8776 | | ghostnet_100 | 128 | 2.796 | 8.0855 | 11.8454 | nan | 31.1622 | 29.1912 | | tf_mixnet_l | 128 | 5.6246 | 11.1281 | nan | nan | 30.7412 | 28.9674 | | mixnet_l | 128 | 5.3248 | 10.7982 | nan | nan | 29.7912 | 27.4381 | | dla102 | 128 | 1.6913 | 7.5283 | nan | nan | 29.4475 | 27.2233 | | gmixer_24_224 | 128 | 1.0388 | 5.8165 | nan | nan | 29.1547 | 28.1432 | | swsl_resnext101_32x16d | 32 | 1.7439 | 7.3381 | nan | nan | 28.9308 | 26.8624 | | dm_nfnet_f0 | 128 | 2.0773 | 6.1089 | nan | nan | 28.4559 | 26.9281 | | res2next50 | 128 | 1.5884 | 6.6926 | nan | 190.4226 | 27.1242 | 25.2755 | | convit_base | 64 | 1.0461 | 4.6199 | nan | 144.9589 | 26.8184 | 25.9175 | | rexnet_100 | 128 | 1.807 | 6.2274 | nan | 166.7626 | 25.3587 | 23.8891 | | tinynet_a | 128 | 1.9749 | 6.5257 | 17.2876 | 157.1425 | 24.768 | 23.1586 | | cspdarknet53 | 64 | 2.2072 | 6.3417 | 16.9175 | 130.3485 | 21.7161 | 20.8708 | | tf_efficientnet_b0 | 128 | 1.7664 | 5.605 | nan | 136.3086 | 21.5803 | 20.5332 | | resmlp_12_224 | 128 | 0.621 | 2.406 | 5.1282 | 56.116 | 21.2473 | 20.9999 | | fbnetc_100 | 128 | 1.9584 | 5.5214 | 15.973 | 125.3762 | 21.1351 | 19.4896 | | mixer_b16_224 | 128 | 0.6622 | 2.7051 | nan | nan | 20.8883 | 19.9599 | | convmixer_768_32 | 32 | 1.112 | 4.844 | nan | nan | 20.8864 | 19.7493 | | visformer_small | 128 | 0.9209 | 3.409 | 5.3386 | nan | 20.8505 | 20.0217 | | spnasnet_100 | 128 | 1.9007 | 5.3713 | 14.8549 | 120.154 | 20.7479 | 18.9316 | | nfnet_l0 | 128 | 1.7752 | 6.0854 | nan | nan | 20.4908 | 19.1256 | | mobilenetv3_large_100 | 128 | 1.5272 | 
4.8062 | 11.6397 | 131.032 | 19.387 | 17.854 | | beit_base_patch16_224 | 64 | 1.1577 | 4.2495 | nan | 120.6803 | 19.3673 | 18.2494 | | mobilenetv2_100 | 128 | 1.5693 | 4.7451 | 11.6494 | 113.0248 | 18.7423 | 17.6742 | | deit_base_distilled_patch16_224 | 64 | 0.8498 | 3.6265 | 5.8705 | 94.0323 | 18.5081 | 17.8015 | | vit_base_patch16_224 | 64 | 0.8149 | 3.4287 | 5.818 | 101.14 | 18.1878 | 17.5384 | | repvgg_a2 | 128 | 1.92 | 5.137 | 13.8343 | 267.2763 | 17.6888 | 16.4886 | | pit_b_224 | 64 | 0.9866 | 3.863 | nan | 122.7453 | 17.5858 | 16.2644 | | mnasnet_100 | 128 | 1.5617 | 4.524 | 11.7846 | 96.0607 | 17.5569 | 16.2077 | | regnety_002 | 128 | 1.5674 | 4.5314 | 11.297 | 102.8675 | 17.495 | 16.5119 | | gernet_l | 128 | 1.9039 | 5.0169 | 14.1071 | 105.4468 | 16.9565 | 16.0673 | | selecsls42b | 128 | 0.8021 | 2.981 | 5.0317 | 80.5321 | 15.7906 | 14.7209 | | lcnet_050 | 128 | 0.9941 | 2.8945 | 6.5183 | 75.022 | 12.6321 | 11.8098 | | ese_vovnet19b_dw | 128 | 0.985 | 2.6861 | 5.8998 | 53.8994 | 12.1823 | 11.402 | | tnt_s_patch16_224 | 128 | 1.5405 | 8.4167 | nan | nan | nan | 30.7468 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | nan | 1.5552 | 1.6267 | | tinynet_a | 128 | 0.9943 | 0.7798 | 0.2617 | 0.7835 | 1.3515 | 1.5856 | | nfnet_l0 | 128 | 0.993 | 0.8275 | nan | nan | 1.2906 | 1.4934 | | rexnet_100 | 128 | 0.9938 | 0.7848 | nan | 0.7887 | 1.2631 | 1.4763 | | tf_efficientnet_b0 | 128 | 0.9936 | 0.7689 | nan | 0.7729 | 1.206 | 
1.3823 | | mobilevit_s | 64 | 0.9964 | 0.7671 | nan | nan | 1.1799 | 1.3596 | | mobilenetv2_100 | 128 | 0.9923 | 0.7619 | 0.3063 | 0.7644 | 1.1747 | 1.2829 | | pnasnet5large | 16 | 1.0657 | 1.0089 | nan | 0.961 | 1.1724 | 1.3351 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7674 | nan | nan | 1.1377 | 1.3608 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1375 | 1.3403 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1185 | 1.1746 | | poolformer_m36 | 64 | 0.9983 | 0.9509 | nan | nan | 1.0522 | 1.0698 | | dm_nfnet_f0 | 128 | 0.9357 | 0.894 | nan | nan | 1.0221 | 1.0963 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.9253 | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | nan | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 0.9131 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.993 | 0.7828 | 0.3098 | nan | 0.9932 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 0.9157 | 0.9925 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | nan | 0.9923 | 1.0857 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9847 | 0.9968 | | ghostnet_100 | 128 | 0.9866 | 0.8768 | 0.3271 | nan | 0.9842 | 1.1252 | | volo_d1_224 | 64 | 0.996 | 0.9213 | nan | nan | 0.9837 | 1.0658 | | mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | nan | 0.9827 | 1.0538 | | tf_mixnet_l | 128 | 0.9955 | 0.8572 | nan | nan | 0.9767 | 1.1453 | | gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | nan | 0.9766 | 0.9827 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | nan | nan | 0.9633 | 1.0573 | | dla102 | 128 | 0.9828 | 0.9169 | nan | nan | 0.9625 | 1.0421 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8868 | 0.3259 | 0.8551 | 0.951 | 1.0926 | | gluon_xception65 | 32 | 0.9975 | 0.9358 | nan | nan | 0.9412 | 0.9929 | | mobilenetv3_large_100 | 128 | 0.9874 | 0.8592 | 0.3245 | 0.7755 | 0.941 | 1.0413 | | hrnet_w18 | 128 | 0.9955 | 0.9252 | nan | nan | 0.9382 | 1.0121 | | spnasnet_100 | 128 | 0.9885 | 0.9103 | 0.3308 | 
0.8383 | 0.9379 | 0.9927 | | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | nan | 0.9348 | 1.0604 | | mnasnet_100 | 128 | 0.9877 | 0.9022 | 0.3305 | 0.8252 | 0.9325 | 0.9921 | | res2net101_26w_4s | 64 | 0.9967 | 0.9278 | 0.3243 | nan | 0.9301 | 1.0167 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7725 | 0.9154 | 0.9655 | | cspdarknet53 | 64 | 0.9954 | 0.8613 | 0.3159 | 0.8261 | 0.9148 | 1.0666 | | inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.914 | 1.063 | | gluon_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.914 | 1.063 | | adv_inception_v3 | 128 | 0.99 | 0.8616 | nan | 0.8238 | 0.9138 | 1.063 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.867 | 0.9126 | 0.9981 | | res2next50 | 128 | 0.9955 | 0.9149 | nan | 0.8461 | 0.9075 | 1.0161 | | mixnet_l | 128 | 0.995 | 0.845 | nan | nan | 0.9069 | 1.0619 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9068 | 1.0515 | | fbnetc_100 | 128 | 0.989 | 0.8518 | 0.3236 | 0.7416 | 0.9049 | 0.9971 | | dpn107 | 32 | 0.9986 | 0.9268 | 0.3389 | nan | 0.9047 | 0.9908 | | visformer_small | 128 | 0.9944 | 0.9374 | 0.3291 | nan | 0.9029 | 0.9934 | | selecsls42b | 128 | 0.9885 | 0.8897 | 0.337 | 0.8775 | 0.8987 | 1.0049 | | swsl_resnext101_32x16d | 32 | 0.9992 | 0.8965 | nan | nan | 0.8912 | 0.9925 | | res2net50_14w_8s | 128 | 0.995 | 0.9047 | nan | 0.8422 | 0.8821 | 1.0211 | | regnety_002 | 128 | 0.9718 | 0.8105 | 0.3284 | 0.7203 | 0.8619 | 1.0399 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9624 | | pit_b_224 | 64 | 0.9968 | 0.7946 | nan | 0.7502 | 0.8563 | 1.0753 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.841 | 0.9709 | | coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | 0.7251 | 0.821 | 1.0246 | | gernet_l | 128 | 0.9884 | 0.7891 | 0.32 | 0.7891 | 0.7928 | 0.9932 | | resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | 0.6276 | 0.7899 | 0.7979 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6551 | 0.7684 | 0.9903 | | convit_base | 64 | 0.9977 | 0.8838 
| nan | 0.8573 | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8656 | nan | nan | 0.6584 | 0.8853 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8622 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/rDgOT6i.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/u7zlkMw.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/kDPVHWV.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When computing speedup, compilation latency and memory footprint reduction, we exclude the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 52/53 | 100%, 42/42 | 100%, 61/61 |
|       aot_eager        | 98%, 52/53 | 100%, 42/42 | 95%, 58/61  |
|     aot_cudagraphs     | 77%, 41/53 | 60%, 25/42  | 79%, 48/61  |
|    nvprims_nvfuser     | 51%, 27/53 |  12%, 5/42  | 33%, 20/61  |
|        inductor        | 87%, 46/53 | 93%, 39/42  | 93%, 57/61  |
| inductor_no_cudagraphs | 91%, 48/53 | 93%, 39/42  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
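
As a rough illustration of how a cell such as `98%, 52/53` could be derived from the per-model accuracy statuses, here is a minimal sketch. The function name `passrate` and the `PASSING` set are hypothetical, and counting `pass_due_to_skip` as a pass is an assumption, not something the dashboard confirms.

```python
# Assumption: status strings mirror those in the accuracy tables, and
# "pass_due_to_skip" is counted as a pass (not confirmed by the dashboard).
PASSING = {"pass", "pass_due_to_skip"}


def passrate(statuses):
    """Format the pass rate of one compiler on one suite as 'NN%, passed/total'."""
    total = len(statuses)
    if total == 0:
        return "n/a"
    passed = sum(1 for status in statuses if status in PASSING)
    return f"{100 * passed / total:.0f}%, {passed}/{total}"


# Hypothetical input: 52 passing models and one that fails to run.
example = ["pass"] * 52 + ["fail_to_run"]
print(passrate(example))
```

Anything outside `PASSING` (for example `fail_to_run` or `fail_accuracy`) lowers the rate.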

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.11x    |    1.07x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.00x    |    1.11x    |
|        inductor        |   1.69x    |    1.80x    |    1.40x    |
| inductor_no_cudagraphs |   1.39x    |    1.55x    |    1.37x    |
+------------------------+------------+-------------+-------------+
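
For readers who want to reproduce the aggregation, the sketch below shows one way to compute a per-model speedup and its geometric mean. The helper names are hypothetical. It assumes speedup is eager latency divided by backend latency, and that a `0.0` entry in the per-model tables marks a failed run that is left out of the mean.

```python
import math


def speedup(eager_latency, backend_latency):
    """Speedup of a backend relative to native PyTorch (values above 1.0 are faster)."""
    return eager_latency / backend_latency


def geomean_speedup(speedups):
    """Geometric mean over models, skipping failed runs.

    Assumption: failed models are recorded as 0.0 (or nan) and are excluded,
    in line with removing models that fail accuracy checks.
    """
    valid = [s for s in speedups if s > 0 and not math.isnan(s)]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Hypothetical per-model speedups; the 0.0 entry stands for a failed model.
print(f"{geomean_speedup([2.0, 8.0, 0.0]):.2f}x")
```

The geometric mean is used here instead of the arithmetic mean so that one model with a very large speedup does not dominate the summary.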

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.32    |    2.74     |    2.15     |
|       aot_eager        |    6.85    |    10.15    |    8.44     |
|     aot_cudagraphs     |   10.30    |    18.47    |    16.72    |
|    nvprims_nvfuser     |   67.21    |   130.37    |   159.28    |
|        inductor        |   30.35    |    34.63    |    40.31    |
| inductor_no_cudagraphs |   30.84    |    29.62    |    38.38    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.42x    |    0.39x    |    0.32x    |
|    nvprims_nvfuser     |   0.75x    |    0.79x    |    0.70x    |
|        inductor        |   0.84x    |    0.88x    |    0.95x    |
| inductor_no_cudagraphs |   0.97x    |    1.05x    |    1.06x    |
+------------------------+------------+-------------+-------------+
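
The dashboard does not spell out how the compression ratio is computed. A plausible reading, consistent with "higher is better", is eager peak memory divided by the backend's peak memory. The sketch below encodes that assumption; the function name and the numbers are hypothetical.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Peak memory compression ratio of a backend relative to eager.

    Assumption: ratio = eager peak / backend peak, so a value above 1.0
    means the backend uses less peak memory than native PyTorch.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical measurements in GB: one backend saves memory, one uses more.
print(f"{compression_ratio(10.0, 8.0):.2f}x")   # backend uses less memory
print(f"{compression_ratio(10.0, 25.0):.2f}x")  # backend uses more memory
```

Under this reading, a value such as 0.42x would mean the backend's peak footprint is roughly 2.4 times that of eager.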

Metrics over time

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/32Bt0mZ.png)

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/rr2FzrO.png)

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.001 | 0.9047 | 1.88 | 0.6659 | 5.2929 | 1.4814 | | timm_efficientdet | 1 | 0.9809 | 0.7864 | 0.0 | 0.0 | 4.2755 | 1.7545 | | functorch_dp_cifar10 | 64 | 0.9906 | 0.9393 | 1.5989 | 0.0 | 3.568 | 1.5469 | | BERT_pytorch | 16 | 1.0106 | 0.8376 | 0.0 | 0.0 | 3.468 | 2.4041 | | timm_vision_transformer | 8 | 1.0028 | 0.8449 | 1.7903 | 0.6228 | 3.215 | 1.5495 | | hf_T5_large | 2 | 1.0159 | 0.8599 | 0.0 | 0.0 | 2.5755 | 2.1407 | | drq | 1 | 1.0002 | 0.7937 | 1.3343 | 0.5744 | 2.4896 | 1.1953 | | mobilenet_v3_large | 32 | 1.0047 | 1.0052 | 1.1991 | 0.742 | 2.4501 | 1.5737 | | resnext50_32x4d | 8 | 1.0022 | 0.9486 | 1.3175 | 0.6522 | 2.4415 | 1.379 | | hf_Albert | 8 | 1.0002 | 0.9562 | 0.7739 | 0.0 | 2.3833 | 2.3246 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9957 | 1.0123 | 1.4122 | 0.0 | 2.2841 | 1.8326 | | mnasnet1_0 | 32 | 0.9981 | 1.0306 | 1.0733 | 0.7326 | 2.1345 | 1.5445 | | pytorch_struct | 200 | 0.9819 | 0.743 | 1.0199 | 0.6322 | 2.11 | 1.2766 | | lennard_jones | 1000 | 0.9615 | 0.7623 | 1.2933 | 0.4617 | 2.0498 | 1.0444 | | hf_Bert | 4 | 1.0343 | 0.8569 | 0.9421 | 0.0 | 2.0464 | 1.8641 | | resnet18 | 16 | 1.0049 | 1.0117 | 1.1615 | 0.7223 | 1.9824 | 1.4145 | | hf_GPT2 | 4 | 1.0237 | 0.9858 | 0.0 | 0.3008 | 1.9516 | 1.9066 | | hf_T5 | 8 | 0.9998 | 0.927 | 0.0 | 1.1398 | 1.8855 | 1.892 | | squeezenet1_1 | 32 | 1.0002 | 0.9621 | 1.0918 | 0.6848 | 1.8653 | 1.4243 | | timm_resnest | 32 | 0.9979 | 0.9868 | 0.814 | 0.0 | 1.8562 | 1.7153 | | hf_Bart | 4 | 1.0124 | 0.8313 | 0.0 | 0.0 | 1.7793 | 1.8567 | | dcgan | 32 | 1.0181 | 0.9046 | 1.0656 | 0.6141 | 1.7288 
| 1.0278 | | speech_transformer | 32 | 1.0045 | 0.8391 | 1.9589 | 0.0 | 1.7265 | 1.6984 | | soft_actor_critic | 256 | 0.9736 | 0.7478 | 1.3134 | 0.549 | 1.695 | 1.018 | | timm_efficientnet | 32 | 0.9507 | 0.7804 | 0.872 | 0.0 | 1.688 | 1.3609 | | attention_is_all_you_need_pytorch | 256 | 1.009 | 0.89 | 0.0 | 0.6268 | 1.574 | 1.4911 | | mobilenet_v2 | 96 | 0.9999 | 0.9895 | 0.7609 | 1.2056 | 1.559 | 1.5169 | | fastNLP_Bert | 6 | 0.9975 | 0.8975 | 0.7665 | 0.0 | 1.5269 | 1.473 | | shufflenet_v2_x1_0 | 128 | 1.0002 | 1.0442 | 0.8823 | 0.873 | 1.5173 | 1.4042 | | hf_DistilBert | 8 | 1.0019 | 0.9703 | 0.7435 | 0.2952 | 1.5158 | 1.4881 | | timm_nfnet | 128 | 0.9992 | 0.9993 | 0.0 | 1.0813 | 1.511 | 1.4334 | | LearningToPaint | 96 | 1.0029 | 1.0181 | 0.9219 | 0.8118 | 1.508 | 1.3483 | | resnet50 | 32 | 1.0036 | 1.0443 | 0.817 | 0.7968 | 1.363 | 1.288 | | pytorch_stargan | 16 | 0.998 | 1.1357 | 0.9636 | 0.0 | 1.3616 | 1.3195 | | pytorch_unet | 1 | 0.9993 | 0.9922 | 0.8621 | 1.0847 | 1.3388 | 1.3146 | | Super_SloMo | 6 | 0.9995 | 0.9947 | 0.885 | 0.0 | 1.2878 | 1.2591 | | vgg16 | 64 | 0.9997 | 0.9975 | 0.857 | 0.9727 | 1.2725 | 1.2643 | | Background_Matting | 4 | 0.9997 | 1.0162 | 0.888 | 1.0668 | 1.2358 | 1.2195 | | alexnet | 128 | 0.999 | 1.0095 | 0.8139 | 0.9333 | 1.2111 | 1.2066 | | hf_Reformer | 4 | 0.9982 | 0.9994 | 0.989 | 0.0 | 1.176 | 1.1763 | | timm_vision_transformer_large | 8 | 1.0001 | 0.9901 | 0.0 | 0.0 | 1.1566 | 1.1381 | | hf_BigBird | 2 | 0.9956 | 0.9179 | 1.0502 | 0.8352 | 1.1514 | 1.0266 | | timm_regnet | 32 | 0.9424 | 0.9323 | 0.7743 | 0.0 | 1.1364 | 1.0633 | | timm_vovnet | 32 | 0.9027 | 0.8845 | 0.7268 | 0.7912 | 1.1311 | 1.1024 | | yolov3 | 16 | 0.9997 | 0.9899 | 0.8052 | 0.0 | 1.0912 | 1.0661 | | tts_angular | 64 | 0.9754 | 0.9327 | 0.9762 | 0.958 | 1.0054 | 1.0154 | | demucs | 4 | 1.0007 | 0.9999 | 1.0009 | 1.0001 | 0.9998 | 1.0009 | | dlrm | 2048 | 1.0486 | 1.0937 | 0.0 | 1.0509 | 0.9958 | 1.2434 | | nvidia_deeprecommender | 256 | 0.9983 | 0.9959 | 
0.697 | 1.0185 | 0.9891 | 1.0305 | | hf_GPT2_large | 4 | 1.0002 | 0.9905 | 0.0 | 0.0 | 0.0 | 1.8669 | | tacotron2 | 64 | 0.9761 | 0.7359 | 0.9515 | 0.582 | 0.0 | 0.8607 | | hf_Longformer | 2 | 0.9526 | 0.8656 | 0.8856 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | pass | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | 
fail_to_run | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | pass | fail_to_run | pass | pass | | yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | 
pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 3.1019 | 8.3422 | 11.6879 | nan | 403.2431 | 417.214 | | timm_efficientdet | 1 | 19.6409 | 36.9645 | nan | nan | 133.9419 | 133.699 | | hf_T5_large | 2 | 13.9913 | 39.389 | nan | nan | 114.1342 | 112.679 | | timm_vision_transformer_large | 8 | 2.8449 | 15.0985 | nan | nan | 68.7173 | 66.5047 | | densenet121 | 4 | 2.246 | 11.8672 | 19.224 | 390.9763 | 47.2667 | 46.4024 | | hf_BigBird | 2 | 8.1722 | 15.2167 | 30.1318 | 117.7689 | 43.8864 | 30.1517 | | timm_resnest | 32 | 0.6064 | 2.54 | 3.8127 
| nan | 38.3383 | 37.2622 | | attention_is_all_you_need_pytorch | 256 | 1.2896 | 7.2209 | nan | 169.8101 | 34.5882 | 33.9474 | | timm_vision_transformer | 8 | 0.945 | 4.4876 | 6.6976 | 89.9861 | 34.5309 | 33.8293 | | BERT_pytorch | 16 | 1.7472 | 7.6955 | nan | nan | 31.7326 | 31.0696 | | hf_Bart | 4 | 1.8065 | 8.8086 | nan | nan | 31.444 | 31.8325 | | timm_nfnet | 128 | 2.0963 | 7.1766 | nan | 175.2127 | 30.8125 | 30.3152 | | hf_T5 | 8 | 2.4903 | 8.9589 | nan | 101.4508 | 28.8873 | 28.0101 | | fastNLP_Bert | 6 | 1.7545 | 7.0668 | 11.3765 | nan | 28.2459 | 26.0499 | | pytorch_stargan | 16 | 0.4159 | 1.9479 | 2.8565 | nan | 27.8331 | 27.8789 | | timm_regnet | 32 | 2.3498 | 7.9628 | 19.5523 | nan | 26.846 | 25.8591 | | speech_transformer | 32 | 1.9201 | 8.7879 | 67.4912 | nan | 26.0148 | 25.7645 | | timm_efficientnet | 32 | 1.8272 | 6.6944 | 15.471 | nan | 25.7903 | 25.2972 | | mobilenet_v3_large | 32 | 0.9569 | 4.7526 | 7.2074 | 148.3817 | 25.1931 | 24.5998 | | pytorch_struct | 200 | 0.283 | 0.85 | 1.4563 | 7.8754 | 24.0851 | 18.3436 | | mnasnet1_0 | 32 | 0.8662 | 4.3294 | 6.7287 | 105.3488 | 20.6106 | 19.8488 | | hf_Bert | 4 | 1.6084 | 6.8483 | 10.1077 | nan | 20.3516 | 19.5741 | | resnet50 | 32 | 0.9123 | 4.5221 | 6.7655 | 122.5109 | 19.7755 | 19.2171 | | hf_Albert | 8 | 1.382 | 6.415 | 9.9697 | nan | 19.6841 | 19.058 | | hf_GPT2 | 4 | 1.5236 | 6.2508 | nan | 80.706 | 19.6719 | 18.8732 | | shufflenet_v2_x1_0 | 128 | 1.008 | 5.0082 | 7.5616 | 122.504 | 19.5504 | 19.2605 | | timm_vovnet | 32 | 1.5175 | 4.3979 | 9.833 | 76.7979 | 19.5444 | 18.8177 | | hf_Reformer | 4 | 1.6081 | 3.0671 | 6.0805 | nan | 19.5436 | 16.1662 | | resnext50_32x4d | 8 | 0.9639 | 4.5911 | 6.8183 | 109.3638 | 19.2394 | 18.9842 | | Super_SloMo | 6 | 1.0916 | 4.6248 | 6.5635 | nan | 18.9289 | 19.2667 | | mobilenet_v2 | 96 | 0.8518 | 4.5108 | 6.992 | 148.9779 | 18.551 | 18.1259 | | Background_Matting | 4 | 1.0155 | 4.6837 | 7.0886 | 93.3643 | 18.24 | 17.9737 | | hf_DistilBert | 8 | 0.6322 | 3.3956 
| 7.1961 | 56.313 | 13.4271 | 12.9535 |
| functorch_dp_cifar10 | 64 | 0.3789 | 1.6633 | 2.4313 | nan | 13.104 | 12.9868 |
| resnet18 | 16 | 0.4403 | 1.7888 | 2.6449 | 43.0366 | 11.386 | 10.9793 |
| pytorch_unet | 1 | 0.4537 | 1.999 | 3.0079 | 42.0987 | 9.0633 | 8.5562 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.4326 | 1.9796 | 2.8145 | nan | 8.9294 | 8.6703 |
| LearningToPaint | 96 | 0.4644 | 1.8812 | 2.9688 | 54.6466 | 7.7771 | 7.5577 |
| squeezenet1_1 | 32 | 0.2577 | 0.9316 | 1.3972 | 6.8667 | 4.8105 | 4.6493 |
| vgg16 | 64 | 0.2048 | 0.6565 | 1.0667 | 5.8613 | 4.0779 | 3.7611 |
| drq | 1 | 0.1689 | 0.4894 | 0.9117 | 6.108 | 4.0024 | 3.5849 |
| nvidia_deeprecommender | 256 | 0.2352 | 0.5155 | 0.8416 | 7.1974 | 3.6395 | 3.2773 |
| dlrm | 2048 | 0.4649 | 0.847 | nan | 5.9005 | 3.6187 | 3.3449 |
| soft_actor_critic | 256 | 0.2142 | 0.3642 | 0.609 | 3.8351 | 3.5014 | 2.8463 |
| alexnet | 128 | 0.1732 | 0.4416 | 0.7115 | 5.4243 | 3.2697 | 3.2566 |
| dcgan | 32 | 0.1746 | 0.4266 | 0.6576 | 5.9451 | 2.8542 | 2.6239 |
| lennard_jones | 1000 | 0.1551 | 0.3498 | 0.5231 | 3.3197 | 2.199 | 1.9362 |
| tts_angular | 64 | 0.2267 | 0.2777 | 0.4251 | 1.4574 | 1.9232 | 1.7451 |
| demucs | 4 | 0.3509 | 0.3521 | 0.3513 | 0.3598 | 0.2784 | 0.2821 |
| tacotron2 | 64 | 18.186 | 32.8702 | 49.9421 | 106.572 | nan | 64.272 |
| hf_GPT2_large | 4 | 5.551 | 19.729 | nan | nan | nan | 51.1832 |
| hf_Longformer | 2 | 6.47 | 14.3448 | 62.2958 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| timm_efficientnet | 32 | 0.9892 | 0.7707 | 0.2722 | nan | 1.2042 | 1.2299 |
| hf_Albert | 8 | 0.9814 | 0.936 | 0.3267 | nan | 1.1576 | 1.4693 |
| speech_transformer | 32 | 1.0017 | 0.9174 | 0.3318 | nan | 1.1102 | 1.1173 |
| mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.312 | 0.5997 | 1.0603 | 1.1512 |
| Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3843 | nan | 1.0536 | 1.2945 |
| timm_nfnet | 128 | 0.9691 | 0.8985 | nan | 0.7873 | 1.0337 | 1.1293 |
| attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | nan | 0.7853 | 1.0179 | 1.1759 |
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9866 |
| Background_Matting | 4 | 1.0059 | 0.9548 | 0.3707 | 0.9233 | 0.9831 | 1.0338 |
| timm_efficientdet | 1 | 1.0254 | 0.8401 | nan | nan | 0.9822 | 1.011 |
| BERT_pytorch | 16 | 1.0 | 0.8822 | nan | nan | 0.9741 | 1.1226 |
| hf_GPT2 | 4 | 0.9706 | 0.8847 | nan | 0.8601 | 0.9649 | 1.1243 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0017 | 0.8787 | 0.424 | nan | 0.951 | 1.0082 |
| timm_regnet | 32 | 0.9945 | 0.8449 | 0.3499 | nan | 0.9371 | 1.0312 |
| hf_T5 | 8 | 0.9678 | 0.9331 | nan | 0.9059 | 0.9309 | 1.252 |
| pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3572 | 0.7273 | 0.911 | 1.0853 |
| yolov3 | 16 | 0.985 | 0.8338 | 0.3517 | nan | 0.901 | 1.0402 |
| timm_vision_transformer_large | 8 | 0.9973 | 0.8357 | nan | nan | 0.879 | 0.9542 |
| timm_resnest | 32 | 0.9875 | 0.8721 | 0.3482 | nan | 0.876 | 0.9969 |
| densenet121 | 4 | 0.9883 | 0.866 | 0.3662 | 0.7966 | 0.876 | 1.0026 |
| hf_Bert | 4 | 0.9844 | 0.8753 | 0.3903 | nan | 0.8735 | 0.942 |
| squeezenet1_1 | 32 | 0.9595 | 0.7951 | 0.346 | 0.5757 | 0.8731 | 1.0627 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8419 | 0.3593 | 0.6692 | 0.8727 | 0.9966 |
| fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8657 | 1.0681 |
| resnet50 | 32 | 0.9888 | 0.8617 | 0.3558 | 0.733 | 0.8647 | 0.8839 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 |
| hf_DistilBert | 8 | 0.9505 | 0.8806 | 0.3413 | 0.8347 | 0.8384 | 0.9049 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.6247 | 0.8283 | 0.9695 |
| hf_Bart | 4 | 0.9102 | 0.831 | nan | nan | 0.8232 | 0.9878 |
| hf_BigBird | 2 | 0.9837 | 0.9784 | 0.4542 | 0.9208 | 0.8111 | 1.096 |
| alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.7444 | 0.7973 | 1.0079 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8503 | 0.3456 | 0.6025 | 0.7902 | 0.816 |
| pytorch_stargan | 16 | 0.9952 | 0.9697 | 0.427 | nan | 0.782 | 0.8863 |
| timm_vovnet | 32 | 0.9895 | 0.7676 | 0.3404 | 0.7217 | 0.7791 | 0.8856 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.6125 | 0.7633 | 1.0588 |
| resnext50_32x4d | 8 | 0.9947 | 0.8545 | 0.388 | 0.7725 | 0.7622 | 0.7746 |
| mnasnet1_0 | 32 | 0.9788 | 0.8617 | 0.3406 | 0.6588 | 0.7529 | 0.7734 |
| drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8295 | 0.752 | 0.9256 |
| LearningToPaint | 96 | 0.9245 | 0.7232 | 0.3857 | 0.5464 | 0.7382 | 0.9269 |
| soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4736 | 0.878 | 0.7295 | 1.0367 |
| timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3922 | 0.8515 | 0.7151 | 0.7249 |
| dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | 0.704 | 0.7306 |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.3943 | 0.5571 | 0.6102 | 0.6257 |
| lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 |
| hf_Reformer | 4 | 0.9861 | 0.9861 | 0.5889 | nan | 0.5295 | 0.9885 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4447 | nan | 0.4478 | 0.4806 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5081 | 0.4235 | 0.4353 |
| hf_GPT2_large | 4 | 0.9582 | 0.8718 | nan | nan | nan | 1.1354 |
| tacotron2 | 64 | 0.9866 | 0.3963 | 0.3143 | 0.3471 | nan | 0.4114 |
| hf_Longformer | 2 | 0.9734 | 0.967 | 0.3491 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| MobileBertForMaskedLM | 32 | 1.0172 | 0.8228 | 0.0 | 0.0 | 4.8803 | 1.8229 |
| YituTechConvBert | 1 | 1.019 | 0.8224 | 0.0 | 0.0 | 4.5913 | 1.6349 |
| CamemBert | 1 | 1.0394 | 0.8274 | 1.8693 | 0.0 | 4.2222 | 1.7729 |
| MobileBertForQuestionAnswering | 64 | 1.0175 | 0.8296 | 0.0 | 0.0 | 3.7108 | 1.8122 |
| MT5ForConditionalGeneration | 8 | 1.0161 | 0.8523 | 0.0 | 0.0 | 3.6067 | 2.5302 |
| DistillGPT2 | 1 | 1.0283 | 0.8703 | 1.5759 | 0.2477 | 2.762 | 2.0174 |
| M2M100ForConditionalGeneration | 8 | 1.011 | 0.8259 | 1.2734 | 0.0 | 2.6496 | 1.8618 |
| GPT2ForSequenceClassification | 4 | 1.0013 | 0.9693 | 0.0 | 0.4966 | 2.3173 | 2.2816 |
| MegatronBertForCausalLM | 16 | 1.0286 | 0.8502 | 1.062 | 0.0 | 2.1945 | 1.7767 |
| ElectraForQuestionAnswering | 64 | 1.0004 | 0.9784 | 0.7625 | 0.0 | 2.1226 | 2.0655 |
| MegatronBertForQuestionAnswering | 16 | 1.0285 | 0.8587 | 1.0636 | 0.0 | 2.0355 | 1.8162 |
| PLBartForConditionalGeneration | 16 | 1.0131 | 0.834 | 0.0 | 0.0 | 1.9943 | 1.7638 |
| MBartForConditionalGeneration | 16 | 1.0143 | 0.8404 | 0.0 | 0.0 | 1.9835 | 1.6133 |
| LayoutLMForSequenceClassification | 16 | 1.0003 | 0.9797 | 0.7712 | 0.0 | 1.8614 | 1.799 |
| ElectraForCausalLM | 32 | 1.0003 | 0.9309 | 0.7148 | 0.0 | 1.8335 | 1.8331 |
| XGLMForCausalLM | 8 | 1.01 | 0.8225 | 1.0923 | 0.2586 | 1.7755 | 1.6227 |
| PegasusForConditionalGeneration | 16 | 1.0068 | 0.8389 | 1.0866 | 0.0 | 1.6941 | 1.5544 |
| T5Small | 1 | 1.0235 | 0.8728 | 0.0 | 0.0 | 1.692 | 1.4053 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.8857 | 0.0 | 0.0 | 1.6707 | 1.6604 |
| AlbertForMaskedLM | 4 | 1.0001 | 0.8855 | 0.0 | 0.0 | 1.6613 | 1.6506 |
| LayoutLMForMaskedLM | 16 | 1.0003 | 0.9617 | 0.7513 | 0.0 | 1.652 | 1.6283 |
| T5ForConditionalGeneration | 4 | 1.0006 | 0.9158 | 0.0 | 0.9672 | 1.6175 | 1.5906 |
| Speech2Text2ForCausalLM | 128 | 1.0026 | 0.9321 | 0.7236 | 0.0 | 1.5808 | 1.5611 |
| OPTForCausalLM | 32 | 1.0098 | 0.9271 | 0.0 | 0.3064 | 1.5617 | 1.5397 |
| DebertaForMaskedLM | 4 | 0.9239 | 0.7325 | 0.9205 | 0.0 | 1.5053 | 1.1791 |
| BertForQuestionAnswering | 128 | 1.0002 | 0.9835 | 0.7719 | 0.0 | 1.4987 | 1.4724 |
| DistilBertForQuestionAnswering | 64 | 0.9999 | 0.9503 | 0.7596 | 0.296 | 1.4931 | 1.4836 |
| RobertaForQuestionAnswering | 128 | 0.9998 | 0.9832 | 0.7738 | 0.0 | 1.4731 | 1.4743 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0058 | 0.9139 | 0.0 | 0.0 | 1.4713 | 1.4726 |
| BartForConditionalGeneration | 2 | 1.0049 | 0.9687 | 0.0 | 0.3125 | 1.4586 | 1.4267 |
| BartForCausalLM | 4 | 0.9997 | 0.968 | 0.0 | 0.0 | 1.4476 | 1.4442 |
| RobertaForCausalLM | 64 | 1.0 | 0.9585 | 0.7465 | 0.0 | 1.4349 | 1.4302 |
| BertForMaskedLM | 64 | 1.0004 | 0.9554 | 0.7406 | 0.0 | 1.3509 | 1.3358 |
| PLBartForCausalLM | 32 | 1.004 | 0.9396 | 0.7904 | 0.0 | 1.2854 | 1.2795 |
| BlenderbotSmallForCausalLM | 64 | 1.0035 | 0.9267 | 0.7188 | 0.0 | 1.2645 | 1.287 |
| DistilBertForMaskedLM | 64 | 0.9997 | 0.9518 | 0.709 | 0.3199 | 1.2638 | 1.2664 |
| MBartForCausalLM | 32 | 1.0032 | 0.9501 | 0.0 | 0.0 | 1.2541 | 1.1984 |
| TrOCRForCausalLM | 32 | 1.0014 | 0.9486 | 0.0 | 0.0 | 1.1955 | 1.1924 |
| DebertaForQuestionAnswering | 8 | 0.9931 | 0.8768 | 0.7229 | 0.0 | 1.1932 | 1.2198 |
| PegasusForCausalLM | 32 | 1.0014 | 0.9544 | 0.751 | 0.0 | 1.1897 | 1.1851 |
| BigBird | 1 | 0.9919 | 0.9082 | 1.0304 | 0.0 | 1.1472 | 1.0294 |
| AllenaiLongformerBase | 1 | 0.9358 | 0.7265 | 0.9252 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 5.1095 | 11.1103 | 35.5171 | nan | 99.068 | 38.7155 |
| DebertaForMaskedLM | 4 | 5.2109 | 11.1671 | 36.0997 | nan | 98.8949 | 37.1335 |
| XGLMForCausalLM | 8 | 2.7391 | 13.5243 | 52.9355 | 266.9685 | 75.2207 | 73.5766 |
| MobileBertForMaskedLM | 32 | 9.4954 | 32.257 | nan | nan | 69.9782 | 64.627 |
| M2M100ForConditionalGeneration | 8 | 3.657 | 17.0953 | 27.4483 | nan | 68.1905 | 59.7939 |
| MobileBertForQuestionAnswering | 64 | 9.5225 | 32.4567 | nan | nan | 65.4136 | 63.6043 |
| PegasusForConditionalGeneration | 16 | 3.4202 | 17.7277 | 27.6826 | nan | 53.1692 | 48.9871 |
| MBartForConditionalGeneration | 16 | 3.5011 | 16.8622 | nan | nan | 50.4775 | 49.0242 |
| BartForConditionalGeneration | 2 | 3.5257 | 16.9564 | nan | 402.2298 | 50.1043 | 49.023 |
| BigBird | 1 | 8.321 | 15.4707 | 31.4819 | nan | 44.3614 | 28.6819 |
| YituTechConvBert | 1 | 2.5371 | 10.8041 | nan | nan | 44.3172 | 40.754 |
| MegatronBertForCausalLM | 16 | 3.7105 | 14.3218 | 23.2736 | nan | 41.4696 | 39.2219 |
| MegatronBertForQuestionAnswering | 16 | 3.7138 | 14.1172 | 22.1442 | nan | 39.8278 | 38.7063 |
| MT5ForConditionalGeneration | 8 | 3.7621 | 12.7501 | nan | nan | 37.5667 | 35.892 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.2111 | 11.1677 | nan | nan | 34.7555 | 32.7944 |
| T5Small | 1 | 2.3675 | 8.7501 | nan | nan | 31.6615 | 30.5684 |
| T5ForConditionalGeneration | 4 | 2.3872 | 8.7096 | nan | 103.989 | 29.9551 | 29.3924 |
| PLBartForConditionalGeneration | 16 | 1.799 | 8.7068 | nan | nan | 29.9296 | 29.4767 |
| LayoutLMForSequenceClassification | 16 | 2.0802 | 7.4395 | 11.8835 | nan | 28.3241 | 27.2161 |
| ElectraForCausalLM | 32 | 1.743 | 7.2864 | 11.1242 | nan | 27.4804 | 24.8538 |
| PegasusForCausalLM | 32 | 1.2735 | 6.4224 | 10.2415 | nan | 25.5703 | 23.3271 |
| MBartForCausalLM | 32 | 1.2571 | 6.3217 | nan | nan | 24.1306 | 22.7253 |
| LayoutLMForMaskedLM | 16 | 2.0444 | 7.4802 | 11.8846 | nan | 22.7716 | 21.8339 |
| TrOCRForCausalLM | 32 | 1.2466 | 6.3432 | nan | nan | 22.6472 | 21.6722 |
| BartForCausalLM | 4 | 1.3009 | 6.4611 | nan | nan | 22.6099 | 21.4643 |
| BertForMaskedLM | 64 | 1.6141 | 7.3073 | 10.2931 | nan | 21.8908 | 21.0838 |
| OPTForCausalLM | 32 | 1.3218 | 6.4969 | nan | 111.0135 | 21.8161 | 20.9369 |
| ElectraForQuestionAnswering | 64 | 1.6881 | 7.0473 | 10.7501 | nan | 21.7938 | 20.5196 |
| RobertaForCausalLM | 64 | 1.5995 | 6.9289 | 10.5618 | nan | 21.5658 | 20.5392 |
| BertForQuestionAnswering | 128 | 1.6017 | 7.4011 | 10.5464 | nan | 20.9961 | 20.0381 |
| RobertaForQuestionAnswering | 128 | 1.6579 | 6.8479 | 10.5019 | nan | 20.3386 | 19.1481 |
| CamemBert | 1 | 1.7067 | 7.042 | 10.6348 | nan | 20.3264 | 20.0115 |
| GPT2ForSequenceClassification | 4 | 1.5184 | 6.2326 | nan | 83.2239 | 19.2188 | 18.9186 |
| AlbertForMaskedLM | 4 | 1.501 | 6.7249 | nan | nan | 19.1891 | 18.0312 |
| AlbertForQuestionAnswering | 4 | 1.4421 | 6.5428 | nan | nan | 18.6982 | 17.5413 |
| BlenderbotSmallForCausalLM | 64 | 0.8844 | 4.3416 | 6.5906 | nan | 17.016 | 16.7401 |
| PLBartForCausalLM | 32 | 0.664 | 3.3298 | 4.9017 | nan | 15.841 | 14.7605 |
| Speech2Text2ForCausalLM | 128 | 0.6985 | 3.2189 | 5.5963 | nan | 15.4493 | 14.5381 |
| DistillGPT2 | 1 | 0.8114 | 3.1762 | 4.7938 | 55.0455 | 13.4961 | 13.4034 |
| DistilBertForMaskedLM | 64 | 0.6015 | 3.3324 | 7.2285 | 56.6247 | 13.005 | 12.4303 |
| DistilBertForQuestionAnswering | 64 | 0.6369 | 3.3455 | 7.602 | 54.7173 | 12.2971 | 11.8192 |
| AllenaiLongformerBase | 1 | 7.103 | 15.0884 | 60.0037 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 |
| AlbertForMaskedLM | 4 | 0.9998 | 0.7431 | nan | nan | 1.0992 | 1.5169 |
| GPT2ForSequenceClassification | 4 | 0.9675 | 0.9163 | nan | 0.8823 | 1.0775 | 1.1632 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | nan | nan | 1.0568 | 1.1144 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | 0.8978 | 0.9837 | 1.1976 |
| PegasusForCausalLM | 32 | 0.9749 | 0.8906 | 0.4175 | nan | 0.9708 | 1.0363 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9527 | nan | 0.8385 | 0.9662 | 1.1856 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | nan | nan | 0.9593 | 1.1105 |
| T5Small | 1 | 1.0 | 0.8865 | nan | nan | 0.9567 | 1.1277 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3662 | nan | 0.9481 | 0.9848 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | nan | nan | 0.9417 | 1.0114 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3786 | nan | 0.9293 | 0.9793 |
| RobertaForCausalLM | 64 | 0.9991 | 0.8994 | 0.3788 | nan | 0.9289 | 0.9789 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3469 | 0.81 | 0.9267 | 1.0655 |
| OPTForCausalLM | 32 | 0.9996 | 0.8679 | nan | 0.6772 | 0.925 | 1.0061 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | nan | nan | 0.9218 | 1.0986 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | nan | nan | 0.921 | 0.9877 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.4377 | nan | 0.9159 | 1.0993 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0179 |
| MegatronBertForCausalLM | 16 | 0.9998 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0276 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | nan | nan | 0.8843 | 1.0294 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8599 | 0.3635 | 0.6373 | 0.8803 | 0.948 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | nan | nan | 0.8751 | 0.919 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8196 | 0.3532 | nan | 0.8691 | 0.9801 |
| ElectraForCausalLM | 32 | 0.9974 | 0.848 | 0.3928 | nan | 0.856 | 0.9308 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3979 | nan | 0.8549 | 0.9361 |
| BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | 0.3687 | nan | 0.846 | 0.9426 |
| BigBird | 1 | 1.0008 | 0.9547 | 0.4478 | nan | 0.8178 | 1.0885 |
| CamemBert | 1 | 0.9989 | 0.8143 | 0.416 | nan | 0.8061 | 0.9309 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9231 | 0.4336 | 0.7755 | 0.8055 | 0.9902 |
| DistillGPT2 | 1 | 0.9963 | 0.7984 | 0.4006 | 0.7468 | 0.7997 | 1.016 |
| YituTechConvBert | 1 | 0.9718 | 0.8664 | nan | nan | 0.7883 | 0.9276 |
| M2M100ForConditionalGeneration | 8 | 0.9892 | 0.9354 | 0.4372 | nan | 0.7431 | 1.0255 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | nan | nan | 0.6698 | 0.9454 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | nan | nan | 0.6085 | 0.8221 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9824 | 0.3624 | nan | 0.409 | 1.0674 |
| DebertaForQuestionAnswering | 8 | 0.9754 | 1.0737 | 0.3252 | nan | 0.3071 | 1.1931 |
| AllenaiLongformerBase | 1 | 0.9977 | 0.9476 | 0.3853 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
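
For a one-number view of a speedup table like the one above, a geometric mean per backend is one option. The following is only a minimal sketch, not the dashboard's own tooling: `parse_table` and `geomean_speedup` are hypothetical helper names, `SAMPLE` is a trimmed three-row, three-backend excerpt of the huggingface speedup table, and skipping `0.0` and `nan` entries rests on the assumption that they mark runs that failed (the `0.0` speedups line up with `fail_to_run` rows in the accuracy table).

```python
import math


def parse_table(text):
    """Parse a '+---+' / '| a | b |' style table into a list of dict rows."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip '+---+' borders and blank lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(rows, backend):
    """Geometric mean of the positive, finite speedups in one backend column.

    Assumption: 0.0 and nan mark failed runs, so they are left out.
    """
    values = []
    for row in rows:
        value = float(row[backend])  # float("nan") parses without error
        if math.isfinite(value) and value > 0.0:
            values.append(value)
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Trimmed excerpt of the huggingface "Performance speedup" table.
SAMPLE = """
+-----------------------+----+--------+-----------+----------+
| name | bs | eager | aot_eager | inductor |
+-----------------------+----+--------+-----------+----------+
| MobileBertForMaskedLM | 32 | 1.0172 | 0.8228 | 4.8803 |
| YituTechConvBert | 1 | 1.019 | 0.8224 | 4.5913 |
| AllenaiLongformerBase | 1 | 0.9358 | 0.7265 | 0.0 |
+-----------------------+----+--------+-----------+----------+
"""

rows = parse_table(SAMPLE)
for backend in ("eager", "aot_eager", "inductor"):
    print(backend, round(geomean_speedup(rows, backend), 4))
```

Leaving failed runs out makes a backend that fails on many models look better than it is, so a summary like this should be read next to the accuracy table rather than instead of it.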

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| xcit_large_24_p8_224 | 5 | 1.0003 | 0.0 | 0.0 | 0.0 | 2.228 | 1.7504 |
| tnt_s_patch16_224 | 128 | 1.0 | 0.9984 | 0.0 | 0.0 | 2.1261 | 2.091 |
| ghostnet_100 | 128 | 0.9993 | 0.9767 | 0.8599 | 0.0 | 2.0076 | 1.9294 |
| twins_pcpvt_base | 64 | 1.0077 | 0.9305 | 0.8714 | 0.0 | 1.9923 | 1.7166 |
| lcnet_050 | 128 | 0.9549 | 0.9356 | 0.7851 | 1.1351 | 1.8681 | 1.7161 |
| regnety_002 | 128 | 0.974 | 0.9434 | 0.9019 | 0.8609 | 1.6983 | 1.5332 |
| volo_d1_224 | 64 | 0.9998 | 0.9942 | 0.0 | 0.0 | 1.5969 | 1.5641 |
| dla102 | 128 | 1.0 | 0.996 | 0.8386 | 0.0 | 1.583 | 1.5518 |
| gmlp_s16_224 | 128 | 0.9998 | 0.9436 | 0.0 | 0.5348 | 1.5568 | 1.5259 |
| nfnet_l0 | 128 | 0.9993 | 0.8099 | 0.7138 | 0.9321 | 1.5534 | 1.4589 |
| coat_lite_mini | 128 | 0.9998 | 0.9768 | 0.859 | 0.0 | 1.5531 | 1.531 |
| swin_base_patch4_window7_224 | 64 | 0.9995 | 0.9604 | 0.0 | 0.0 | 1.5435 | 1.4748 |
| hrnet_w18 | 128 | 1.0001 | 0.9947 | 0.8383 | 0.0 | 1.5422 | 1.4612 |
| resnest101e | 64 | 0.9995 | 0.9916 | 0.821 | 0.0 | 1.5358 | 1.433 |
| gmixer_24_224 | 128 | 0.9999 | 0.8428 | 0.0 | 0.0 | 1.5256 | 1.5097 |
| gluon_inception_v3 | 128 | 0.9999 | 0.9964 | 0.8548 | 1.1512 | 1.5056 | 1.4722 |
| adv_inception_v3 | 128 | 1.0 | 0.9958 | 0.8547 | 1.151 | 1.505 | 1.4715 |
| inception_v3 | 128 | 1.0 | 0.9934 | 0.8541 | 1.151 | 1.5031 | 1.468 |
| dm_nfnet_f0 | 128 | 1.0001 | 0.999 | 0.0 | 1.0865 | 1.5016 | 1.4305 |
| convnext_base | 64 | 0.9998 | 0.998 | 0.0 | 0.0 | 1.4715 | 1.4581 |
| res2net50_14w_8s | 128 | 0.9999 | 0.9897 | 0.8116 | 1.1568 | 1.4665 | 1.4056 |
| cait_m36_384 | 4 | 1.0001 | 1.0108 | 0.0 | 0.0 | 1.4638 | 1.4153 |
| mobilenetv3_large_100 | 128 | 0.9544 | 0.9448 | 0.7832 | 0.0 | 1.4568 | 1.4709 |
| crossvit_9_240 | 128 | 0.9999 | 0.9943 | 0.8384 | 0.0 | 1.445 | 1.4171 |
| selecsls42b | 128 | 0.9997 | 0.9954 | 0.8429 | 1.2947 | 1.4443 | 1.4123 |
| mnasnet_100 | 128 | 0.9534 | 0.9445 | 0.7887 | 1.2879 | 1.4364 | 1.4589 |
| res2next50 | 128 | 0.9997 | 0.9959 | 0.8321 | 1.1638 | 1.4128 | 1.3472 |
| mobilenetv2_100 | 128 | 0.9507 | 0.9357 | 0.7228 | 1.2573 | 1.4066 | 1.4323 |
| jx_nest_base | 32 | 0.9999 | 0.9932 | 0.0 | 0.0 | 1.392 | 1.3543 |
| fbnetv3_b | 128 | 0.9528 | 0.9408 | 0.7757 | 0.0 | 1.3913 | 1.4066 |
| ese_vovnet19b_dw | 128 | 0.9709 | 0.9629 | 0.7684 | 1.1959 | 1.3762 | 1.3793 |
| mobilevit_s | 64 | 0.9731 | 0.8144 | 0.6567 | 0.0 | 1.3722 | 1.3619 |
| spnasnet_100 | 128 | 0.945 | 0.9374 | 0.7783 | 1.1988 | 1.3692 | 1.4006 |
| fbnetc_100 | 128 | 0.9509 | 0.9442 | 0.7897 | 0.0 | 1.3545 | 1.3769 |
| tf_efficientnet_b0 | 128 | 0.9637 | 0.8051 | 0.6659 | 0.0 | 1.352 | 1.3566 |
| convit_base | 64 | 1.0001 | 0.9966 | 0.0 | 0.0 | 1.3387 | 1.3373 |
| poolformer_m36 | 64 | 0.9998 | 0.9986 | 0.8083 | 0.0 | 1.3294 | 1.2963 |
| pit_b_224 | 64 | 0.9996 | 0.9954 | 0.8215 | 0.6783 | 1.3218 | 1.3139 |
| cspdarknet53 | 64 | 0.9428 | 0.9336 | 0.7551 | 1.2624 | 1.3208 | 1.3453 |
| botnet26t_256 | 128 | 0.9794 | 0.9709 | 0.8098 | 0.0 | 1.2986 | 1.2934 |
| res2net101_26w_4s | 64 | 1.0037 | 1.0023 | 0.7914 | 0.0 | 1.2861 | 1.2335 |
| eca_botnext26ts_256 | 128 | 0.9805 | 0.8089 | 0.6708 | 1.1618 | 1.2807 | 1.2725 |
| deit_base_distilled_patch16_224 | 64 | 1.0001 | 0.9924 | 0.7973 | 0.5971 | 1.2806 | 1.2644 |
| beit_base_patch16_224 | 64 | 0.9998 | 0.9785 | 0.0 | 0.0 | 1.2794 | 1.2654 |
| mixer_b16_224 | 128 | 1.0 | 0.9593 | 0.7773 | 0.0 | 1.2787 | 1.2651 |
| rexnet_100 | 128 | 0.9636 | 0.8512 | 0.6913 | 0.0 | 1.2764 | 1.2711 |
| visformer_small | 128 | 0.9999 | 1.0018 | 0.8436 | 0.0 | 1.2407 | 1.1788 |
| tinynet_a | 128 | 0.9489 | 0.7956 | 0.6463 | 0.0 | 1.236 | 1.2399 |
| pnasnet5large | 16 | 0.9992 | 0.992 | 0.7948 | 0.0 | 1.2169 | 1.2329 |
| vit_base_patch16_224 | 64 | 0.9999 | 0.9931 | 0.8348 | 0.637 | 1.195 | 1.1839 |
| sebotnet33ts_256 | 64 | 0.966 | 0.8335 | 0.6751 | 1.0258 | 1.1866 | 1.2029 |
| tf_mixnet_l | 128 | 0.9805 | 0.9096 | 0.7964 | 0.0 | 1.1787 | 1.1707 |
| mixnet_l | 128 | 0.9796 | 0.9055 | 0.7953 | 0.0 | 1.1581 | 1.1539 |
| gluon_xception65 | 32 | 0.9995 | 0.9884 | 0.7546 | 0.0 | 1.1574 | 1.1257 |
| swsl_resnext101_32x16d | 32 | 0.9997 | 0.9847 | 0.8175 | 0.0 | 1.1441 | 1.0704 |
| repvgg_a2 | 128 | 0.9434 | 0.9358 | 0.8012 | 0.0 | 1.1426 | 1.1573 |
| dpn107 | 32 | 0.9311 | 0.9132 | 0.7443 | 0.0 | 1.1409 | 1.1521 |
| resmlp_12_224 | 128 | 0.9997 | 1.0085 | 0.7907 | 1.0592 | 1.1073 | 1.1395 |
| gernet_l | 128 | 0.9472 | 0.9379 | 0.7693 | 1.0723 | 1.069 | 1.0732 |
| convmixer_768_32 | 32 | 0.9999 | 0.998 | 0.9227 | 0.0 | 1.0559 | 1.0506 |
| eca_halonext26ts | 128 | 0.9811 | 0.8167 | 0.6782 | 0.0 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| gmixer_24_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | fail_accuracy | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| fbnetv3_b | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| gluon_xception65 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
~~~
Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 6.2737 | 30.1144 | 56.5118 | nan | 117.1443 | 112.2745 |
| twins_pcpvt_base | 64 | 2.8306 | 14.958 | 25.7384 | nan | 99.8265 | 96.5675 |
| pnasnet5large | 16 | 5.0899 | 22.0959 | 40.9018 | nan | 82.5617 | 80.0751 |
| xcit_large_24_p8_224 | 5 | 3.2174 | nan | nan | nan | 81.4524 | 75.3646 |
| swin_base_patch4_window7_224 | 64 | 2.8854 | 12.6895 | nan | nan | 78.5573 | 75.3785 |
| mobilevit_s | 64 | 1.9157 | 7.5584 | 15.3941 | nan | 75.6051 | 69.9929 |
| cait_m36_384 | 4 | 3.4573 | 19.5442 | nan | nan | 71.263 | 66.7497 |
| convnext_base | 64 | 1.4953 | 7.0952 | nan | nan | 71.0059 | 68.0831 |
| resnest101e | 64 | 3.4362 | 16.0954 | 27.2433 | nan | 68.3207 | 66.1693 |
| res2net101_26w_4s | 64 | 3.2719 | 16.1476 | 28.0669 | nan | 59.988 | 57.8445 |
| jx_nest_base | 32 | 1.8177 | 8.9739 | nan | nan | 59.0141 | 56.4404 |
| res2net50_14w_8s | 128 | 2.8963 | 14.5919 | 24.3322 | 429.0255 | 55.5525 | 51.8946 |
| sebotnet33ts_256 | 64 | 1.6923 | 6.0262 | 14.7512 | 190.0486 | 52.8296 | 50.0898 |
| coat_lite_mini | 128 | 1.2118 | 5.234 | 8.4948 | nan | 52.5414 | 51.4946 |
| poolformer_m36 | 64 | 2.0445 | 8.106 | 13.1404 | nan | 47.8697 | 45.4084 |
| gmlp_s16_224 | 128 | 1.2697 | 7.3137 | nan | 204.3953 | 43.3153 | 41.5146 |
| dpn107 | 32 | 4.089 | 13.296 | 38.861 | nan | 42.8777 | 40.9218 |
| eca_botnext26ts_256 | 128 | 1.5202 | 5.1257 | 10.7851 | 151.7105 | 42.1161 | 41.4194 |
| fbnetv3_b | 128 | 3.3292 | 11.5433 | 27.5262 | nan | 40.7296 | 38.8006 |
| volo_d1_224 | 64 | 1.4406 | 7.4705 | nan | nan | 40.7135 | 37.6314 |
| crossvit_9_240 | 128 | 1.6525 | 8.6329 | 13.8628 | nan | 40.6733 | 39.1454 |
| botnet26t_256 | 128 | 1.3788 | 4.3448 | 9.3565 | nan | 40.3315 | 39.4864 |
| gluon_xception65 | 32 | 2.0661 | 10.8081 | 18.0734 | nan | 39.6841 | 38.465 |
| tnt_s_patch16_224 | 128 | 1.9436 | 10.4847 | nan | nan | 39.299 | 37.3575 |
| inception_v3 | 128 | 1.7208 | 8.471 | 13.2771 | 204.4555 | 37.457 | 34.6516 |
| gluon_inception_v3 | 128 | 1.7431 | 8.289 | 13.28 | 201.9188 | 36.8506 | 34.639 |
| adv_inception_v3 | 128 | 1.6845 | 8.4631 | 13.2633 | 202.9434 | 36.7435 | 35.4308 |
| mixnet_l | 128 | 5.6522 | 12.4131 | 26.2082 | nan | 36.2128 | 33.3753 |
| tf_mixnet_l | 128 | 5.7655 | 12.752 | 26.8064 | nan | 36.0616 | 34.1135 |
| ghostnet_100 | 128 | 3.1729 | 9.592 | 14.448 | nan | 35.7794 | 34.6878 |
| gmixer_24_224 | 128 | 1.5475 | 8.3444 | nan | nan | 35.2423 | 32.6876 |
| dla102 | 128 | 1.8929 | 9.4355 | 15.2939 | nan | 34.7664 | 33.0414 |
| swsl_resnext101_32x16d | 32 | 1.9838 | 9.0422 | 14.8223 | nan | 34.1429 | 32.2967 |
| dm_nfnet_f0 | 128 | 2.1687 | 7.6652 | nan | 175.3094 | 31.6804 | 30.2873 |
| res2next50 | 128 | 1.7757 | 8.2179 | 12.8718 | 261.6936 | 31.4964 | 29.4527 |
| convit_base | 64 | 1.4009 | 6.1528 | nan | nan | 31.0617 | 29.8589 |
| rexnet_100 | 128 | 2.0992 | 7.2863 | 17.2375 | nan | 29.7665 | 28.6398 |
| tinynet_a | 128 | 2.1503 | 7.9655 | 20.2758 | nan | 29.0519 | 27.6025 |
| cspdarknet53 | 64 | 2.4101 | 7.3254 | 19.2249 | 164.6576 | 26.6654 | 24.5698 |
| tf_efficientnet_b0 | 128 | 1.922 | 7.0169 | 16.6281 | nan | 25.432 | 24.3008 |
| fbnetc_100 | 128 | 2.1134 | 6.74 | 17.3646 | nan | 24.5958 | 23.0541 |
| visformer_small | 128 | 1.0879 | 3.9437 | 6.6709 | nan | 24.4938 | 22.5798 |
| convmixer_768_32 | 32 | 1.2465 | 6.199 | 9.844 | nan | 24.2964 | 22.8918 |
| mixer_b16_224 | 128 | 0.834 | 3.795 | 6.9428 | nan | 24.2767 | 23.4136 |
| resmlp_12_224 | 128 | 0.7294 | 3.1186 | 6.8161 | 66.3836 | 24.2736 | 22.5395 |
| spnasnet_100 | 128 | 2.0622 | 6.5117 | 16.754 | 162.487 | 24.2242 | 22.507 |
| deit_base_distilled_patch16_224 | 64 | 1.0866 | 4.6147 | 7.2959 | 85.473 | 24.1252 | 23.2602 |
| vit_base_patch16_224 | 64 | 0.9959 | 4.6512 | 7.6936 | 96.2936 | 23.8381 | 22.4128 |
| nfnet_l0 | 128 | 1.917 | 7.2775 | 10.8839 | 158.8476 | 23.1506 | 22.3053 |
| mobilenetv3_large_100 | 128 | 1.6667 | 5.5408 | 13.1919 | nan | 22.7273 | 21.9031 |
| beit_base_patch16_224 | 64 | 1.4042 | 5.4752 | nan | nan | 22.1323 | 20.9479 |
| mobilenetv2_100 | 128 | 1.64 | 5.7045 | 12.9245 | 143.0942 | 20.8492 | 19.6917 |
| pit_b_224 | 64 | 1.1476 | 5.2318 | 8.5997 | 109.6685 | 20.5913 | 20.2604 |
| mnasnet_100 | 128 | 1.7589 | 5.2789 | 13.1004 | 127.3454 | 20.4271 | 19.239 |
| regnety_002 | 128 | 1.7295 | 5.7896 | 13.0207 | 140.4163 | 20.1457 | 19.5505 |
| gernet_l | 128 | 2.045 | 6.026 | 15.261 | 128.7103 | 19.8182 | 19.6134 |
| repvgg_a2 | 128 | 2.0955 | 6.0274 | 15.2765 | nan | 19.5669 | 18.654 |
| selecsls42b | 128 | 0.8599 | 3.8063 | 5.8819 | 102.2291 | 18.1047 | 17.2732 |
| lcnet_050 | 128 | 1.1604 | 3.3607 | 8.3092 | 90.8627 | 14.9955 | 13.8065 |
| ese_vovnet19b_dw | 128 | 1.0643 | 3.2042 | 6.71 | 68.4662 | 13.9419 | 13.2544 |
| eca_halonext26ts | 128 | 1.4804 | 5.0425 | 11.7709 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| tinynet_a | 128 | 0.9889 | 0.7887 | 0.2764 | nan | 1.37 | 1.5056 |
| gmixer_24_224 | 128 | 0.9926 | 0.9248
| nan | nan | 1.3102 | 1.3732 | | gmlp_s16_224 | 128 | 0.9938 | 0.9496 | nan | 0.9046 | 1.2842 | 1.2997 | | tf_efficientnet_b0 | 128 | 0.9877 | 0.7695 | 0.2664 | nan | 1.1888 | 1.3559 | | mobilevit_s | 64 | 0.993 | 0.7669 | 0.2733 | nan | 1.1832 | 1.3099 | | pnasnet5large | 16 | 1.0567 | 0.9911 | 0.3633 | nan | 1.1589 | 1.2896 | | rexnet_100 | 128 | 0.9884 | 0.7848 | 0.2852 | nan | 1.147 | 1.3177 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2672 | 0.5588 | 1.1068 | 1.2643 | | poolformer_m36 | 64 | 0.9983 | 0.9433 | 0.3413 | nan | 1.1018 | 1.1171 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0828 | 1.1492 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3474 | nan | 1.0595 | 1.1461 | | mobilenetv2_100 | 128 | 0.9859 | 0.7635 | 0.3107 | 0.5982 | 1.0588 | 1.1524 | | convit_base | 64 | 0.9966 | 0.8516 | nan | nan | 1.0528 | 1.1534 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | nan | nan | 1.0378 | 1.1389 | | dm_nfnet_f0 | 128 | 0.9692 | 0.8981 | nan | 0.7871 | 1.0336 | 1.1292 | | nfnet_l0 | 128 | 0.9887 | 0.8167 | 0.2681 | 0.524 | 1.0318 | 1.1803 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 0.7316 | 0.9907 | 1.2281 | | fbnetv3_b | 128 | 0.9872 | 0.783 | 0.3151 | nan | 0.986 | 1.043 | | convmixer_768_32 | 32 | 0.9972 | 0.9785 | 0.3447 | nan | 0.9759 | 0.9792 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9749 | 1.0803 | | visformer_small | 128 | 0.9897 | 0.9255 | 0.3467 | nan | 0.9613 | 1.0514 | | dla102 | 128 | 0.9684 | 0.9114 | 0.3366 | nan | 0.9555 | 1.0311 | | ghostnet_100 | 128 | 0.9758 | 0.8691 | 0.337 | nan | 0.9485 | 1.0705 | | tf_mixnet_l | 128 | 0.9907 | 0.8555 | 0.2874 | nan | 0.9364 | 1.0873 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9931 | | mobilenetv3_large_100 | 128 | 0.9773 | 0.8402 | 0.3302 | nan | 0.9298 | 1.0259 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.929 | 0.9775 | | ese_vovnet19b_dw | 128 | 0.9855 | 0.8559 
| 0.3271 | 0.7015 | 0.9176 | 1.0681 | | swsl_resnext101_32x16d | 32 | 0.9988 | 0.8771 | 0.3668 | nan | 0.9093 | 0.9794 | | mixer_b16_224 | 128 | 0.992 | 0.9362 | 0.3444 | nan | 0.9073 | 0.9799 | | dpn107 | 32 | 0.9981 | 0.9084 | 0.3529 | nan | 0.9063 | 0.996 | | res2net101_26w_4s | 64 | 0.994 | 0.9149 | 0.3339 | nan | 0.8973 | 0.9734 | | gluon_xception65 | 32 | 0.9955 | 0.8848 | 0.3346 | nan | 0.8967 | 0.9753 | | gluon_inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | 0.6582 | 0.8967 | 1.0255 | | inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | 0.6582 | 0.8967 | 1.0257 | | adv_inception_v3 | 128 | 0.9816 | 0.862 | 0.3342 | 0.6582 | 0.8967 | 1.0255 | | hrnet_w18 | 128 | 0.9914 | 0.9175 | 0.3348 | nan | 0.8966 | 1.0033 | | fbnetc_100 | 128 | 0.9783 | 0.8475 | 0.33 | nan | 0.8954 | 0.9865 | | selecsls42b | 128 | 0.9796 | 0.8773 | 0.3533 | 0.7509 | 0.8919 | 0.9903 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 0.8683 | 0.8916 | 0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 0.8697 | 0.8911 | 0.8962 | | spnasnet_100 | 128 | 0.9789 | 0.8799 | 0.3346 | 0.6734 | 0.8798 | 0.9821 | | res2net50_14w_8s | 128 | 0.9907 | 0.907 | 0.3231 | 0.7788 | 0.8764 | 0.9737 | | convnext_base | 64 | 1.003 | 0.9263 | nan | nan | 0.8763 | 0.9864 | | res2next50 | 128 | 0.991 | 0.9094 | 0.3201 | 0.7727 | 0.8709 | 0.9666 | | mnasnet_100 | 128 | 0.976 | 0.8703 | 0.335 | 0.654 | 0.8707 | 0.98 | | mixnet_l | 128 | 0.99 | 0.8442 | 0.2717 | nan | 0.8702 | 1.0089 | | gernet_l | 128 | 0.9791 | 0.85 | 0.3444 | 0.7405 | 0.8617 | 0.9861 | | cspdarknet53 | 64 | 0.9914 | 0.8402 | 0.324 | 0.6522 | 0.8604 | 1.0102 | | botnet26t_256 | 128 | 0.9849 | 0.8639 | 0.3308 | nan | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.943 | 0.7564 | 0.3361 | 0.676 | 0.8449 | 0.9433 | | regnety_002 | 128 | 0.9509 | 0.7946 | 0.3398 | 0.5703 | 0.8352 | 1.0081 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | nan | 0.8174 | 1.0976 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | nan | 0.8032 | 
1.0344 | | repvgg_a2 | 128 | 0.9769 | 0.7822 | 0.3409 | nan | 0.7908 | 0.9914 | | resmlp_12_224 | 128 | 0.9827 | 0.687 | 0.2373 | 0.6208 | 0.7876 | 0.8011 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 | | sebotnet33ts_256 | 64 | 0.9929 | 0.7076 | 0.3212 | 0.577 | 0.7451 | 0.8294 | | jx_nest_base | 32 | 0.9983 | 0.8927 | nan | nan | 0.6707 | 0.8618 | | eca_halonext26ts | 128 | 0.9885 | 0.7747 | 0.2672 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/A9ipaez.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/5vpNnBE.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/FSKoGdt.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs, and each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint compression ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When computing speedup, compilation latency and peak memory compression ratio, we exclude the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 53/55 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 96%, 53/55 | 100%, 43/43 | 97%, 59/61  |
|     aot_cudagraphs     | 82%, 45/55 | 77%, 33/43  | 44%, 27/61  |
|    nvprims_nvfuser     | 55%, 30/55 | 93%, 40/43  | 31%, 19/61  |
|        inductor        | 85%, 47/55 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 93%, 51/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
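
Each cell above pairs a rounded percentage with a passed/total count. A minimal sketch of producing that string follows. The helper name `passrate` and the rule that any status starting with `pass` (including `pass_due_to_skip`) counts as passing are assumptions made for illustration, not the dashboard's confirmed logic.

```python
def passrate(results):
    """Format a pass rate like the table cells, e.g. '96%, 53/55'.

    `results` maps model name -> accuracy status string.
    Assumption: statuses beginning with 'pass' count as passing.
    """
    total = len(results)
    passed = sum(1 for status in results.values() if status.startswith("pass"))
    return f"{100 * passed / total:.0f}%, {passed}/{total}"


# Toy statuses using the labels that appear in the accuracy tables.
example = {
    "model_a": "pass",
    "model_b": "pass_due_to_skip",
    "model_c": "fail_accuracy",
    "model_d": "fail_to_run",
}
print(passrate(example))
```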

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.11x    |    1.04x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.03x    |    1.11x    |
|        inductor        |   1.50x    |    1.29x    |    1.25x    |
| inductor_no_cudagraphs |   1.24x    |    1.22x    |    1.23x    |
+------------------------+------------+-------------+-------------+
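
As a rough illustration of how a geometric mean speedup can be computed once failing models are excluded, here is a sketch. The function name `geomean_speedup` is hypothetical, and treating a speedup of 0.0 as "did not run" is an assumption based on how failures appear in the per-model tables. The sample values are copied from the torchbench float32 inductor column.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups over models that pass accuracy.

    Assumptions of this sketch: a status starting with 'pass' means the
    model is kept, and a speedup of 0.0 marks a run that did not complete.
    """
    kept = [
        s
        for name, s in speedups.items()
        if accuracy.get(name, "").startswith("pass") and s > 0
    ]
    if not kept:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow on long lists.
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


speedups = {
    "densenet121": 5.2693,
    "resnet18": 1.8543,
    "tacotron2": 0.0,
    "resnet50_quantized_qat": 1.3795,
}
accuracy = {
    "densenet121": "pass",
    "resnet18": "pass",
    "tacotron2": "fail_to_run",
    "resnet50_quantized_qat": "fail_accuracy",
}
result = geomean_speedup(speedups, accuracy)
```

Only `densenet121` and `resnet18` contribute here; the other two are filtered out by their accuracy status.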

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.16    |    2.43     |    1.91     |
|       aot_eager        |    5.77    |    7.84     |    7.05     |
|     aot_cudagraphs     |    8.60    |    16.10    |    13.16    |
|    nvprims_nvfuser     |   73.63    |   109.11    |   124.35    |
|        inductor        |   29.31    |    29.54    |    34.71    |
| inductor_no_cudagraphs |   28.61    |    25.45    |    33.28    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.85x    |    0.87x    |    0.84x    |
|        inductor        |   0.87x    |    0.72x    |    0.98x    |
| inductor_no_cudagraphs |   1.01x    |    0.96x    |    1.09x    |
+------------------------+------------+-------------+-------------+
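
The source does not spell out the formula behind this ratio. One reading consistent with "higher is better" is eager peak memory divided by compiled peak memory, which the sketch below assumes. The function name and the byte counts are hypothetical.

```python
def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Assumed definition: eager peak memory / compiled peak memory.

    A value above 1.0 means the compiled run peaked lower than eager.
    """
    if compiled_peak_bytes <= 0:
        raise ValueError("compiled peak must be positive")
    return eager_peak_bytes / compiled_peak_bytes


# Hypothetical byte counts, not measurements from the dashboard.
ratio = compression_ratio(eager_peak_bytes=8 * 2**30, compiled_peak_bytes=10 * 2**30)
```

Under this reading, the 0.39x reported for aot_cudagraphs on torchbench would mean the compiled run peaked at roughly 2.6 times the eager footprint.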

Warnings

Performance speedup warnings
~~~
+-------------+------------------------+----------+------------------------+
| suite       | name                   | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench  | lennard_jones          | 1.7378   | 0.9441                 |
| torchbench  | soft_actor_critic      | 1.4286   | 0.9322                 |
| torchbench  | nvidia_deeprecommender | 0.9036   | 0.9642                 |
| torchbench  | dlrm                   | 0.0      | 1.0444                 |
| torchbench  | hf_GPT2_large          | 0.0      | 1.4742                 |
| torchbench  | hf_T5                  | 0.0      | 1.5685                 |
| torchbench  | tacotron2              | 0.0      | 0.9028                 |
| torchbench  | hf_Longformer          | 0.0      | 0.0                    |
| torchbench  | moco                   | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase  | 0.0      | 0.0                    |
| timm_models | resmlp_12_224          | 0.7921   | 0.8299                 |
| timm_models | tnt_s_patch16_224      | 0.0      | 1.5428                 |
+-------------+------------------------+----------+------------------------+
~~~
Compilation latency (sec) warnings
~~~
+-------------+-----------------------+----------+------------------------+
| suite       | name                  | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  | yolov3                | 371.9531 | 363.8208               |
| torchbench  | timm_efficientdet     | 122.8743 | 119.0122               |
| torchbench  | tacotron2             | nan      | 63.1371                |
| torchbench  | hf_GPT2_large         | nan      | 41.0449                |
| torchbench  | hf_T5                 | nan      | 26.4199                |
| torchbench  | dlrm                  | nan      | 2.9103                 |
| torchbench  | hf_Longformer         | nan      | nan                    |
| torchbench  | moco                  | nan      | nan                    |
| huggingface | AllenaiLongformerBase | nan      | nan                    |
| timm_models | tnt_s_patch16_224     | nan      | 31.7087                |
+-------------+-----------------------+----------+------------------------+
~~~
Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0018                 |
| torchbench  | hf_Albert                               | 0.8836   | 1.2212                 |
| torchbench  | mobilenet_v3_large                      | 0.8829   | 0.896                  |
| torchbench  | hf_T5_large                             | 0.8737   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8622   | 1.0312                 |
| torchbench  | resnet50                                | 0.8564   | 0.9343                 |
| torchbench  | densenet121                             | 0.8562   | 1.0006                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.1229                 |
| torchbench  | hf_Bart                                 | 0.8318   | 1.1277                 |
| torchbench  | resnext50_32x4d                         | 0.8302   | 0.8356                 |
| torchbench  | BERT_pytorch                            | 0.826    | 1.0815                 |
| torchbench  | hf_BigBird                              | 0.8211   | 1.0391                 |
| torchbench  | dcgan                                   | 0.767    | 0.8875                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7609   | 0.9526                 |
| torchbench  | timm_vision_transformer                 | 0.7517   | 0.8216                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8335                 |
| torchbench  | hf_Bert                                 | 0.7062   | 1.0016                 |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | LearningToPaint                         | 0.6889   | 0.916                  |
| torchbench  | vgg16                                   | 0.6637   | 0.9553                 |
| torchbench  | hf_DistilBert                           | 0.6595   | 0.9466                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | hf_Reformer                             | 0.5232   | 0.9892                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4867   | 0.6781                 |
| torchbench  | pytorch_struct                          | 0.4222   | 0.4335                 |
| torchbench  | functorch_dp_cifar10                    | 0.4056   | 0.4214                 |
| torchbench  | tacotron2                               | nan      | 1.1623                 |
| torchbench  | hf_T5                                   | nan      | 1.1507                 |
| torchbench  | hf_GPT2_large                           | nan      | 1.1258                 |
| torchbench  | dlrm                                    | nan      | 0.7306                 |
| torchbench  | hf_Longformer                           | nan      | nan                    |
| torchbench  | moco                                    | nan      | nan                    |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4039                 |
| huggingface | T5Small                                 | 0.8453   | 1.0606                 |
| huggingface | PegasusForConditionalGeneration         | 0.8436   | 1.0204                 |
| huggingface | AlbertForMaskedLM                       | 0.842    | 1.3737                 |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.1049                 |
| huggingface | BigBird                                 | 0.821    | 1.0085                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | M2M100ForConditionalGeneration          | 0.8138   | 1.0093                 |
| huggingface | DistillGPT2                             | 0.8057   | 0.9257                 |
| huggingface | ElectraForCausalLM                      | 0.7929   | 0.9036                 |
| huggingface | YituTechConvBert                        | 0.7888   | 0.8725                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.931                  |
| huggingface | BartForConditionalGeneration            | 0.7734   | 0.9515                 |
| huggingface | GoogleFnet                              | 0.7698   | 0.9372                 |
| huggingface | MT5ForConditionalGeneration             | 0.763    | 0.9406                 |
| huggingface | MegatronBertForQuestionAnswering        | 0.7528   | 0.9646                 |
| huggingface | CamemBert                               | 0.7487   | 0.9186                 |
| huggingface | PLBartForCausalLM                       | 0.7381   | 0.9055                 |
| huggingface | PLBartForConditionalGeneration          | 0.7238   | 0.9373                 |
| huggingface | MBartForConditionalGeneration           | 0.7209   | 0.9059                 |
| huggingface | LayoutLMForSequenceClassification       | 0.7189   | 1.0294                 |
| huggingface | MegatronBertForCausalLM                 | 0.7161   | 0.9247                 |
| huggingface | BartForCausalLM                         | 0.7149   | 0.9466                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.7147   | 0.8647                 |
| huggingface | ElectraForQuestionAnswering             | 0.7054   | 1.0298                 |
| huggingface | DistilBertForQuestionAnswering          | 0.6981   | 0.9303                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977   | 0.946                  |
| huggingface | LayoutLMForMaskedLM                     | 0.695    | 0.9772                 |
| huggingface | MBartForCausalLM                        | 0.6836   | 0.8978                 |
| huggingface | TrOCRForCausalLM                        | 0.6827   | 0.8876                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.6775   | 0.9179                 |
| huggingface | OPTForCausalLM                          | 0.6764   | 0.8848                 |
| huggingface | DistilBertForMaskedLM                   | 0.6531   | 0.9124                 |
| huggingface | BertForMaskedLM                         | 0.6385   | 0.8992                 |
| huggingface | RobertaForCausalLM                      | 0.6375   | 0.8974                 |
| huggingface | BertForQuestionAnswering                | 0.6329   | 0.8939                 |
| huggingface | RobertaForQuestionAnswering             | 0.6329   | 0.8939                 |
| huggingface | MobileBertForMaskedLM                   | 0.5256   | 0.7111                 |
| huggingface | MobileBertForQuestionAnswering          | 0.4536   | 0.5968                 |
| huggingface | DebertaForMaskedLM                      | 0.386    | 1.0347                 |
| huggingface | DebertaForQuestionAnswering             | 0.2902   | 1.1588                 |
| huggingface | AllenaiLongformerBase                   | nan      | nan                    |
| timm_models | selecsls42b                             | 0.899    | 1.0046                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8932   | 0.9946                 |
| timm_models | res2net50_14w_8s                        | 0.8821   | 1.0206                 |
| timm_models | regnety_002                             | 0.8617   | 1.0396                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.9622                 |
| timm_models | pit_b_224                               | 0.8563   | 1.0752                 |
| timm_models | sebotnet33ts_256                        | 0.841    | 0.9709                 |
| timm_models | coat_lite_mini                          | 0.821    | 1.0246                 |
| timm_models | gernet_l                                | 0.7928   | 0.9926                 |
| timm_models | resmlp_12_224                           | 0.7899   | 0.7979                 |
| timm_models | repvgg_a2                               | 0.7684   | 0.9902                 |
| timm_models | convit_base                             | 0.7462   | 0.9008                 |
| timm_models | crossvit_9_240                          | 0.6584   | 0.8853                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.8622                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
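
The warning tables list models where either inductor variant is missing a value or looks poor on a metric. The source does not state the cutoffs, so the sketch below uses hypothetical thresholds (`SPEEDUP_FLOOR`, `LATENCY_CEILING_SEC`) purely for illustration. `flag_rows` is likewise a made-up helper, not the dashboard's code. The sample values are taken from the tables in this comment.

```python
import math


def flag_rows(rows, is_bad):
    """Return names of rows where any value is nan or satisfies is_bad."""
    flagged = []
    for name, values in rows.items():
        if any(math.isnan(v) or is_bad(v) for v in values):
            flagged.append(name)
    return flagged


# Hypothetical cutoffs, picked only for illustration.
SPEEDUP_FLOOR = 0.95
LATENCY_CEILING_SEC = 120.0

# Each tuple is (inductor, inductor_no_cudagraphs).
speedups = {
    "lennard_jones": (1.7378, 0.9441),
    "densenet121": (5.2693, 1.2666),
}
latencies = {
    "yolov3": (371.9531, 363.8208),
    "tacotron2": (float("nan"), 63.1371),
    "resnet18": (10.6175, 10.2896),
}

slow = flag_rows(speedups, lambda v: v < SPEEDUP_FLOOR)
long_compiles = flag_rows(latencies, lambda v: v > LATENCY_CEILING_SEC)
```

Passing the comparison as a callable keeps one helper usable for metrics where lower is worse (speedup, memory ratio) and metrics where higher is worse (compilation latency).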

Metrics over time

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/BYJJ8qR.png)

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/9SJjmMx.png)

Accuracy Regressions

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0008 | 1.0057 | 2.3434 | 0.0 | 5.2693 | 1.2666 | | timm_efficientdet | 1 | 0.9803 | 0.8926 | 1.8373 | 0.0 | 4.2948 | 1.5047 | | functorch_dp_cifar10 | 64 | 1.0098 | 1.0288 | 2.1432 | 0.0 | 3.7607 | 1.2459 | | timm_vision_transformer | 8 | 1.0061 | 0.9367 | 1.5235 | 0.6774 | 2.597 | 1.4078 | | drq | 1 | 1.0063 | 0.8655 | 1.66 | 0.701 | 2.4435 | 1.064 | | BERT_pytorch | 16 | 1.0128 | 0.888 | 1.11 | 0.9921 | 2.0945 | 2.1387 | | resnext50_32x4d | 8 | 1.0028 | 1.1006 | 1.2921 | 0.0 | 2.0234 | 1.192 | | mobilenet_v3_large | 32 | 1.0036 | 1.1076 | 1.0129 | 0.0 | 1.9873 | 1.3401 | | resnet18 | 16 | 1.0019 | 1.1088 | 1.148 | 0.0 | 1.8543 | 1.2494 | | pytorch_struct | 200 | 0.9969 | 0.7519 | 0.8876 | 0.8095 | 1.8197 | 1.1619 | | squeezenet1_1 | 32 | 0.9946 | 1.0094 | 1.0664 | 0.8555 | 1.7465 | 1.2652 | | lennard_jones | 1000 | 0.9615 | 0.8552 | 1.0328 | 0.6864 | 1.7378 | 0.9441 | | hf_T5_large | 2 | 1.0245 | 0.9081 | 0.0 | 0.9845 | 1.6753 | 1.9295 | | dcgan | 32 | 0.9805 | 1.0136 | 1.2702 | 0.7708 | 1.6664 | 1.0562 | | hf_Albert | 8 | 1.0012 | 0.9963 | 0.7507 | 1.4773 | 1.6427 | 1.6398 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9993 | 1.0074 | 1.3055 | 0.8421 | 1.6241 | 1.3441 | | speech_transformer | 32 | 1.0061 | 0.9316 | 1.5091 | 0.8117 | 1.5487 | 1.5451 | | shufflenet_v2_x1_0 | 128 | 1.0027 | 1.0438 | 0.8067 | 0.0 | 1.5411 | 1.3854 | | timm_resnest | 32 | 0.9992 | 1.0022 | 0.8044 | 0.0 | 1.5171 | 1.4537 | | hf_GPT2 | 4 | 1.0075 | 0.9813 | 0.7396 | 0.4168 | 1.4972 | 1.4989 | | timm_nfnet | 128 | 0.9995 | 1.0001 | 0.0 | 1.2476 | 1.4723 | 1.4237 | | 
mnasnet1_0 | 32 | 1.001 | 1.0946 | 0.8619 | 0.0 | 1.4645 | 1.2723 | | mobilenet_v2_quantized_qat | 96 | 1.0015 | 0.9797 | 0.0 | 0.0 | 1.4301 | 1.4311 | | mobilenet_v2 | 96 | 0.9996 | 0.9989 | 0.7294 | 0.0 | 1.4289 | 1.4017 | | soft_actor_critic | 256 | 0.9774 | 0.8054 | 1.0894 | 0.6863 | 1.4286 | 0.9322 | | fastNLP_Bert | 6 | 0.999 | 0.9764 | 0.7511 | 1.1759 | 1.4211 | 1.3917 | | resnet50_quantized_qat | 32 | 1.0004 | 0.973 | 0.0 | 0.0 | 1.3795 | 1.3803 | | timm_efficientnet | 32 | 0.9541 | 0.8118 | 0.6972 | 0.0 | 1.3538 | 1.195 | | LearningToPaint | 96 | 1.0012 | 1.049 | 0.8596 | 0.0 | 1.2663 | 1.1859 | | pytorch_stargan | 16 | 0.9991 | 1.0766 | 0.933 | 0.0 | 1.2614 | 1.2286 | | resnet50 | 32 | 0.999 | 0.9921 | 0.7608 | 0.0 | 1.2048 | 1.1686 | | hf_Bart | 4 | 1.0124 | 0.973 | 0.7858 | 0.7878 | 1.2029 | 1.1957 | | pytorch_unet | 1 | 0.9997 | 0.9975 | 0.8467 | 0.0 | 1.202 | 1.186 | | hf_Bert | 4 | 1.0216 | 0.9963 | 0.7315 | 0.9151 | 1.2011 | 1.1818 | | Super_SloMo | 6 | 0.9999 | 0.9982 | 0.8674 | 1.0023 | 1.1813 | 1.1645 | | hf_DistilBert | 8 | 1.0008 | 0.9567 | 0.6866 | 0.5228 | 1.1729 | 1.1789 | | vgg16 | 64 | 0.9998 | 0.999 | 0.8595 | 0.9977 | 1.1722 | 1.1668 | | alexnet | 128 | 0.999 | 0.9971 | 0.8025 | 1.0043 | 1.1602 | 1.1631 | | hf_Reformer | 4 | 0.9984 | 1.0012 | 0.9881 | 0.0 | 1.1311 | 1.14 | | timm_regnet | 32 | 0.9637 | 0.9603 | 0.7797 | 0.0 | 1.126 | 1.0908 | | Background_Matting | 4 | 1.0001 | 1.0212 | 0.8682 | 0.0 | 1.1155 | 1.1072 | | yolov3 | 16 | 1.0 | 0.9945 | 0.7916 | 1.2029 | 1.0913 | 1.0786 | | hf_BigBird | 2 | 0.9873 | 0.9345 | 0.9709 | 0.9006 | 1.0887 | 0.9962 | | attention_is_all_you_need_pytorch | 256 | 1.0003 | 0.968 | 0.756 | 0.9804 | 1.0642 | 1.0483 | | timm_vision_transformer_large | 8 | 0.9993 | 0.9953 | 0.0 | 0.976 | 1.0492 | 1.0361 | | timm_vovnet | 32 | 0.9089 | 0.9042 | 0.7153 | 0.0 | 1.007 | 1.0165 | | tts_angular | 64 | 0.9884 | 0.9598 | 0.9853 | 0.9695 | 1.0069 | 1.0177 | | demucs | 4 | 0.9995 | 0.9998 | 0.9996 | 1.0002 | 1.0002 
| 1.0002 | | nvidia_deeprecommender | 256 | 0.9987 | 0.963 | 0.5847 | 0.976 | 0.9036 | 0.9642 | | dlrm | 2048 | 0.0 | 1.0515 | 0.0 | 0.9973 | 0.0 | 1.0444 | | hf_GPT2_large | 4 | 0.9991 | 0.9798 | 0.0 | 0.5989 | 0.0 | 1.4742 | | hf_T5 | 8 | 0.9993 | 0.953 | 0.0 | 1.247 | 0.0 | 1.5685 | | tacotron2 | 64 | 0.9754 | 0.8418 | 0.0 | 0.0 | 0.0 | 0.9028 | | hf_Longformer | 2 | 0.9473 | 0.8798 | 0.8034 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | 
pass | | timm_efficientdet | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | hf_Albert | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | 
pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.8614 | 7.0158 | 10.0377 | 109.6599 | 371.9531 | 363.8208 | | timm_efficientdet | 1 | 19.224 | 33.2178 | 66.224 | nan | 122.8743 | 119.0122 | | hf_T5_large | 2 | 13.8547 | 35.3758 | nan | 426.7214 | 102.4023 | 100.2926 | | timm_vision_transformer_large | 8 | 2.2387 | 11.1578 
| nan | 253.043 | 50.511 | 49.0287 | | attention_is_all_you_need_pytorch | 256 | 1.1049 | 5.4952 | 8.92 | 108.9814 | 45.5129 | 44.4213 | | densenet121 | 4 | 2.0417 | 9.6272 | 15.6198 | nan | 41.6513 | 40.5756 | | timm_resnest | 32 | 0.5392 | 2.0095 | 3.0833 | nan | 39.8511 | 38.54 | | hf_BigBird | 2 | 7.4753 | 12.9008 | 25.7752 | 84.9528 | 37.7178 | 25.416 | | timm_vision_transformer | 8 | 0.7547 | 3.4535 | 4.9656 | 61.655 | 32.2756 | 29.7435 | | hf_Bart | 4 | 1.573 | 6.4352 | 10.845 | 118.5196 | 28.5612 | 27.4618 | | timm_nfnet | 128 | 1.914 | 6.2307 | nan | 131.6158 | 27.2858 | 27.0435 | | BERT_pytorch | 16 | 1.4301 | 5.9278 | 8.9954 | 83.4438 | 26.7428 | 26.3124 | | pytorch_stargan | 16 | 0.3876 | 1.7235 | 2.509 | nan | 26.573 | 26.3066 | | resnet50_quantized_qat | 32 | 1.1032 | 7.0465 | nan | nan | 26.3433 | 26.4722 | | mobilenet_v2_quantized_qat | 96 | 1.2571 | 7.2017 | nan | nan | 25.989 | 25.9592 | | fastNLP_Bert | 6 | 1.4423 | 5.23 | 9.1513 | 88.2481 | 25.6569 | 24.2144 | | speech_transformer | 32 | 1.607 | 6.8204 | 25.7941 | 117.8391 | 25.4129 | 25.0411 | | timm_regnet | 32 | 2.2012 | 6.5009 | 17.8336 | nan | 23.0356 | 22.88 | | mobilenet_v3_large | 32 | 0.8264 | 3.8889 | 5.7435 | nan | 22.7694 | 22.1405 | | timm_efficientnet | 32 | 1.6793 | 5.6688 | 13.8038 | nan | 22.1784 | 21.7219 | | pytorch_struct | 200 | 0.2413 | 0.6161 | 1.1654 | 4.0189 | 19.5008 | 18.2188 | | hf_Reformer | 4 | 1.6925 | 2.885 | 5.6044 | nan | 19.2174 | 15.9965 | | hf_Bert | 4 | 1.5142 | 5.2937 | 7.9301 | 89.0286 | 18.2225 | 17.5742 | | mnasnet1_0 | 32 | 0.763 | 3.4271 | 5.2587 | nan | 18.0671 | 17.6162 | | shufflenet_v2_x1_0 | 128 | 0.9168 | 4.0663 | 6.2239 | nan | 17.7175 | 16.8712 | | timm_vovnet | 32 | 1.4409 | 3.7788 | 8.8736 | nan | 17.5028 | 17.2754 | | resnet50 | 32 | 0.8201 | 3.7567 | 5.5967 | nan | 17.4673 | 16.9844 | | hf_Albert | 8 | 1.1841 | 4.5928 | 7.5068 | 103.8845 | 17.215 | 16.4293 | | resnext50_32x4d | 8 | 0.8406 | 3.7221 | 5.762 | nan | 16.9006 | 16.3333 | | 
hf_GPT2 | 4 | 1.4463 | 5.1416 | 7.63 | 69.0378 | 16.7157 | 16.1243 | | Super_SloMo | 6 | 0.9714 | 3.9762 | 5.5713 | 32.2723 | 16.4381 | 15.5588 | | Background_Matting | 4 | 0.6921 | 3.5676 | 5.501 | nan | 15.9924 | 15.0031 | | mobilenet_v2 | 96 | 0.7311 | 3.7079 | 5.8611 | nan | 15.8456 | 16.0991 | | functorch_dp_cifar10 | 64 | 0.3423 | 1.3407 | 2.0217 | nan | 12.2127 | 12.3331 | | hf_DistilBert | 8 | 0.6109 | 2.5533 | 4.5332 | 40.5139 | 11.6684 | 11.4337 | | resnet18 | 16 | 0.3851 | 1.4827 | 2.1284 | nan | 10.6175 | 10.2896 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.3667 | 1.5555 | 2.2852 | 30.6837 | 7.8733 | 7.6621 | | pytorch_unet | 1 | 0.4249 | 1.6126 | 2.4816 | nan | 7.7689 | 7.4441 | | LearningToPaint | 96 | 0.4226 | 1.5308 | 2.3427 | nan | 6.8033 | 6.6982 | | squeezenet1_1 | 32 | 0.1909 | 0.659 | 1.0197 | 4.2894 | 3.9135 | 3.5092 | | drq | 1 | 0.2866 | 0.5031 | 0.8449 | 4.0736 | 3.653 | 3.3213 | | soft_actor_critic | 256 | 0.2006 | 0.2947 | 0.5216 | 1.515 | 3.364 | 2.8142 | | vgg16 | 64 | 0.186 | 0.4632 | 0.8377 | 2.7182 | 3.3332 | 3.2707 | | nvidia_deeprecommender | 256 | 0.1909 | 0.3714 | 0.6361 | 4.5277 | 3.213 | 2.9493 | | alexnet | 128 | 0.1474 | 0.3139 | 0.5577 | 2.9115 | 2.8864 | 2.6028 | | dcgan | 32 | 0.1651 | 0.3577 | 0.5697 | 4.2487 | 2.5997 | 2.3809 | | lennard_jones | 1000 | 0.1361 | 0.2436 | 0.3939 | 1.2155 | 2.0081 | 1.7488 | | tts_angular | 64 | 0.2053 | 0.2465 | 0.3741 | 1.0179 | 1.8876 | 1.7878 | | demucs | 4 | 0.2968 | 0.2938 | 0.3021 | 0.2903 | 0.204 | 0.2033 | | tacotron2 | 64 | 17.3452 | 29.1381 | nan | nan | nan | 63.1371 | | hf_GPT2_large | 4 | 5.1006 | 15.8775 | nan | 231.6096 | nan | 41.0449 | | hf_T5 | 8 | 2.4009 | 7.6274 | nan | 67.3711 | nan | 26.4199 | | dlrm | 2048 | nan | 0.7163 | nan | 2.7078 | nan | 2.9103 | | hf_Longformer | 2 | 6.1844 | 12.9431 | 57.4587 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name                              | bs   | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| mobilenet_v2_quantized_qat        | 96   | 0.9957 | 0.8276    | nan            | nan             | 1.5819   | 1.5819                 |
| resnet50_quantized_qat            | 32   | 0.9967 | 0.9152    | nan            | nan             | 1.4874   | 1.4867                 |
| timm_efficientnet                 | 32   | 0.9937 | 0.7666    | 0.2634         | nan             | 1.3107   | 1.3923                 |
| Super_SloMo                       | 6    | 1.0024 | 0.9527    | 0.3631         | 0.9528          | 1.2027   | 1.4002                 |
| mobilenet_v2                      | 96   | 0.9928 | 0.7624    | 0.3062         | nan             | 1.1743   | 1.2832                 |
| timm_efficientdet                 | 1    | 1.011  | 0.823     | 0.289          | nan             | 1.1162   | 1.1442                 |
| squeezenet1_1                     | 32   | 0.9749 | 0.8159    | 0.3373         | 0.8136          | 1.0823   | 1.1864                 |
| speech_transformer                | 32   | 0.9977 | 0.9148    | 0.2708         | 0.8942          | 1.0389   | 1.0454                 |
| timm_nfnet                        | 128  | 0.936  | 0.8937    | nan            | 0.8898          | 1.0219   | 1.0963                 |
| demucs                            | 4    | 0.9886 | 0.9886    | 0.9886         | 0.9886          | 0.9886   | 0.9886                 |
| Background_Matting                | 4    | 0.9998 | 0.9492    | 0.3596         | nan             | 0.9832   | 1.0394                 |
| tts_angular                       | 64   | 0.9884 | 0.9884    | 0.9829         | 0.9884          | 0.983    | 0.9884                 |
| shufflenet_v2_x1_0                | 128  | 0.9739 | 0.8944    | 0.3499         | nan             | 0.9814   | 1.0418                 |
| hf_GPT2                           | 4    | 0.9548 | 0.906     | 0.3702         | 0.8845          | 0.9703   | 1.1374                 |
| timm_regnet                       | 32   | 0.9985 | 0.8614    | 0.3327         | nan             | 0.9406   | 1.0831                 |
| yolov3                            | 16   | 0.9957 | 0.844     | 0.3341         | 0.8182          | 0.9237   | 1.1052                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.9981 | 0.9166    | 0.3915         | 0.8952          | 0.9169   | 0.9991                 |
| pytorch_unet                      | 1    | 0.9985 | 0.8521    | 0.3441         | nan             | 0.9118   | 1.105                  |
| pytorch_stargan                   | 16   | 0.9975 | 1.0179    | 0.4129         | nan             | 0.9023   | 1.0693                 |
| timm_resnest                      | 32   | 0.9931 | 0.8807    | 0.3236         | nan             | 0.8982   | 1.0018                 |
| hf_Albert                         | 8    | 0.9332 | 0.9332    | 0.2846         | 0.7425          | 0.8836   | 1.2212                 |
| mobilenet_v3_large                | 32   | 0.9878 | 0.8563    | 0.3278         | nan             | 0.8829   | 0.896                  |
| hf_T5_large                       | 2    | 0.922  | 0.8673    | nan            | 0.8425          | 0.8737   | 0.922                  |
| timm_vision_transformer_large     | 8    | 0.9998 | 0.8416    | nan            | 0.8374          | 0.8622   | 1.0312                 |
| resnet50                          | 32   | 0.9942 | 0.8719    | 0.3368         | nan             | 0.8564   | 0.9343                 |
| densenet121                       | 4    | 0.9904 | 0.8812    | 0.3439         | nan             | 0.8562   | 1.0006                 |
| mnasnet1_0                        | 32   | 0.9869 | 0.8985    | 0.333          | nan             | 0.8531   | 0.8659                 |
| fastNLP_Bert                      | 6    | 1.0011 | 0.9152    | 0.3384         | 0.906           | 0.8354   | 1.1229                 |
| hf_Bart                           | 4    | 0.9617 | 0.8772    | 0.3385         | 0.8568          | 0.8318   | 1.1277                 |
| resnext50_32x4d                   | 8    | 0.9952 | 0.8668    | 0.3592         | nan             | 0.8302   | 0.8356                 |
| BERT_pytorch                      | 16   | 1.0    | 0.898     | 0.3505         | 0.8837          | 0.826    | 1.0815                 |
| hf_BigBird                        | 2    | 0.9608 | 0.9608    | 0.4299         | 0.9608          | 0.8211   | 1.0391                 |
| dcgan                             | 32   | 0.9754 | 0.7634    | 0.4581         | 0.7634          | 0.767    | 0.8875                 |
| drq                               | 1    | 0.987  | 0.8777    | 0.4252         | 0.8777          | 0.7632   | 0.8778                 |
| timm_vovnet                       | 32   | 0.9933 | 0.7603    | 0.3202         | nan             | 0.7609   | 0.9526                 |
| timm_vision_transformer           | 8    | 0.9943 | 0.8835    | 0.3313         | 0.8772          | 0.7517   | 0.8216                 |
| soft_actor_critic                 | 256  | 0.9997 | 0.9637    | 0.4355         | 0.9555          | 0.75     | 0.9991                 |
| alexnet                           | 128  | 0.9542 | 0.745     | 0.4163         | 0.7449          | 0.743    | 0.8335                 |
| hf_Bert                           | 4    | 0.9683 | 0.9018    | 0.3526         | 0.8929          | 0.7062   | 1.0016                 |
| resnet18                          | 16   | 0.9831 | 0.7792    | 0.3589         | nan             | 0.6902   | 0.7049                 |
| LearningToPaint                   | 96   | 0.9471 | 0.7168    | 0.3387         | nan             | 0.6889   | 0.916                  |
| vgg16                             | 64   | 0.9944 | 0.6638    | 0.3214         | 0.6638          | 0.6637   | 0.9553                 |
| hf_DistilBert                     | 8    | 0.9211 | 0.9047    | 0.3213         | 0.887           | 0.6595   | 0.9466                 |
| lennard_jones                     | 1000 | 0.9995 | 0.9995    | 0.3711         | 0.9995          | 0.5646   | 0.9989                 |
| nvidia_deeprecommender            | 256  | 0.5598 | 0.5598    | 0.4624         | 0.5598          | 0.5598   | 0.5598                 |
| hf_Reformer                       | 4    | 0.9872 | 0.9865    | 0.5793         | nan             | 0.5232   | 0.9892                 |
| attention_is_all_you_need_pytorch | 256  | 0.9476 | 0.9243    | 0.2963         | 0.9139          | 0.4867   | 0.6781                 |
| pytorch_struct                    | 200  | 1.0    | 0.5079    | 0.4824         | 0.5079          | 0.4222   | 0.4335                 |
| functorch_dp_cifar10              | 64   | 0.9961 | 0.8224    | 0.4445         | nan             | 0.4056   | 0.4214                 |
| tacotron2                         | 64   | 0.9906 | 1.0301    | nan            | nan             | nan      | 1.1623                 |
| hf_T5                             | 8    | 0.9527 | 0.9415    | nan            | 0.8724          | nan      | 1.1507                 |
| hf_GPT2_large                     | 4    | 0.936  | 0.8833    | nan            | 0.876           | nan      | 1.1258                 |
| dlrm                              | 2048 | nan    | 0.7306    | nan            | 0.7305          | nan      | 0.7306                 |
| hf_Longformer                     | 2    | 0.9603 | 0.9604    | 0.2944         | nan             | nan      | nan                    |
| moco                              | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | YituTechConvBert | 1 | 1.0344 | 0.8988 | 1.7609 | 0.7669 | 3.2462 | 1.4282 | | CamemBert | 1 | 1.0489 | 0.9111 | 1.3153 | 0.7487 | 2.3839 | 1.4892 | | MT5ForConditionalGeneration | 8 | 1.0249 | 0.9058 | 1.197 | 1.0478 | 2.2642 | 1.9968 | | DistillGPT2 | 1 | 1.0362 | 0.9281 | 1.0569 | 0.2843 | 2.1735 | 1.7704 | | MobileBertForMaskedLM | 32 | 1.0219 | 0.9277 | 1.1471 | 0.0 | 2.1432 | 1.5437 | | GoogleFnet | 1 | 0.9781 | 0.7916 | 0.9608 | 0.6787 | 1.8333 | 1.1417 | | GPT2ForSequenceClassification | 4 | 1.0001 | 0.9777 | 0.0 | 0.7332 | 1.796 | 1.7868 | | M2M100ForConditionalGeneration | 8 | 1.1668 | 0.8916 | 0.8688 | 0.8792 | 1.4677 | 1.3152 | | T5ForConditionalGeneration | 4 | 1.0045 | 0.9328 | 0.7238 | 1.1659 | 1.4575 | 1.4377 | | ElectraForQuestionAnswering | 64 | 1.0001 | 0.984 | 0.0 | 1.2717 | 1.4259 | 1.4061 | | ElectraForCausalLM | 32 | 1.0002 | 0.9308 | 0.0 | 1.0449 | 1.4126 | 1.447 | | MobileBertForQuestionAnswering | 64 | 1.0269 | 0.899 | 0.8661 | 0.0 | 1.4009 | 1.3149 | | LayoutLMForSequenceClassification | 16 | 0.9999 | 0.9888 | 0.7371 | 1.1677 | 1.3004 | 1.2892 | | T5Small | 1 | 1.0283 | 0.898 | 1.0214 | 1.0075 | 1.2743 | 1.1416 | | AlbertForQuestionAnswering | 4 | 1.0013 | 1.0016 | 0.0 | 1.2136 | 1.2615 | 1.259 | | AlbertForMaskedLM | 4 | 1.0002 | 0.9995 | 0.0 | 1.2086 | 1.2555 | 1.2542 | | LayoutLMForMaskedLM | 16 | 0.9999 | 0.9694 | 0.0 | 1.0981 | 1.2117 | 1.2128 | | PLBartForConditionalGeneration | 16 | 1.0171 | 0.9677 | 0.82 | 0.8295 | 1.2074 | 1.2039 | | OPTForCausalLM | 32 | 1.001 | 0.9321 | 0.7133 | 0.4583 | 1.1814 | 1.2322 | | 
XGLMForCausalLM | 8 | 1.0134 | 0.8793 | 0.7416 | 0.3262 | 1.1703 | 1.183 | | DistilBertForQuestionAnswering | 64 | 0.9996 | 0.985 | 0.713 | 0.5283 | 1.1701 | 1.151 | | RobertaForCausalLM | 64 | 1.0005 | 0.9613 | 0.7458 | 0.9897 | 1.1479 | 1.1508 | | MegatronBertForQuestionAnswering | 16 | 1.0391 | 1.0134 | 0.7678 | 0.904 | 1.1423 | 1.1242 | | Speech2Text2ForCausalLM | 128 | 0.9987 | 0.9247 | 0.6616 | 0.9473 | 1.1342 | 1.152 | | MegatronBertForCausalLM | 16 | 1.0352 | 1.0109 | 0.7389 | 0.9715 | 1.1289 | 1.1169 | | BertForQuestionAnswering | 128 | 1.0003 | 0.9934 | 0.0 | 1.0534 | 1.1144 | 1.1076 | | RobertaForQuestionAnswering | 128 | 1.0002 | 0.9929 | 0.0 | 1.0538 | 1.1124 | 1.1142 | | BartForConditionalGeneration | 2 | 1.0002 | 0.9869 | 0.0 | 0.4455 | 1.1005 | 1.0887 | | BartForCausalLM | 4 | 1.0008 | 0.9659 | 0.7558 | 1.0034 | 1.0903 | 1.1102 | | BigBird | 1 | 0.9842 | 0.9253 | 0.9888 | 0.8937 | 1.0902 | 0.9951 | | PegasusForConditionalGeneration | 16 | 1.01 | 0.9642 | 0.7552 | 0.9091 | 1.0885 | 1.0682 | | MBartForConditionalGeneration | 16 | 1.0101 | 0.9844 | 0.7644 | 0.9354 | 1.0882 | 1.1586 | | DebertaForMaskedLM | 4 | 0.9045 | 0.7846 | 0.723 | 0.6431 | 1.0785 | 1.0406 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0007 | 0.9255 | 0.0 | 0.9561 | 1.0642 | 1.0726 | | BertForMaskedLM | 64 | 1.0001 | 0.9609 | 0.7301 | 0.9877 | 1.0587 | 1.0605 | | DistilBertForMaskedLM | 64 | 0.9998 | 0.9507 | 0.7124 | 0.618 | 1.0496 | 1.0677 | | DebertaForQuestionAnswering | 8 | 0.996 | 0.966 | 0.6825 | 0.8678 | 1.0489 | 1.2207 | | PLBartForCausalLM | 32 | 1.0063 | 0.9333 | 0.718 | 0.9233 | 1.0279 | 1.0546 | | BlenderbotSmallForCausalLM | 64 | 1.0012 | 0.9104 | 0.6832 | 0.9228 | 1.0063 | 1.043 | | TrOCRForCausalLM | 32 | 1.0008 | 0.9558 | 0.7333 | 0.9509 | 1.0037 | 1.014 | | MBartForCausalLM | 32 | 1.0004 | 0.9539 | 0.7319 | 0.956 | 0.9984 | 1.0098 | | PegasusForCausalLM | 32 | 0.9994 | 0.9522 | 0.7318 | 0.9518 | 0.991 | 1.0027 | | AllenaiLongformerBase | 1 | 0.9248 | 0.8421 
| 0.7665 | 0.0 | 0.0 | 0.0 | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ | BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | T5Small | 1 | pass | pass | pass | pass | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | pass | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | pass | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass 
| | XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | BigBird | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | CamemBert | 1 | pass | pass | pass | pass | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | GoogleFnet | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | MBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | 
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaForQuestionAnswering | 8 | 4.8696 | 10.4793 | 34.4488 | 80.102 | 95.1799 | 33.9856 | | DebertaForMaskedLM | 4 | 4.8903 | 10.1684 | 39.0136 | 82.2904 | 89.5165 | 32.8486 | | XGLMForCausalLM | 8 | 2.4373 | 10.1151 | 22.2057 | 184.0347 | 67.472 | 64.5787 | | M2M100ForConditionalGeneration | 8 | 2.6255 | 12.8551 | 20.3877 | 240.1962 | 50.6846 | 53.7745 | | MobileBertForMaskedLM | 32 | 8.2909 | 24.9714 | 41.3933 | nan | 48.7079 | 47.9116 | | MobileBertForQuestionAnswering | 64 | 8.3929 | 23.8292 | 41.2316 | nan | 48.0873 | 48.1985 | | BartForConditionalGeneration | 2 | 3.0164 | 12.3913 | nan | 261.2649 | 43.1544 | 40.7754 | | PegasusForConditionalGeneration | 16 | 2.8098 | 12.217 | 20.3838 | 266.7147 | 42.5357 | 39.3584 | | MBartForConditionalGeneration | 16 | 3.0064 | 12.8568 | 22.1754 | 271.3029 | 41.265 | 39.9633 | | YituTechConvBert | 1 | 2.2847 | 8.4615 | 12.8935 | 128.5077 | 39.1252 | 36.8927 | | BigBird | 1 | 7.4673 | 13.2271 | 25.8571 | 97.2564 | 37.3978 | 24.4872 | | MegatronBertForCausalLM | 16 | 3.25 | 10.8935 | 16.6483 | 190.2921 | 32.5107 | 31.4699 | | MegatronBertForQuestionAnswering | 16 | 3.2629 | 10.8829 | 17.1363 | 188.8132 | 32.2158 | 30.6754 | | MT5ForConditionalGeneration | 8 | 3.7736 | 11.2854 | 17.9664 | 104.4138 | 31.3498 | 30.5518 | | T5ForConditionalGeneration | 4 | 2.4031 | 8.0927 | 12.7737 | 67.9725 | 29.6106 | 28.1192 | | BlenderbotSmallForConditionalGeneration | 64 | 1.9057 | 
8.3398 | nan | 164.3311 | 28.9149 | 27.9222 | | T5Small | 1 | 2.4009 | 7.7054 | 11.553 | 70.5699 | 28.2884 | 27.324 | | LayoutLMForSequenceClassification | 16 | 1.8371 | 5.7627 | 9.2001 | 90.5694 | 27.2105 | 25.9046 | | PLBartForConditionalGeneration | 16 | 1.6054 | 6.6586 | 10.115 | 117.193 | 25.7334 | 25.1247 | | ElectraForCausalLM | 32 | 1.5128 | 5.4868 | nan | 88.7785 | 25.6426 | 23.597 | | PegasusForCausalLM | 32 | 1.1507 | 4.9241 | 7.9631 | 86.0692 | 21.1082 | 19.9817 | | MBartForCausalLM | 32 | 1.1314 | 4.719 | 7.6295 | 89.0267 | 20.6058 | 20.1791 | | GoogleFnet | 1 | 0.9536 | 2.926 | 9.0179 | 70.125 | 20.3296 | 13.4172 | | LayoutLMForMaskedLM | 16 | 1.9758 | 5.8564 | nan | 87.4187 | 20.3206 | 19.4557 | | BertForMaskedLM | 64 | 1.5049 | 5.2893 | 7.9608 | 90.3134 | 19.7607 | 19.0687 | | TrOCRForCausalLM | 32 | 1.1652 | 4.9065 | 7.5491 | 89.377 | 19.5229 | 18.252 | | ElectraForQuestionAnswering | 64 | 1.495 | 5.3576 | nan | 87.376 | 19.2805 | 18.7191 | | RobertaForCausalLM | 64 | 1.4981 | 5.9284 | 8.172 | 90.7714 | 19.2058 | 18.4299 | | BertForQuestionAnswering | 128 | 1.4996 | 5.37 | nan | 86.7613 | 19.0338 | 18.2877 | | BartForCausalLM | 4 | 1.2393 | 4.7412 | 7.3513 | 89.429 | 18.9341 | 18.4051 | | RobertaForQuestionAnswering | 128 | 1.5276 | 5.5296 | nan | 89.6935 | 18.2219 | 17.4613 | | CamemBert | 1 | 1.5741 | 5.4813 | 7.5863 | 97.4246 | 17.7886 | 18.1791 | | OPTForCausalLM | 32 | 1.2069 | 4.8382 | 9.4313 | 85.7846 | 17.089 | 16.6391 | | GPT2ForSequenceClassification | 4 | 1.4922 | 5.3664 | nan | 70.7582 | 16.288 | 15.8247 | | AlbertForMaskedLM | 4 | 1.2941 | 4.7028 | nan | 103.089 | 16.2048 | 15.0298 | | AlbertForQuestionAnswering | 4 | 1.2907 | 4.7446 | nan | 100.8213 | 15.793 | 14.9597 | | Speech2Text2ForCausalLM | 128 | 0.7228 | 2.6601 | 4.147 | 36.6383 | 14.6927 | 13.351 | | BlenderbotSmallForCausalLM | 64 | 0.7996 | 3.2077 | 4.9571 | 54.4288 | 14.2352 | 13.6993 | | PLBartForCausalLM | 32 | 0.6579 | 2.7688 | 3.8771 | 42.7143 | 13.2429 | 13.0004 | | 
DistillGPT2 | 1 | 0.8116 | 2.6374 | 3.9301 | 39.9989 | 12.4299 | 12.0662 | | DistilBertForMaskedLM | 64 | 0.6267 | 2.6339 | 4.5363 | 42.5658 | 11.315 | 10.7331 | | DistilBertForQuestionAnswering | 64 | 0.6283 | 2.6887 | 4.4822 | 39.0327 | 10.7323 | 10.169 | | AllenaiLongformerBase | 1 | 6.2745 | 13.2036 | 57.3764 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 0.8955 | 1.0595 | 1.1224 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.5681 | 0.8646 | 1.4039 | | T5Small | 1 | 1.0 | 0.9029 | 0.3414 | 0.8577 | 0.8453 | 1.0606 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | 0.9642 | 0.8436 | 1.0204 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.5667 | 0.842 | 1.3737 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9093 | 0.8215 | 1.1049 | | BigBird | 1 | 0.9979 | 0.9536 | 0.4208 | 0.9117 | 0.821 | 1.0085 | | XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | 0.9267 | 0.8157 | 0.9642 | | M2M100ForConditionalGeneration | 8 | 1.0217 | 0.9507 | 0.3799 | 0.9742 | 0.8138 | 1.0093 | | DistillGPT2 | 1 | 0.9984 | 0.8113 | 0.3769 | 0.76 | 0.8057 | 0.9257 | | ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.7909 | 0.7929 | 0.9036 | | YituTechConvBert | 1 | 0.9863 | 0.8573 | 0.3681 | 0.8286 | 0.7888 | 0.8725 | | PegasusForCausalLM | 32 | 0.9594 | 0.8885 | 0.3909 | 0.9232 | 0.7774 | 0.931 | | BartForConditionalGeneration | 2 | 1.0 | 0.8935 | 
nan | 0.8866 | 0.7734 | 0.9515 | | GoogleFnet | 1 | 0.9979 | 0.9451 | 0.3715 | 0.9293 | 0.7698 | 0.9372 | | MT5ForConditionalGeneration | 8 | 1.0037 | 0.8873 | 0.4151 | 0.8853 | 0.763 | 0.9406 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | 0.8549 | 0.7528 | 0.9646 | | CamemBert | 1 | 0.998 | 0.8252 | 0.3612 | 0.7949 | 0.7487 | 0.9186 | | PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.861 | 0.7381 | 0.9055 | | PLBartForConditionalGeneration | 16 | 0.9998 | 0.8959 | 0.3581 | 0.872 | 0.7238 | 0.9373 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | 0.3438 | 0.8566 | 0.7209 | 0.9059 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 0.9204 | 0.7189 | 1.0294 | | MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | 0.8713 | 0.7161 | 0.9247 | | BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.8956 | 0.7149 | 0.9466 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.8401 | 0.7147 | 0.8647 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 0.9357 | 0.7054 | 1.0298 | | DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3178 | 0.8865 | 0.6981 | 0.9303 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 0.8975 | 0.6977 | 0.946 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.8883 | 0.695 | 0.9772 | | MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3743 | 0.89 | 0.6836 | 0.8978 | | TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | 0.3743 | 0.8898 | 0.6827 | 0.8876 | | Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.8765 | 0.6775 | 0.9179 | | OPTForCausalLM | 32 | 0.9982 | 0.8657 | 0.3606 | 0.7895 | 0.6764 | 0.8848 | | DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.8016 | 0.6531 | 0.9124 | | BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.855 | 0.6385 | 0.8992 | | RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3641 | 0.8538 | 0.6375 | 0.8974 | | BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.9303 | 0.6329 | 0.8939 | | RobertaForQuestionAnswering | 128 | 1.0 | 
0.968 | nan | 0.9303 | 0.6329 | 0.8939 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | 0.3242 | nan | 0.5256 | 0.7111 | | MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | 0.2587 | nan | 0.4536 | 0.5968 | | DebertaForMaskedLM | 4 | 1.0 | 0.9843 | 0.3552 | 0.9262 | 0.386 | 1.0347 | | DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.063 | 0.2902 | 1.1588 | | AllenaiLongformerBase | 1 | 0.9982 | 0.9521 | 0.3208 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~
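
For anyone who wants to post-process the tables in this comment, here is a minimal Python sketch that splits the `|`-delimited grid text into records and summarizes one column with a geometric mean. It works whether or not the rows sit on separate lines. The helper names `parse_flat_table` and `geomean` are hypothetical, the sample is trimmed to four columns and three rows of the huggingface speedup table, and treating `0.0` and `nan` as missing runs is an assumption (in the tables above, `0.0` speedups line up with `nan` compilation latency). The geometric mean is just one reasonable summary, not necessarily what the dashboard itself reports.

~~~python
import math

# Sample trimmed from the huggingface "Performance speedup" table above.
SAMPLE = (
    "+----+ | name | bs | eager | inductor | +----+ "
    "| YituTechConvBert | 1 | 1.0344 | 3.2462 | "
    "| CamemBert | 1 | 1.0489 | 2.3839 | "
    "| AllenaiLongformerBase | 1 | 0.9248 | 0.0 | +----+"
)


def parse_flat_table(text, ncols):
    """Turn grid-table text into a list of {column: cell} dicts."""
    cells = [c.strip() for c in text.split("|")]
    # Drop empty fragments between rows and the '+----+' border fragments.
    cells = [c for c in cells if c and "+-" not in c]
    rows = [cells[i:i + ncols] for i in range(0, len(cells), ncols)]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean(values):
    """Geometric mean over positive values; 0.0 and nan are skipped (assumption)."""
    vals = [v for v in values if v > 0]  # nan > 0 is False, so nan is skipped too
    if not vals:
        return float("nan")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


records = parse_flat_table(SAMPLE, ncols=4)
inductor_speedup = geomean(float(r["inductor"]) for r in records)
print(len(records), round(inductor_speedup, 4))
~~~

For the full tables, pass `ncols=8` and the text between a pair of `~~~` markers.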

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | ghostnet_100 | 128 | 0.9994 | 0.9731 | 0.8183 | 0.0 | 1.8718 | 1.8284 | | lcnet_050 | 128 | 0.9558 | 0.9489 | 0.7699 | 1.3477 | 1.6601 | 1.6228 | | regnety_002 | 128 | 0.9757 | 1.0017 | 0.8619 | 0.0 | 1.4928 | 1.3259 | | dm_nfnet_f0 | 128 | 0.9999 | 0.9997 | 0.0 | 1.2524 | 1.4716 | 1.4239 | | xcit_large_24_p8_224 | 5 | 1.0025 | 0.9839 | 0.7787 | 0.0 | 1.4359 | 1.3257 | | hrnet_w18 | 128 | 0.9999 | 0.9983 | 0.0 | 0.0 | 1.4165 | 1.3777 | | dla102 | 128 | 0.9999 | 1.0006 | 0.0 | 0.0 | 1.3836 | 1.3692 | | volo_d1_224 | 64 | 1.0 | 0.9945 | 0.802 | 0.0 | 1.3817 | 1.36 | | nfnet_l0 | 128 | 0.9996 | 0.789 | 0.0 | 1.2306 | 1.3724 | 1.3282 | | res2net50_14w_8s | 128 | 0.9998 | 0.9992 | 0.0 | 0.0 | 1.3566 | 1.3244 | | mobilenetv3_large_100 | 128 | 0.9658 | 0.9618 | 0.7658 | 0.0 | 1.3373 | 1.3431 | | mobilenetv2_100 | 128 | 0.9647 | 0.9637 | 0.7075 | 0.0 | 1.3369 | 1.354 | | coat_lite_mini | 128 | 0.9999 | 0.9834 | 0.8344 | 1.1056 | 1.333 | 1.3212 | | inception_v3 | 128 | 0.9999 | 0.996 | 0.0 | 0.0 | 1.3299 | 1.3084 | | gluon_inception_v3 | 128 | 0.9999 | 0.9984 | 0.0 | 0.0 | 1.3281 | 1.3084 | | adv_inception_v3 | 128 | 1.0 | 0.9989 | 0.0 | 0.0 | 1.3237 | 1.3076 | | crossvit_9_240 | 128 | 0.9997 | 0.9982 | 0.7599 | 1.0529 | 1.3213 | 1.3008 | | resnest101e | 64 | 0.9996 | 1.003 | 0.0 | 0.0 | 1.3157 | 1.2707 | | res2next50 | 128 | 0.9999 | 1.0007 | 0.0 | 0.0 | 1.3098 | 1.2736 | | jx_nest_base | 32 | 1.0003 | 0.9955 | 0.7311 | 0.0 | 1.2777 | 1.2486 | | fbnetv3_b | 128 | 0.9642 | 0.9607 | 0.7578 | 0.0 | 1.2759 | 1.2981 | | sebotnet33ts_256 | 64 | 0.9758 | 0.803 | 0.0 | 0.0 | 1.2673 | 
1.2692 | | selecsls42b | 128 | 0.9999 | 0.9988 | 0.8164 | 0.0 | 1.2673 | 1.2531 | | eca_botnext26ts_256 | 128 | 0.9867 | 0.7712 | 0.0 | 0.0 | 1.2659 | 1.2526 | | gmixer_24_224 | 128 | 0.9999 | 0.8097 | 0.0 | 1.0484 | 1.2617 | 1.2341 | | eca_halonext26ts | 128 | 0.9871 | 0.7786 | 0.0 | 0.0 | 1.2592 | 1.244 | | botnet26t_256 | 128 | 0.9856 | 0.9814 | 0.7881 | 0.0 | 1.2575 | 1.2606 | | mnasnet_100 | 128 | 0.966 | 0.9637 | 0.7877 | 0.0 | 1.2555 | 1.2822 | | tf_efficientnet_b0 | 128 | 0.9767 | 0.7831 | 0.0 | 0.0 | 1.2551 | 1.2683 | | fbnetc_100 | 128 | 0.9669 | 0.9628 | 0.7918 | 0.0 | 1.2497 | 1.2646 | | ese_vovnet19b_dw | 128 | 0.9791 | 0.9776 | 0.7447 | 0.0 | 1.2409 | 1.2475 | | spnasnet_100 | 128 | 0.961 | 0.9576 | 0.775 | 0.0 | 1.2373 | 1.253 | | res2net101_26w_4s | 64 | 0.9999 | 0.9971 | 0.7756 | 0.0 | 1.2236 | 1.1884 | | convit_base | 64 | 0.9997 | 0.9981 | 0.0 | 1.3105 | 1.2196 | 1.2094 | | rexnet_100 | 128 | 0.9732 | 0.8157 | 0.0 | 0.0 | 1.212 | 1.2191 | | cspdarknet53 | 64 | 0.9582 | 0.9523 | 0.737 | 1.2258 | 1.2104 | 1.2375 | | pnasnet5large | 16 | 0.9996 | 0.9982 | 0.0 | 0.0 | 1.2101 | 1.1942 | | twins_pcpvt_base | 64 | 1.0 | 0.9981 | 0.7489 | 1.0218 | 1.2084 | 1.1684 | | gmlp_s16_224 | 128 | 1.0 | 0.9493 | 0.0 | 1.0772 | 1.2002 | 1.1894 | | tinynet_a | 128 | 0.966 | 0.7753 | 0.6219 | 0.0 | 1.1899 | 1.194 | | dpn107 | 32 | 0.9577 | 0.9506 | 0.7805 | 0.0 | 1.1877 | 1.1992 | | pit_b_224 | 64 | 1.0003 | 0.9992 | 0.0 | 1.0508 | 1.1876 | 1.1775 | | cait_m36_384 | 4 | 1.0001 | 1.0266 | 0.0 | 1.0929 | 1.1807 | 1.157 | | repvgg_a2 | 128 | 0.964 | 0.9623 | 0.8285 | 1.1371 | 1.1713 | 1.1687 | | tf_mixnet_l | 128 | 0.9856 | 0.8896 | 0.0 | 0.0 | 1.1693 | 1.167 | | mobilevit_s | 64 | 0.9791 | 0.7621 | 0.0 | 0.0 | 1.1676 | 1.1689 | | poolformer_m36 | 64 | 0.9998 | 0.9983 | 0.0 | 0.0 | 1.1668 | 1.1468 | | mixnet_l | 128 | 0.9848 | 0.8855 | 0.0 | 0.0 | 1.1503 | 1.1485 | | swin_base_patch4_window7_224 | 64 | 1.0002 | 0.9779 | 0.0 | 0.0 | 1.1363 | 1.1333 | | 
beit_base_patch16_224 | 64 | 0.9997 | 0.9823 | 0.0 | 0.9404 | 1.1137 | 1.1025 | | swsl_resnext101_32x16d | 32 | 0.9999 | 0.9995 | 0.0 | 0.0 | 1.1075 | 1.0713 | | deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9984 | 0.7679 | 1.0025 | 1.0947 | 1.0821 | | gluon_xception65 | 32 | 0.9999 | 0.997 | 0.0 | 0.0 | 1.0869 | 1.0755 | | vit_base_patch16_224 | 64 | 1.0002 | 0.9981 | 0.7651 | 0.9715 | 1.0864 | 1.0709 | | convmixer_768_32 | 32 | 0.9998 | 0.9998 | 0.0 | 0.0 | 1.0776 | 1.0742 | | gernet_l | 128 | 0.9739 | 0.9725 | 0.8228 | 0.0 | 1.076 | 1.0708 | | convnext_base | 64 | 0.9999 | 0.9984 | 0.0 | 1.2056 | 1.074 | 1.0694 | | mixer_b16_224 | 128 | 1.0 | 0.9778 | 0.0 | 0.9032 | 1.0662 | 1.0611 | | visformer_small | 128 | 0.9996 | 1.0017 | 0.798 | 0.0 | 1.0471 | 1.0124 | | resmlp_12_224 | 128 | 0.9998 | 0.8547 | 0.612 | 1.0527 | 0.7921 | 0.8299 | | tnt_s_patch16_224 | 128 | 1.0001 | 0.9993 | 0.0 | 0.0 | 0.0 | 1.5428 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 
| pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | pass | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass 
| | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | jx_nest_base | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 5.6758 | 24.2259 | nan | nan | 97.9129 | 94.4011 | | swin_base_patch4_window7_224 | 64 | 2.5127 | 11.1487 | nan | nan | 74.4331 | 73.0858 | | mobilevit_s | 64 | 1.6771 | 5.9554 | nan | nan | 72.5534 | 70.5904 | | xcit_large_24_p8_224 | 5 | 2.5972 | 13.7943 | 26.3818 | nan | 72.1378 | 68.2345 | | pnasnet5large | 16 | 4.4234 | 18.195 | nan | nan | 70.2334 | 66.2853 | | twins_pcpvt_base | 64 | 2.2111 | 10.3305 | 18.7337 | 305.7279 | 61.7391 | 61.8269 | | cait_m36_384 | 4 | 2.6499 | 14.2511 | nan | 341.4789 | 60.2508 | 58.4439 | | convnext_base | 64 | 1.1844 | 5.1544 | nan | 114.4446 | 59.2765 | 58.0018 | | resnest101e | 64 | 3.1624 | 12.8703 | nan | nan | 55.191 | 53.935 | | jx_nest_base | 32 | 1.7818 | 7.4274 | 13.427 | nan | 53.2076 | 50.6647 | | res2net101_26w_4s | 64 | 2.981 | 13.2332 | 22.8468 | nan | 52.9602 | 48.8257 | | res2net50_14w_8s | 128 | 2.5697 | 12.0241 | nan | nan | 47.7386 | 44.5669 | | coat_lite_mini | 128 | 1.1213 | 4.1486 | 6.5904 | 85.8414 | 47.2758 | 47.0832 | | sebotnet33ts_256 | 64 | 1.5707 | 5.4185 | nan | nan | 46.5574 | 45.5205 | | eca_halonext26ts | 128 | 1.4776 | 4.5311 | nan | nan | 46.4726 | 45.7912 | | poolformer_m36 | 64 | 1.8539 | 7.0827 | nan | nan | 43.8955 | 43.9633 | | gmlp_s16_224 | 128 | 0.9854 | 5.1687 | nan | 119.5521 | 39.3374 | 37.3765 | | eca_botnext26ts_256 | 128 | 1.3412 | 4.4601 | nan | nan | 38.723 | 37.7321 | | dpn107 | 32 | 3.7593 | 11.499 | 35.9943 | nan | 37.6314 | 35.6378 | | fbnetv3_b | 128 | 2.9909 | 9.2712 | 25.5834 | nan | 37.0176 | 32.9729 | | crossvit_9_240 | 128 | 1.3783 | 6.3515 | 10.4875 | 151.9715 | 36.5609 | 34.4028 | | botnet26t_256 | 128 | 1.3216 | 3.6816 | 8.3269 | nan | 35.2519 | 35.0131 | | volo_d1_224 | 64 | 1.3955 | 6.0708 | 9.9995 | nan | 35.1971 | 32.658 | | gluon_xception65 | 32 | 1.7601 | 8.7117 | nan | nan | 34.1982 | 32.1738 | | 
adv_inception_v3                | 128 | 1.5995 | 6.837     | nan            | nan             | 32.8611  | 30.2135                |
| inception_v3                    | 128 | 1.5111 | 7.0072    | nan            | nan             | 31.8237  | 30.8443                |
| gluon_inception_v3              | 128 | 1.502  | 6.873     | nan            | nan             | 31.6109  | 30.9949                |
| ghostnet_100                    | 128 | 2.6525 | 7.9379    | 12.7268        | nan             | 31.1712  | 30.1409                |
| tf_mixnet_l                     | 128 | 5.5719 | 11.2945   | nan            | nan             | 30.7085  | 29.4208                |
| dla102                          | 128 | 1.7017 | 7.6465    | nan            | nan             | 29.7943  | 28.6222                |
| mixnet_l                        | 128 | 5.2959 | 10.8869   | nan            | nan             | 29.5401  | 28.9401                |
| gmixer_24_224                   | 128 | 1.0432 | 5.7894    | nan            | 119.8018        | 29.2157  | 28.4857                |
| swsl_resnext101_32x16d          | 32  | 1.6284 | 7.482     | nan            | nan             | 28.6665  | 27.2489                |
| dm_nfnet_f0                     | 128 | 2.0469 | 6.6058    | nan            | 131.7866        | 28.4855  | 27.7133                |
| convit_base                     | 64  | 1.0715 | 4.7688    | nan            | 99.9877         | 27.4687  | 26.505                 |
| res2next50                      | 128 | 1.5744 | 6.7033    | nan            | nan             | 27.3919  | 25.8268                |
| tinynet_a                       | 128 | 1.9908 | 6.5318    | 17.5592        | nan             | 25.4807  | 24.1305                |
| rexnet_100                      | 128 | 1.8109 | 6.1602    | nan            | nan             | 25.4385  | 24.8808                |
| tf_efficientnet_b0              | 128 | 1.7427 | 5.6416    | nan            | nan             | 22.605   | 21.0662                |
| cspdarknet53                    | 64  | 2.1923 | 6.3219    | 16.6104        | 111.791         | 22.3753  | 21.0157                |
| resmlp_12_224                   | 128 | 0.6079 | 2.4407    | 3.9406         | 29.6501         | 22.1366  | 20.9371                |
| mixer_b16_224                   | 128 | 0.6668 | 2.6879    | nan            | 60.4097         | 22.033   | 20.1957                |
| visformer_small                 | 128 | 0.927  | 3.4075    | 5.4637         | nan             | 21.353   | 20.4765                |
| nfnet_l0                        | 128 | 1.7629 | 6.1974    | nan            | 119.5436        | 21.0405  | 19.9757                |
| convmixer_768_32                | 32  | 1.0919 | 4.9011    | nan            | nan             | 21.0102  | 19.7444                |
| spnasnet_100                    | 128 | 1.8763 | 5.353     | 14.956         | nan             | 20.6172  | 19.5156                |
| fbnetc_100                      | 128 | 1.9551 | 5.5139    | 15.2878        | nan             | 20.5856  | 19.8788                |
| mobilenetv3_large_100           | 128 | 1.4853 | 4.5798    | 11.7251        | nan             | 19.5646  | 18.995                 |
| beit_base_patch16_224           | 64  | 1.0998 | 4.2197    | nan            | 76.8368         | 19.3602  | 18.6469                |
| deit_base_distilled_patch16_224 | 64  | 0.8309 | 3.4963    | 5.8135         | 64.2137         | 19.3532  | 18.1886                |
| mnasnet_100                     | 128 | 1.5433 | 4.4024    | 11.6242        | nan             | 18.6775  | 16.7912                |
| vit_base_patch16_224            | 64  | 0.8307 | 3.4866    | 6.1571         | 62.9057         | 18.5383  | 17.9197                |
| mobilenetv2_100                 | 128 | 1.6797 | 4.5132    | 11.6029        | nan             | 18.3415  | 17.3047                |
| repvgg_a2                       | 128 | 1.8844 | 5.2869    | 13.9984        | 216.8296        | 17.6041  | 16.9046                |
| pit_b_224                       | 64  | 0.9745 | 3.9159    | nan            | 82.2597         | 17.3771  | 16.6796                |
| gernet_l                        | 128 | 1.8878 | 5.0447    | 13.713         | nan             | 17.0365  | 16.3314                |
| regnety_002                     | 128 | 1.5071 | 4.5384    | 11.3701        | nan             | 16.9243  | 16.4111                |
| selecsls42b                     | 128 | 0.8012 | 2.9711    | 4.8528         | nan             | 15.2159  | 14.5701                |
| lcnet_050                       | 128 | 0.9774 | 2.8222    | 6.6583         | 67.5505         | 13.0877  | 12.0854                |
| ese_vovnet19b_dw                | 128 | 0.9845 | 2.526     | 5.9736         | nan             | 12.4852  | 11.6486                |
| tnt_s_patch16_224               | 128 | 1.546  | 8.1097    | nan            | nan             | nan      | 31.7087                |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224                   | 128 | 0.9951 | 0.9185    | nan            | 0.9166          | 1.5552   | 1.6267                 |
| tinynet_a                       | 128 | 0.9942 | 0.7796    | 0.2616         | nan             | 1.351    | 1.5843                 |
| nfnet_l0                        | 128 | 0.9931 | 0.8274    | nan            | 0.8322          | 1.2911   | 1.4945                 |
| rexnet_100                      | 128 | 0.9935 | 0.7843    | nan            | nan             | 1.2619   | 1.4738                 |
| tf_efficientnet_b0              | 128 | 0.9935 | 0.7688    | nan            | nan             | 1.2059   | 1.3819                 |
| mobilevit_s                     | 64  | 0.9959 | 0.7668    | nan            | nan             | 1.1792   | 1.3591                 |
| pnasnet5large                   | 16  | 1.069  | 1.011     | nan            | nan             | 1.1771   | 1.3424                 |
| mobilenetv2_100                 | 128 | 0.9925 | 0.7621    | 0.3063         | nan             | 1.1752   | 1.2828                 |
| eca_botnext26ts_256             | 128 | 0.9938 | 0.7674    | nan            | nan             | 1.1378   | 1.3608                 |
| eca_halonext26ts                | 128 | 0.9938 | 0.7687    | nan            | nan             | 1.1376   | 1.3403                 |
| cait_m36_384                    | 4   | 0.9994 | 0.934     | nan            | 0.933           | 1.1184   | 1.1751                 |
| poolformer_m36                  | 64  | 0.9979 | 0.9511    | nan            | nan             | 1.0526   | 1.0689                 |
| dm_nfnet_f0                     | 128 | 0.9358 | 0.8935    | nan            | 0.8897          | 1.0218   | 1.0961                 |
| beit_base_patch16_224           | 64  | 0.9966 | 0.9545    | nan            | 0.9286          | 1.0038   | 1.0607                 |
| resnest101e                     | 64  | 0.9971 | 0.9519    | nan            | nan             | 1.0033   | 1.1036                 |
| vit_base_patch16_224            | 64  | 0.9962 | 0.9435    | 0.3153         | 0.9163          | 0.997    | 1.0835                 |
| fbnetv3_b                       | 128 | 0.9932 | 0.7828    | 0.3095         | nan             | 0.9926   | 1.051                  |
| deit_base_distilled_patch16_224 | 64  | 0.9963 | 0.9441    | 0.3137         | 0.9167          | 0.9926   | 1.0799                 |
| twins_pcpvt_base                | 64  | 0.9976 | 0.9195    | 0.3131         | 0.8423          | 0.9924   | 1.0856                 |
| ghostnet_100                    | 128 | 0.9865 | 0.8768    | 0.3273         | nan             | 0.9853   | 1.1265                 |
| convmixer_768_32                | 32  | 0.9986 | 0.9854    | nan            | nan             | 0.9848   | 0.997                  |
| volo_d1_224                     | 64  | 0.996  | 0.9213    | 0.2948         | nan             | 0.9837   | 1.0658                 |
| mixer_b16_224                   | 128 | 0.9952 | 0.94      | nan            | 0.8965          | 0.9827   | 1.0538                 |
| tf_mixnet_l                     | 128 | 0.9953 | 0.8572    | nan            | nan             | 0.9769   | 1.1451                 |
| gmlp_s16_224                    | 128 | 0.9959 | 0.9487    | nan            | 0.9209          | 0.9766   | 0.9827                 |
| xcit_large_24_p8_224            | 5   | 0.9981 | 0.8982    | 0.3269         | nan             | 0.9633   | 1.0572                 |
| dla102                          | 128 | 0.9831 | 0.9169    | nan            | nan             | 0.9632   | 1.0419                 |
| ese_vovnet19b_dw                | 128 | 0.9923 | 0.8877    | 0.3261         | nan             | 0.952    | 1.0925                 |
| gluon_xception65                | 32  | 0.9975 | 0.9365    | nan            | nan             | 0.942    | 0.9938                 |
| mobilenetv3_large_100           | 128 | 0.9876 | 0.8589    | 0.3244         | nan             | 0.9408   | 1.0412                 |
| spnasnet_100                    | 128 | 0.989  | 0.9109    | 0.3309         | nan             | 0.9382   | 0.993                  |
| hrnet_w18                       | 128 | 0.9954 | 0.9252    | nan            | nan             | 0.9379   | 1.0122                 |
| jx_nest_base                    | 32  | 1.0003 | 0.8968    | 0.2863         | nan             | 0.9348   | 1.0604                 |
| mnasnet_100                     | 128 | 0.9877 | 0.9019    | 0.3306         | nan             | 0.9325   | 0.9919                 |
| res2net101_26w_4s               | 64  | 0.9967 | 0.9277    | 0.3243         | nan             | 0.9285   | 1.015                  |
| lcnet_050                       | 128 | 0.9672 | 0.7521    | 0.3171         | 0.7725          | 0.9152   | 0.9655                 |
| gluon_inception_v3              | 128 | 0.9902 | 0.8617    | nan            | nan             | 0.9138   | 1.0634                 |
| adv_inception_v3                | 128 | 0.9902 | 0.8617    | nan            | nan             | 0.9138   | 1.0635                 |
| inception_v3                    | 128 | 0.9902 | 0.8617    | nan            | nan             | 0.9137   | 1.0634                 |
| convnext_base                   | 64  | 0.9975 | 0.9169    | nan            | 0.8692          | 0.9127   | 0.9981                 |
| res2next50                      | 128 | 0.9951 | 0.9153    | nan            | nan             | 0.9078   | 1.0156                 |
| swin_base_patch4_window7_224    | 64  | 0.9976 | 0.9288    | nan            | nan             | 0.9069   | 1.0515                 |
| mixnet_l                        | 128 | 0.9951 | 0.845     | nan            | nan             | 0.9069   | 1.0618                 |
| dpn107                          | 32  | 0.9985 | 0.9272    | 0.3392         | nan             | 0.9059   | 0.9905                 |
| cspdarknet53                    | 64  | 0.9954 | 0.8528    | 0.316          | 0.8297          | 0.9052   | 1.0666                 |
| fbnetc_100                      | 128 | 0.9891 | 0.8518    | 0.3236         | nan             | 0.9049   | 0.9968                 |
| visformer_small                 | 128 | 0.9943 | 0.9381    | 0.3293         | nan             | 0.9035   | 0.994                  |
| selecsls42b                     | 128 | 0.9883 | 0.8896    | 0.337          | nan             | 0.899    | 1.0046                 |
| swsl_resnext101_32x16d          | 32  | 0.9991 | 0.8973    | nan            | nan             | 0.8932   | 0.9946                 |
| res2net50_14w_8s                | 128 | 0.9952 | 0.9049    | nan            | nan             | 0.8821   | 1.0206                 |
| regnety_002                     | 128 | 0.9717 | 0.8104    | 0.3283         | nan             | 0.8617   | 1.0396                 |
| botnet26t_256                   | 128 | 0.9915 | 0.8434    | 0.3165         | nan             | 0.8605   | 0.9622                 |
| pit_b_224                       | 64  | 0.9968 | 0.7947    | nan            | 0.7501          | 0.8563   | 1.0752                 |
| sebotnet33ts_256                | 64  | 0.9952 | 0.7085    | nan            | nan             | 0.841    | 0.9709                 |
| coat_lite_mini                  | 128 | 1.0049 | 0.8526    | 0.3226         | 0.7284          | 0.821    | 1.0246                 |
| gernet_l                        | 128 | 0.9884 | 0.7892    | 0.32           | nan             | 0.7928   | 0.9926                 |
| resmlp_12_224                   | 128 | 0.9893 | 0.6396    | 0.2199         | 0.6275          | 0.7899   | 0.7979                 |
| repvgg_a2                       | 128 | 0.9867 | 0.8054    | 0.3277         | 0.7257          | 0.7684   | 0.9902                 |
| convit_base                     | 64  | 0.9977 | 0.8838    | nan            | 0.8762          | 0.7462   | 0.9008                 |
| crossvit_9_240                  | 128 | 0.9884 | 0.8656    | 0.282          | 0.8418          | 0.6584   | 0.8853                 |
| tnt_s_patch16_224               | 128 | 0.996  | 0.9769    | nan            | nan             | nan      | 0.8622                 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/IFC0YLn.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/BeloVik.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/whIxO0a.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring speedup, compilation latency and memory footprint reduction, we exclude the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 51/53 | 100%, 42/42 | 100%, 61/61 |
|       aot_eager        | 98%, 52/53 | 100%, 42/42 | 95%, 58/61  |
|     aot_cudagraphs     | 89%, 47/53 | 90%, 38/42  | 90%, 55/61  |
|    nvprims_nvfuser     | 55%, 29/53 | 93%, 39/42  | 28%, 17/61  |
|        inductor        | 85%, 45/53 | 93%, 39/42  | 93%, 57/61  |
| inductor_no_cudagraphs | 89%, 47/53 | 93%, 39/42  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
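
Each cell in the pass rate table pairs a rounded percentage with the raw count. A minimal sketch of how such a cell can be derived from a count of passing models; `format_passrate` is a hypothetical helper written for this illustration, not part of the benchmark scripts:

```python
def format_passrate(passed: int, total: int) -> str:
    """Render a pass rate in the same shape as the table cells, e.g. '96%, 51/53'."""
    if total <= 0:
        raise ValueError("total must be positive")
    # ':.0%' multiplies by 100 and rounds to a whole percent.
    return f"{passed / total:.0%}, {passed}/{total}"


# Counts copied from the torchbench column of the table above.
for compiler, passed in [("eager", 51), ("aot_eager", 52), ("inductor", 45)]:
    print(f"{compiler:>10}: {format_passrate(passed, 53)}")
```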

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.22x    |    1.10x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.04x    |    1.07x    |
|        inductor        |   1.91x    |    1.79x    |    1.41x    |
| inductor_no_cudagraphs |   1.37x    |    1.55x    |    1.36x    |
+------------------------+------------+-------------+-------------+
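
For reference, a geometric mean over per-model speedups can be sketched as below. This is an illustration rather than the dashboard's actual script: `geomean_speedup` is a hypothetical helper, and it assumes that a speedup of 0.0 or nan in the per-model tables marks a failed model that is dropped before averaging.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive speedups; returns nan if none are valid.

    Assumption: entries that are 0.0 or nan denote failed models and are skipped.
    """
    valid = [s for s in speedups if s > 0.0]  # 'nan > 0.0' is False, so nan is skipped too
    if not valid:
        return float("nan")
    # exp(mean(log(x))) avoids overflow from multiplying many values together.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# A few inductor speedups from the torchbench table (dlrm's 0.0 is skipped).
print(geomean_speedup([6.6353, 3.5331, 1.0908, 0.0]))
```

Using the geometric rather than the arithmetic mean keeps a single very large speedup from dominating the suite-level number.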

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.40    |    2.86     |    2.15     |
|       aot_eager        |    6.95    |    10.36    |    8.66     |
|     aot_cudagraphs     |   11.15    |    18.97    |    15.99    |
|    nvprims_nvfuser     |   93.96    |   144.92    |   150.66    |
|        inductor        |   32.77    |    34.87    |    40.37    |
| inductor_no_cudagraphs |   32.11    |    30.29    |    38.68    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.90x    |    0.87x    |
|     aot_cudagraphs     |   0.41x    |    0.38x    |    0.33x    |
|    nvprims_nvfuser     |   0.80x    |    0.78x    |    0.76x    |
|        inductor        |   0.85x    |    0.88x    |    0.95x    |
| inductor_no_cudagraphs |   0.96x    |    1.05x    |    1.06x    |
+------------------------+------------+-------------+-------------+
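
The exact formula for the compression ratio is not spelled out here. A minimal sketch, assuming it is the peak memory of native PyTorch divided by the peak memory of the backend (which matches "higher is better"); `compression_ratio` and the byte counts are hypothetical:

```python
def compression_ratio(baseline_peak_bytes: float, backend_peak_bytes: float) -> float:
    """Assumed definition: baseline peak memory / backend peak memory.

    A value above 1.0 means the backend used less peak memory than the baseline.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend_peak_bytes must be positive")
    return baseline_peak_bytes / backend_peak_bytes


GIB = 1024 ** 3
# Hypothetical measurements: 10 GiB for the baseline, 8 GiB for a compiled backend.
print(f"{compression_ratio(10 * GIB, 8 * GIB):.2f}x")
```

Under this reading, a value such as 0.41x would mean the backend's peak footprint is roughly 2.4 times that of the baseline.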

Warnings

Performance speedup warnings
~~~
+-------------+-----------------------+----------+------------------------+
| suite       | name                  | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  | dlrm                  | 0.0      | 1.1193                 |
| torchbench  | hf_GPT2_large         | 0.0      | 1.866                  |
| torchbench  | tacotron2             | 0.0      | 0.8615                 |
| torchbench  | hf_Longformer         | 0.0      | 0.0                    |
| torchbench  | moco                  | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase | 0.0      | 0.0                    |
| timm_models | convnext_base         | 0.6583   | 0.6514                 |
| timm_models | eca_halonext26ts      | 0.0      | 0.0                    |
+-------------+-----------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-----------------------+----------+------------------------+
| suite       | name                  | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  | yolov3                | 410.4643 | 400.3627               |
| torchbench  | timm_efficientdet     | 135.7656 | 132.9004               |
| torchbench  | tacotron2             | nan      | 65.4744                |
| torchbench  | hf_GPT2_large         | nan      | 52.0292                |
| torchbench  | dlrm                  | nan      | 3.3719                 |
| torchbench  | hf_Longformer         | nan      | nan                    |
| torchbench  | moco                  | nan      | nan                    |
| huggingface | AllenaiLongformerBase | nan      | nan                    |
| timm_models | hrnet_w18             | 122.489  | 114.4404               |
| timm_models | eca_halonext26ts      | nan      | nan                    |
+-------------+-----------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+----------------------------------+----------+------------------------+
| suite       | name                             | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench  | timm_vision_transformer_large    | 0.879    | 0.9542                 |
| torchbench  | timm_resnest                     | 0.8759   | 0.9953                 |
| torchbench  | densenet121                      | 0.8753   | 1.0051                 |
| torchbench  | squeezenet1_1                    | 0.8735   | 1.0608                 |
| torchbench  | hf_Bert                          | 0.873    | 0.9426                 |
| torchbench  | shufflenet_v2_x1_0               | 0.869    | 0.993                  |
| torchbench  | fastNLP_Bert                     | 0.8661   | 1.0682                 |
| torchbench  | resnet50                         | 0.8659   | 0.885                  |
| torchbench  | hf_T5_large                      | 0.8541   | 0.8541                 |
| torchbench  | hf_DistilBert                    | 0.8383   | 0.9051                 |
| torchbench  | dcgan                            | 0.8283   | 0.9695                 |
| torchbench  | hf_Bart                          | 0.8231   | 0.9873                 |
| torchbench  | hf_BigBird                       | 0.8109   | 1.0963                 |
| torchbench  | alexnet                          | 0.7973   | 1.0079                 |
| torchbench  | mobilenet_v3_large               | 0.791    | 0.8143                 |
| torchbench  | pytorch_stargan                  | 0.7801   | 0.8859                 |
| torchbench  | timm_vovnet                      | 0.7799   | 0.8875                 |
| torchbench  | resnext50_32x4d                  | 0.7647   | 0.775                  |
| torchbench  | vgg16                            | 0.7633   | 1.0588                 |
| torchbench  | mnasnet1_0                       | 0.7541   | 0.7741                 |
| torchbench  | drq                              | 0.752    | 0.9256                 |
| torchbench  | LearningToPaint                  | 0.7295   | 0.925                  |
| torchbench  | soft_actor_critic                | 0.7295   | 1.0367                 |
| torchbench  | timm_vision_transformer          | 0.7139   | 0.724                  |
| torchbench  | resnet18                         | 0.6102   | 0.6257                 |
| torchbench  | lennard_jones                    | 0.564    | 0.9991                 |
| torchbench  | nvidia_deeprecommender           | 0.5596   | 0.5596                 |
| torchbench  | hf_Reformer                      | 0.5295   | 0.9885                 |
| torchbench  | functorch_dp_cifar10             | 0.4478   | 0.4806                 |
| torchbench  | pytorch_struct                   | 0.4235   | 0.4353                 |
| torchbench  | hf_GPT2_large                    | nan      | 1.1354                 |
| torchbench  | dlrm                             | nan      | 0.7306                 |
| torchbench  | tacotron2                        | nan      | 0.4113                 |
| torchbench  | hf_Longformer                    | nan      | nan                    |
| torchbench  | moco                             | nan      | nan                    |
| huggingface | MegatronBertForQuestionAnswering | 0.893    | 1.0179                 |
| huggingface | MegatronBertForCausalLM          | 0.8918   | 1.0275                 |
| huggingface | PLBartForConditionalGeneration   | 0.8848   | 1.028                  |
| huggingface | DistilBertForMaskedLM            | 0.8803   | 0.948                  |
| huggingface | MT5ForConditionalGeneration      | 0.8756   | 0.9197                 |
| huggingface | Speech2Text2ForCausalLM          | 0.8691   | 0.9801                 |
| huggingface | ElectraForCausalLM               | 0.856    | 0.9327                 |
| huggingface | PLBartForCausalLM                | 0.8546   | 0.9358                 |
| huggingface | BlenderbotSmallForCausalLM       | 0.846    | 0.9426                 |
| huggingface | BigBird                          | 0.8172   | 1.0883                 |
| huggingface | CamemBert                        | 0.8062   | 0.9318                 |
| huggingface | XGLMForCausalLM                  | 0.8055   | 0.9902                 |
| huggingface | DistillGPT2                      | 0.7997   | 1.0175                 |
| huggingface | M2M100ForConditionalGeneration   | 0.7882   | 1.0174                 |
| huggingface | YituTechConvBert                 | 0.7879   | 0.9269                 |
| huggingface | MobileBertForMaskedLM            | 0.6698   | 0.9454                 |
| huggingface | MobileBertForQuestionAnswering   | 0.6085   | 0.8221                 |
| huggingface | DebertaForMaskedLM               | 0.4088   | 1.0667                 |
| huggingface | DebertaForQuestionAnswering      | 0.307    | 1.1932                 |
| huggingface | AllenaiLongformerBase            | nan      | nan                    |
| timm_models | res2net101_26w_4s                | 0.8977   | 0.973                  |
| timm_models | gluon_xception65                 | 0.8973   | 0.9761                 |
| timm_models | fbnetc_100                       | 0.8973   | 0.9876                 |
| timm_models | inception_v3                     | 0.8973   | 1.0246                 |
| timm_models | gluon_inception_v3               | 0.8973   | 1.0246                 |
| timm_models | adv_inception_v3                 | 0.8973   | 1.0246                 |
| timm_models | hrnet_w18                        | 0.8969   | 1.0032                 |
| timm_models | selecsls42b                      | 0.8926   | 0.9897                 |
| timm_models | vit_base_patch16_224             | 0.8916   | 0.8968                 |
| timm_models | deit_base_distilled_patch16_224  | 0.8911   | 0.8962                 |
| timm_models | spnasnet_100                     | 0.8795   | 0.9819                 |
| timm_models | res2net50_14w_8s                 | 0.8769   | 0.9736                 |
| timm_models | convnext_base                    | 0.8761   | 0.9863                 |
| timm_models | res2next50                       | 0.8719   | 0.9671                 |
| timm_models | mnasnet_100                      | 0.871    | 0.9804                 |
| timm_models | mixnet_l                         | 0.8703   | 1.0094                 |
| timm_models | gernet_l                         | 0.8617   | 0.9854                 |
| timm_models | cspdarknet53                     | 0.8606   | 1.0104                 |
| timm_models | botnet26t_256                    | 0.8503   | 0.9434                 |
| timm_models | lcnet_050                        | 0.8449   | 0.9432                 |
| timm_models | regnety_002                      | 0.8371   | 1.0078                 |
| timm_models | crossvit_9_240                   | 0.8174   | 1.0976                 |
| timm_models | coat_lite_mini                   | 0.8032   | 1.0344                 |
| timm_models | repvgg_a2                        | 0.7908   | 0.9916                 |
| timm_models | resmlp_12_224                    | 0.7876   | 0.8011                 |
| timm_models | swin_base_patch4_window7_224     | 0.7569   | 0.9259                 |
| timm_models | sebotnet33ts_256                 | 0.745    | 0.8292                 |
| timm_models | jx_nest_base                     | 0.6707   | 0.8617                 |
| timm_models | eca_halonext26ts                 | nan      | nan                    |
+-------------+----------------------------------+----------+------------------------+
~~~
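
The warning tables list models whose metric falls on the wrong side of some cutoff, plus models that failed to produce a number (nan). The cutoffs themselves are not stated in this comment, so the sketch below uses hypothetical thresholds and a hypothetical helper, `needs_warning`, only to show the shape of such a filter:

```python
import math


def needs_warning(value: float, threshold: float, higher_is_better: bool = True) -> bool:
    """Return True if a metric should be listed in a warning table.

    nan is treated as a failed run and is always flagged.
    """
    if math.isnan(value):
        return True
    return value < threshold if higher_is_better else value > threshold


# Hypothetical cutoffs, chosen only for illustration.
SPEEDUP_MIN = 0.95
COMPILE_SECONDS_MAX = 120.0

# Values copied from the inductor column of the warning tables above.
print(needs_warning(0.6583, SPEEDUP_MIN))                                       # convnext_base speedup
print(needs_warning(410.4643, COMPILE_SECONDS_MAX, higher_is_better=False))     # yolov3 compile time
print(needs_warning(float("nan"), COMPILE_SECONDS_MAX, higher_is_better=False)) # tacotron2 compile time
```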

Metrics over time

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/yZoqtXI.png)

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/QmK89en.png)

Accuracy Regressions

torchbench suite with amp precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name                              | bs   | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| densenet121                       | 4    | 1.0038 | 0.912     | 2.5061         | 0.695           | 6.6353   | 1.3051                 |
| functorch_dp_cifar10              | 64   | 1.0013 | 0.9499    | 2.4411         | 0.0             | 5.0123   | 1.381                  |
| timm_efficientdet                 | 1    | 0.9844 | 0.8065    | 2.1265         | 0.0             | 4.7486   | 1.5275                 |
| BERT_pytorch                      | 16   | 1.0072 | 0.8322    | 1.6051         | 0.8831          | 3.5331   | 2.3736                 |
| resnext50_32x4d                   | 8    | 1.0023 | 0.9668    | 1.8695         | 0.0             | 3.4443   | 1.2652                 |
| mobilenet_v3_large                | 32   | 1.0041 | 1.0069    | 1.4987         | 0.0             | 3.2945   | 1.4111                 |
| timm_vision_transformer           | 8    | 1.0036 | 0.8562    | 1.8007         | 0.6308          | 3.261    | 1.5313                 |
| drq                               | 1    | 1.0058 | 0.815     | 1.9892         | 0.6111          | 2.9294   | 1.2021                 |
| resnet18                          | 16   | 1.0052 | 1.0038    | 1.6106         | 0.7659          | 2.7994   | 1.2521                 |
| mnasnet1_0                        | 32   | 0.9987 | 1.0218    | 1.3626         | 0.0             | 2.7677   | 1.3453                 |
| dcgan                             | 32   | 0.9883 | 0.9222    | 1.6634         | 0.6869          | 2.5968   | 1.072                  |
| hf_T5_large                       | 2    | 1.0217 | 0.8563    | 0.0            | 0.8399          | 2.5621   | 2.1525                 |
| squeezenet1_1                     | 32   | 0.9954 | 0.9691    | 1.4672         | 0.7237          | 2.4601   | 1.3008                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.998  | 0.9854    | 1.8992         | 0.0             | 2.4365   | 1.5868                 |
| hf_Albert                         | 8    | 1.0003 | 0.9523    | 0.7728         | 1.0806          | 2.3843   | 2.3179                 |
| timm_efficientnet                 | 32   | 0.9596 | 0.8099    | 1.099          | 0.0             | 2.1384   | 1.2638                 |
| pytorch_struct                    | 200  | 0.9941 | 0.7418    | 1.0163         | 0.6228          | 2.1211   | 1.2964                 |
| hf_GPT2                           | 4    | 1.0206 | 0.9731    | 0.8197         | 0.2923          | 2.0917   | 1.9201                 |
| lennard_jones                     | 1000 | 0.9695 | 0.7695    | 1.2866         | 0.4675          | 2.0758   | 1.0499                 |
| hf_Bert                           | 4    | 1.037  | 0.8536    | 1.1018         | 0.6789          | 2.0718   | 1.8384                 |
| LearningToPaint                   | 96   | 0.9998 | 1.011     | 1.1987         | 0.0             | 1.947    | 1.3119                 |
| timm_resnest                      | 32   | 1.0004 | 1.0247    | 0.8387         | 0.0             | 1.9302   | 1.6484                 |
| hf_T5                             | 8    | 1.0017 | 0.9215    | 0.0            | 1.3656          | 1.8813   | 1.8866                 |
| resnet50                          | 32   | 0.9997 | 1.0152    | 1.1397         | 0.0             | 1.7771   | 1.3521                 |
| hf_Bart                           | 4    | 1.0088 | 0.8376    | 0.8707         | 0.6669          | 1.7724   | 1.7493                 |
| attention_is_all_you_need_pytorch | 256  | 1.0098 | 0.8511    | 0.8353         | 0.7533          | 1.7554   | 1.5145                 |
| shufflenet_v2_x1_0                | 128  | 0.9999 | 1.02      | 1.0199         | 0.0             | 1.717    | 1.4083                 |
| speech_transformer                | 32   | 1.0032 | 0.8432    | 1.9154         | 0.7294          | 1.6915   | 1.6993                 |
| soft_actor_critic                 | 256  | 0.9718 | 0.7442    | 1.3137         | 0.5413          | 1.6758   | 1.0354                 |
| mobilenet_v2                      | 96   | 0.9996 | 0.9885    | 0.7606         | 0.0             | 1.5753   | 1.5152                 |
| fastNLP_Bert                      | 6    | 1.0001 | 0.8867    | 0.7655         | 0.7831          | 1.53     | 1.4732                 |
| hf_DistilBert                     | 8    | 0.9999 | 0.9683    | 0.764          | 0.3667          | 1.5119   | 1.4813                 |
| timm_nfnet                        | 128  | 0.9992 | 0.9999    | 0.8798         | 1.0948          | 1.501    | 1.4288                 |
| pytorch_stargan                   | 16   | 0.9994 | 1.0914    | 1.0406         | 0.0             | 1.4651   | 1.4118                 |
| pytorch_unet                      | 1    | 0.9994 | 0.9921    | 0.8634         | 0.0             | 1.342    | 1.3116                 |
| timm_regnet                       | 32   | 0.9736 | 0.9406    | 0.8911         | 0.0             | 1.3309   | 1.2205                 |
| timm_vovnet                       | 32   | 0.9198 | 0.8894    | 0.8697         | 0.0             | 1.3156   | 1.1442                 |
| Super_SloMo                       | 6    | 0.9993 | 0.9948    | 0.8848         | 0.0             | 1.2907   | 1.2581                 |
| vgg16                             | 64   | 0.9995 | 0.9972    | 0.8572         | 0.9731          | 1.2721   | 1.2629                 |
| Background_Matting                | 4    | 1.0005 | 1.0182    | 0.8945         | 0.0             | 1.2251   | 1.2086                 |
| alexnet                           | 128  | 0.998  | 0.9967    | 0.8151         | 0.9267          | 1.2103   | 1.2094                 |
| hf_Reformer                       | 4    | 0.9983 | 0.9995    | 0.9913         | 0.0             | 1.1768   | 1.1723                 |
| timm_vision_transformer_large     | 8    | 1.0003 | 0.99      | 0.0            | 0.9205          | 1.1565   | 1.1318                 |
| hf_BigBird                        | 2    | 0.9928 | 0.9091    | 1.0543         | 0.8296          | 1.1414   | 1.0241                 |
| yolov3                            | 16   | 0.9997 | 0.9896    | 0.8052         | 0.9987          | 1.0908   | 1.0652                 |
| tts_angular                       | 64   | 1.0055 | 0.9347    | 0.9866         | 0.9569          | 1.0266   | 1.0217                 |
| demucs                            | 4    | 1.0016 | 1.0012    | 0.9999         | 0.9992          | 1.001    | 1.0003                 |
| nvidia_deeprecommender            | 256  | 0.9985 | 0.9961    | 0.6969         | 1.0078          | 0.9888   | 1.0308                 |
| dlrm                              | 2048 | 0.0    | 1.1268    | 0.0            | 1.0877          | 0.0      | 1.1193                 |
| hf_GPT2_large                     | 4    | 0.9997 | 0.9911    | 0.0            | 0.4192          | 0.0      | 1.866                  |
| tacotron2                         | 64   | 0.978  | 0.7496    | 0.9891         | 0.0             | 0.0      | 0.8615                 |
| hf_Longformer                     | 2    | 0.9353 | 0.8561    | 0.8739         | 0.0             | 0.0      | 0.0                    |
| moco                              | 0    | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name                              | bs  | eager            | aot_eager        | aot_cudagraphs   | nvprims_nvfuser  | inductor         | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large                     | 2   | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip       |
| timm_vision_transformer_large     | 2   | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip       |
| hf_T5_large                       | 2   | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip       |
| timm_vision_transformer           | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| resnext50_32x4d                   | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| shufflenet_v2_x1_0                | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| soft_actor_critic                 | 256 | pass             | pass             | pass             | pass             | pass             | pass                   |
| squeezenet1_1                     | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| timm_efficientnet                 | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| timm_nfnet                        | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| timm_regnet                       | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| timm_resnest                      | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| timm_vovnet                       | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| pytorch_unet                      | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| vgg16                             | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| yolov3                            | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| hf_T5_base                        | 2   | pass             | pass             | fail_to_run      | pass             | pass             | pass                   |
| Super_SloMo                       | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| functorch_dp_cifar10              | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| hf_Reformer                       | 2   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| pytorch_CycleGAN_and_pix2pix      | 1   | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| pytorch_stargan                   | 16  | pass             | pass             | pass             | fail_to_run      | pass             | pass                   |
| speech_transformer                | 2   | pass             | pass             | pass             | fail_accuracy    | pass             | pass                   |
| resnet18                          | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| resnet50                          | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| pytorch_struct                    | 200 | pass             | pass             | pass             | pass             | pass             | pass                   |
| fastNLP_Bert                      | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| BERT_pytorch                      | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| Background_Matting                | 4   | pass             | pass             | pass             | pass             | pass             | pass                   |
| LearningToPaint                   | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| alexnet                           | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| attention_is_all_you_need_pytorch | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| dcgan                             | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| nvidia_deeprecommender            | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| densenet121                       | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| drq                               | 1   | pass             | pass             | pass             | pass             | pass             | pass                   |
| demucs                            | 4   | pass             | pass             | pass             | pass             | pass             | pass                   |
| hf_Albert                         | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| hf_T5                             | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| hf_Bart                           | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| mnasnet1_0                        | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| lennard_jones                     | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| mobilenet_v2                      | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| hf_GPT2                           | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| hf_DistilBert                     | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| hf_BigBird                        | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| hf_Bert                           | 2   | pass             | pass             | pass             | pass             | pass             | pass                   |
| tacotron2                         | 2   | pass             | pass             | pass             | fail_accuracy    | fail_to_run      | pass                   |
| timm_efficientdet                 | 2   | pass             | pass             | pass             | pass             | fail_to_run      | fail_to_run            |
| dlrm                              | 2   | pass             | pass             | fail_to_run      | pass             | fail_to_run      | fail_to_run            |
| hf_Longformer                     | 2   | pass             | pass             | pass             | fail_to_run      | fail_to_run      | fail_to_run            |
| moco                              | 2   | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run      | fail_to_run            |
| vision_maskrcnn                   | 2   | pass             | pass             | fail_to_run      | 0.0000           | fail_to_run      | 0.0000                 |
| mobilenet_v3_large                | 2   | pass             | pass             | pass             | fail_accuracy    | fail_accuracy    | fail_accuracy          |
| tts_angular                       | 2   | pass             | pass             | pass             | 0.0000           | 0.0000           | 0.0000                 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name                              | bs   | eager   | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3                            | 16   | 2.9481  | 8.2822    | 11.8227        | 119.1098        | 410.4643 | 400.3627               |
| timm_efficientdet                 | 1    | 19.8383 | 37.4231   | 80.0596        | nan             | 135.7656 | 132.9004               |
| hf_T5_large                       | 2    | 14.3151 | 39.6135   | nan            | 514.4011        | 115.0878 | 113.5764               |
| timm_vision_transformer_large     | 8    | 2.9229  | 15.3386   | nan            | 297.5342        | 69.5554  | 67.3802                |
| densenet121                       | 4    | 2.2615  | 11.8997   | 19.4809        | 230.9426        | 47.8924  | 47.4643                |
| hf_BigBird                        | 2    | 8.5488  | 15.4394   | 30.5753        | 105.675         | 44.8745  | 29.8331                |
| timm_resnest                      | 32   | 0.6109  | 2.5308    | 3.8723         | nan             | 38.222   | 37.5522                |
| attention_is_all_you_need_pytorch | 256  | 1.335   | 7.1807    | 11.2905        | 113.4868        | 34.7141  | 33.8898                |
| timm_vision_transformer           | 8    | 0.9568  | 4.6302    | 6.6929         | 68.0522         | 34.1115  | 34.5243                |
| BERT_pytorch                      | 16   | 1.7639  | 7.6144    | 11.5299        | 104.8362        | 32.1658  | 31.616                 |
| hf_Bart                           | 4    | 1.9784  | 9.3029    | 13.5391        | 150.374         | 31.6095  | 31.6372                |
| timm_nfnet                        | 128  | 2.1175  | 7.1782    | 10.6313        | 139.6021        | 30.8732  | 29.7873                |
| hf_T5                             | 8    | 2.615   | 8.6808    | nan            | 80.5126         | 29.409   | 28.1122                |
| fastNLP_Bert                      | 6    | 1.776   | 7.0942    | 11.4224        | 102.7015        | 28.6166  | 26.6439                |
| timm_regnet                       | 32   | 2.3407  | 7.9656    | 19.7654        | nan             | 27.1679  | 26.3271                |
| speech_transformer                | 32   | 1.9125  | 8.9583    | 33.8131        | 149.5198        | 26.7025  | 26.4637                |
| pytorch_stargan                   | 16   | 0.4132  | 1.968     | 2.8943         | nan             | 26.6567  | 25.6457                |
| timm_efficientnet                 | 32   | 1.8793  | 6.7481    | 15.5941        | nan             | 26.2496  | 25.5719                |
| mobilenet_v3_large                | 32   | 0.9783  | 4.8491    | 7.1975         | nan             | 25.144   | 24.6038                |
| pytorch_struct                    | 200  | 0.2827  | 0.8594    | 1.4781         | 6.381           | 22.4334  | 22.0491                |
| mnasnet1_0                        | 32   | 0.8871  | 4.3308    | 6.6985         | nan             | 20.8956  | 20.1469                |
| hf_Bert                           | 4    | 1.7678  | 7.0695    | 10.5334        | 99.2513         | 20.5869  | 20.014                 |
| resnet50                          | 32   | 0.9379  | 4.5869    | 6.9772         | nan             | 20.2211  | 19.4908                |
| hf_Reformer                       | 4    | 1.824   | 3.1907    | 6.1846         | nan             | 20.058   | 16.4786                |
| hf_GPT2                           | 4    | 1.7463  | 7.0081    | 9.6515         | 70.2554         | 19.9606  | 19.2607                |
| shufflenet_v2_x1_0                | 128  | 1.0513  | 4.9913    | 7.7668         | nan             | 19.9573  | 19.9113                |
| timm_vovnet                       | 32   | 1.5382  | 4.3772    | 10.0472        | nan             | 19.9005  | 19.1675                |
| hf_Albert                         | 8    | 1.5834  | 6.5762    | 10.0956        | 98.0025         | 19.8428  | 19.2511                |
| resnext50_32x4d                   | 8    | 0.9527  | 4.6049    | 6.7458         | nan             | 19.4132  | 19.1243                |
| mobilenet_v2                      | 96   | 0.8774  | 4.5304    | 7.0024         | nan             | 19.3353  | 18.5796                |
| Super_SloMo                       | 6    | 1.0811  | 4.711     | 6.5154         | nan             | 19.2612  | 18.904                 |
| Background_Matting                | 4    | 0.9236  | 4.5629    | 7.0046         | nan             | 18.8968  | 18.0019                |
| functorch_dp_cifar10              | 64   | 0.3894  | 1.6572    | 2.4397         | nan             | 13.485   | 13.2996                |
| hf_DistilBert                     | 8    | 0.7932  | 3.4676    | 5.9585         | 44.7378         | 13.4501  | 13.0961                |
| resnet18                          | 16   | 0.4811  | 1.8025    | 2.6114         | 40.1107         | 11.2817  | 11.1454                |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.4399  | 1.9751    | 2.8864         | nan             | 9.5678   | 8.8943                 |
| pytorch_unet                      | 1    | 0.5019  | 2.0157    | 3.0054         | nan             | 9.2225   | 8.7948                 |
| LearningToPaint                   | 96   | 0.4648  | 1.909     | 2.8823         | nan             | 8.3285   | 7.9621                 |
| squeezenet1_1                     | 32   | 0.2613  | 0.9357    | 1.391          | 6.7532          | 4.8039   | 4.7075                 |
| drq                               | 1    | 0.3205  | 0.6312    | 1.031          | 6.0128          | 4.4697   | 3.7494                 |
| vgg16                             | 64   | 0.2021  | 0.6554    | 1.1072         | 4.9758          | 4.1522   | 3.9619                 |
| nvidia_deeprecommender            | 256  | 0.2206  | 0.5269    | 0.8279         | 5.2867          | 3.6633   | 3.4563                 |
| soft_actor_critic                 | 256  | 0.2135  | 0.3641    | 0.5881         | 3.111           | 3.6476   | 3.0791                 |
| alexnet                           | 128  | 0.169   | 0.4415    | 0.7633         | 4.7645          | 3.3909   | 3.2214                 |
| dcgan                             | 32   | 0.1886  | 0.4289    | 0.7134         | 5.2067          | 2.9472   | 2.7565                 |
| lennard_jones                     | 1000 | 0.1588  | 0.3457    | 0.5259         | 2.8209          | 2.222    | 2.0264                 |
| tts_angular                       | 64   | 0.2299  | 0.2766    | 0.4063         | 1.4765          | 2.0622   | 1.826                  |
| demucs                            | 4    | 0.369   | 0.3495    | 0.3498         | 0.3535          | 0.2577   | 0.2623                 |
| tacotron2                         | 64   | 17.9601 | 33.1348   | 52.209         | nan             | nan      | 65.4744                |
| hf_GPT2_large                     | 4    | 5.6844  | 21.0561   | nan            | 331.6675        | nan      | 52.0292                |
| dlrm                              | 2048 | nan     | 0.8111    | nan            | 4.7052          | nan      | 3.3719                 |
| hf_Longformer                     | 2    | 6.4308  | 14.5241   | 57.5854        | nan             | nan      | nan                    |
| moco                              | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name                              | bs   | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| timm_efficientnet                 | 32   | 0.988  | 0.7698    | 0.2718         | nan             | 1.2042   | 1.2318                 |
| hf_Albert                         | 8    | 0.9813 | 0.9356    | 0.3267         | 0.8454          | 1.1573   | 1.4691                 |
| speech_transformer                | 32   | 1.0001 | 0.9165    | 0.331          | 0.8871          | 1.1082   | 1.116                  |
| mobilenet_v2                      | 96   | 0.9857 | 0.7639    | 0.3119         | nan             | 1.0606   | 1.1512                 |
| Super_SloMo                       | 6    | 1.0021 | 0.9646    | 0.3844         | nan             | 1.0542   | 1.2945                 |
| timm_nfnet                        | 128  | 0.9695 | 0.8983    | 0.3558         | 0.8027          | 1.0336   | 1.1302                 |
| attention_is_all_you_need_pytorch | 256  | 0.9979 | 0.94      | 0.3514         | 0.8595          | 1.0179   | 1.1759                 |
| Background_Matting                | 4    | 1.0139 | 0.9632    | 0.3723         | nan             | 0.9918   | 1.043                  |
| tts_angular                       | 64   | 1.0002 | 1.0002    | 0.9853         | 1.0003          | 0.9895   | 1.0002                 |
| demucs                            | 4    | 0.9869 | 0.9869    | 0.9869         | 0.9869          | 0.9869   | 0.9869                 |
| timm_efficientdet                 | 1    | 1.028  | 0.8414    | 0.3082         | nan             | 0.9826   | 1.0135                 |
| BERT_pytorch                      | 16   | 0.9993 | 0.8806    | 0.3994         | 0.8629          | 0.9736   | 1.1219                 |
| hf_GPT2                           | 4    | 0.9707 | 0.8847    | 0.3799         | 0.86            | 0.9648   | 1.1245                 |
| timm_regnet                       | 32   | 0.9954 | 0.8452    | 0.3492         | nan             | 0.9346   | 1.0309                 |
| hf_T5                             | 8    | 0.9678 | 0.9331    | nan            | 0.9097          | 0.9309   | 1.2521                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.9945 | 0.878     | 0.4215         | nan             | 0.9139   | 1.024                  |
| pytorch_unet                      | 1    | 0.9968 | 0.8653    | 0.3573         | nan             | 0.911    | 1.0853                 |
| yolov3                            | 16   | 0.9902 | 0.8373    | 0.3535         | 0.6534          | 0.9056   | 1.0455                 |
| timm_vision_transformer_large     | 8    | 0.9974 | 0.8359    | nan            | 0.8306          | 0.879    | 0.9542                 |
| timm_resnest                      | 32   | 0.9868 | 0.8711    | 0.3482         | nan             | 0.8759   | 0.9953                 |
| densenet121                       | 4    | 0.9857 | 0.8678    | 0.3673         | 0.7966          | 0.8753   | 1.0051                 |
| squeezenet1_1                     | 32   | 0.9604 | 0.7958    | 0.3463         | 0.5764          | 0.8735   | 1.0608                 |
| hf_Bert                           | 4    | 0.9853 | 0.8759    | 0.3903         | 0.8648          | 0.873    | 0.9426                 |
| shufflenet_v2_x1_0                | 128  | 0.956  | 0.8401    | 0.3575         | nan             | 0.869    | 0.993                  |
| fastNLP_Bert                      | 6    | 1.0028 | 0.8977    | 0.3702         | 0.8847          | 0.8661   | 1.0682                 |
| resnet50                          | 32   | 0.9898 | 0.8628    | 0.3561         | nan             | 0.8659   | 0.885                  |
| hf_T5_large                       | 2    | 0.8541 | 0.8541    | nan            | 0.8541          | 0.8541   | 0.8541                 |
| hf_DistilBert                     | 8    | 0.9504 | 0.8808    | 0.341          | 0.8486          | 0.8383   | 0.9051                 |
| dcgan                             | 32   | 0.9698 | 0.7838    | 0.5014         | 0.6247          | 0.8283   | 0.9695                 |
| hf_Bart                           | 4    | 0.9101 | 0.8312    | 0.3634         | 0.8107          | 0.8231   | 0.9873                 |
| hf_BigBird                        | 2    | 0.984  | 0.9787    | 0.4542         | 0.9342          | 0.8109   | 1.0963                 |
| alexnet                           | 128  | 0.951  | 0.7753    | 0.4793         | 0.7444          | 0.7973   | 1.0079                 |
| mobilenet_v3_large                | 32   | 0.9776 | 0.8499    | 0.3446         | nan             | 0.791    | 0.8143                 |
| pytorch_stargan                   | 16   | 0.9955 | 0.9766    | 0.4263         | nan             | 0.7801   | 0.8859                 |
| timm_vovnet                       | 32   | 0.9903 | 0.7678    | 0.3408         | nan             | 0.7799   | 0.8875                 |
| resnext50_32x4d                   | 8    | 0.9951 | 0.8553    | 0.3886         | nan             | 0.7647   | 0.775                  |
| vgg16                             | 64   | 0.9924 | 0.7339    | 0.3775         | 0.6125          | 0.7633   | 1.0588                 |
| mnasnet1_0                        | 32   | 0.9785 | 0.8621    | 0.3408         | nan             | 0.7541   | 0.7741                 |
| drq                               | 1    | 0.9877 | 0.8312    | 0.4769         | 0.8309          | 0.752    | 0.9256                 |
| LearningToPaint                   | 96   | 0.9252 | 0.7196    | 0.3826         | nan             | 0.7295   | 0.925                  |
| soft_actor_critic                 | 256  | 0.9998 | 0.9149    | 0.4736         | 0.878           | 0.7295   | 1.0367                 |
| timm_vision_transformer           | 8    | 0.9938 | 0.8822    | 0.3905         | 0.8514          | 0.7139   | 0.724                  |
| resnet18                          | 16   | 0.9779 | 0.7727    | 0.3943         | 0.5571          | 0.6102   | 0.6257                 |
| lennard_jones                     | 1000 | 0.9995 | 0.9997    | 0.3734         | 0.9996          | 0.564    | 0.9991                 |
| nvidia_deeprecommender            | 256  | 0.5596 | 0.5596    | 0.5125         | 0.5596          | 0.5596   | 0.5596                 |
| hf_Reformer                       | 4    | 0.9861 | 0.9861    | 0.5889         | nan             | 0.5295   | 0.9885                 |
| functorch_dp_cifar10              | 64   | 0.9964 | 0.8107    | 0.4447         | nan             | 0.4478   | 0.4806                 |
| pytorch_struct                    | 200  | 1.0    | 0.5081    | 0.4858         | 0.5081          | 0.4235   | 0.4353                 |
| hf_GPT2_large                     | 4    | 0.9581 | 0.8718    | nan            | 0.8628          | nan      | 1.1354                 |
| dlrm                              | 2048 | nan    | 0.7305    | nan            | 0.7306          | nan      | 0.7306                 |
| tacotron2                         | 64   | 0.9862 | 0.3962    | 0.3141         | nan             | nan      | 0.4113                 |
| hf_Longformer                     | 2    | 0.9731 | 0.9666    | 0.3488         | nan             | nan      | nan                    |
| moco
| 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~
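
For anyone post-processing these dumps, here is a minimal sketch of turning one of the ASCII tables into records. The helper names (`parse_table`, `to_float`) are hypothetical and not part of the benchmark scripts, and the header row in the sample is assumed to match the column names used by the other tables in this comment.

```python
import math

# Three rows copied from the table above; the header row is an assumption
# (same column names as the other tables in this comment).
SAMPLE = """
+-------------+----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-------------+----+--------+-----------+----------------+-----------------+----------+------------------------+
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9869 | 0.9869 | 0.9869 | 0.9869 | 0.9869 | 0.9869 |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-------------+----+--------+-----------+----------------+-----------------+----------+------------------------+
"""


def parse_table(text):
    """Parse a '+---+' / '| a | b |' style table into a list of dicts."""
    header = None
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip '+----+' separator lines and blanks
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' line is the header row
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def to_float(cell):
    """Convert a cell to float; non-numeric cells (e.g. 'pass') become nan."""
    try:
        return float(cell)
    except ValueError:
        return math.nan


if __name__ == "__main__":
    for row in parse_table(SAMPLE):
        print(row["name"], to_float(row["inductor"]))
```

Cells are kept as strings until `to_float` is applied, so the same parser also works on the Accuracy tables, where cells hold `pass` / `fail_to_run` rather than numbers.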

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| MobileBertForMaskedLM | 32 | 1.0195 | 0.8243 | 1.7968 | 0.6865 | 5.2367 | 1.801 |
| YituTechConvBert | 1 | 1.024 | 0.831 | 2.4594 | 0.6855 | 4.7995 | 1.6173 |
| MobileBertForQuestionAnswering | 64 | 1.0188 | 0.8185 | 1.5396 | 0.6712 | 3.7958 | 1.8223 |
| CamemBert | 1 | 1.0438 | 0.8316 | 1.7408 | 0.6576 | 3.6217 | 1.8097 |
| MT5ForConditionalGeneration | 8 | 1.0149 | 0.8743 | 1.5323 | 0.9379 | 3.574 | 2.4759 |
| M2M100ForConditionalGeneration | 8 | 1.0116 | 0.8414 | 1.221 | 0.7337 | 2.7312 | 1.8492 |
| DistillGPT2 | 1 | 1.0304 | 0.861 | 1.2461 | 0.2495 | 2.66 | 1.9899 |
| GPT2ForSequenceClassification | 4 | 1.0005 | 0.9791 | 0.0 | 0.5063 | 2.3147 | 2.2729 |
| ElectraForQuestionAnswering | 64 | 1.0004 | 0.9685 | 0.7694 | 1.3232 | 2.1251 | 2.0666 |
| MegatronBertForQuestionAnswering | 16 | 1.0365 | 0.8588 | 1.0366 | 0.7043 | 2.0304 | 1.7898 |
| PLBartForConditionalGeneration | 16 | 1.0144 | 0.8258 | 1.0606 | 0.6829 | 1.9498 | 1.7335 |
| MegatronBertForCausalLM | 16 | 1.0342 | 0.8662 | 1.0594 | 0.7062 | 1.8633 | 1.7744 |
| LayoutLMForSequenceClassification | 16 | 1.0005 | 0.9802 | 0.7756 | 1.1831 | 1.8542 | 1.8102 |
| ElectraForCausalLM | 32 | 1.0006 | 0.9292 | 0.7071 | 1.0477 | 1.8286 | 1.8296 |
| T5Small | 1 | 1.0156 | 0.8709 | 1.1839 | 0.8691 | 1.8074 | 1.4146 |
| XGLMForCausalLM | 8 | 1.0115 | 0.8257 | 0.9274 | 0.2743 | 1.7886 | 1.5929 |
| PegasusForConditionalGeneration | 16 | 0.9872 | 0.8161 | 0.9202 | 0.6863 | 1.7005 | 1.5702 |
| MBartForConditionalGeneration | 16 | 1.0113 | 0.8379 | 0.9032 | 0.6791 | 1.6872 | 1.6082 |
| AlbertForQuestionAnswering | 4 | 1.0004 | 0.8854 | 0.0 | 1.2513 | 1.6663 | 1.6574 |
| AlbertForMaskedLM | 4 | 1.0006 | 0.8849 | 0.0 | 1.2458 | 1.6566 | 1.6489 |
| LayoutLMForMaskedLM | 16 | 1.0001 | 0.9711 | 0.7558 | 1.1371 | 1.6513 | 1.6369 |
| T5ForConditionalGeneration | 4 | 0.9965 | 0.9023 | 0.7538 | 1.1873 | 1.6173 | 1.588 |
| Speech2Text2ForCausalLM | 128 | 1.002 | 0.9273 | 0.7189 | 0.8464 | 1.566 | 1.5923 |
| OPTForCausalLM | 32 | 1.0108 | 0.9305 | 0.7787 | 0.3346 | 1.5636 | 1.5437 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.9836 | 0.7796 | 1.0891 | 1.5075 | 1.4653 |
| BertForQuestionAnswering | 128 | 1.0 | 0.9845 | 0.7786 | 1.0802 | 1.496 | 1.4704 |
| DistilBertForQuestionAnswering | 64 | 1.0014 | 0.9688 | 0.748 | 0.3559 | 1.4933 | 1.4501 |
| BartForConditionalGeneration | 2 | 1.0035 | 0.9682 | 0.0 | 0.3197 | 1.4582 | 1.4288 |
| RobertaForCausalLM | 64 | 1.0003 | 0.9591 | 0.7524 | 0.9484 | 1.444 | 1.4232 |
| BartForCausalLM | 4 | 1.0007 | 0.9684 | 0.7572 | 1.0034 | 1.4427 | 1.4242 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0079 | 0.9207 | 0.7463 | 0.8566 | 1.4268 | 1.4642 |
| BertForMaskedLM | 64 | 0.9995 | 0.9585 | 0.7405 | 0.9584 | 1.3515 | 1.3373 |
| DebertaForMaskedLM | 4 | 0.9072 | 0.7134 | 0.787 | 0.0 | 1.2981 | 1.1452 |
| PLBartForCausalLM | 32 | 1.0071 | 0.939 | 0.8301 | 0.8818 | 1.2813 | 1.2864 |
| DistilBertForMaskedLM | 64 | 1.0004 | 0.952 | 0.7098 | 0.4589 | 1.2705 | 1.2649 |
| BlenderbotSmallForCausalLM | 64 | 1.003 | 0.9262 | 0.7158 | 0.8811 | 1.2627 | 1.2907 |
| MBartForCausalLM | 32 | 1.0035 | 0.9463 | 0.7548 | 0.8924 | 1.1967 | 1.1938 |
| TrOCRForCausalLM | 32 | 1.0006 | 0.9466 | 0.7566 | 0.8839 | 1.1947 | 1.1968 |
| PegasusForCausalLM | 32 | 1.0017 | 0.9361 | 0.7472 | 0.8902 | 1.1826 | 1.2192 |
| DebertaForQuestionAnswering | 8 | 0.9893 | 0.8638 | 0.7238 | 0.0 | 1.1547 | 1.2025 |
| BigBird | 1 | 0.9857 | 0.901 | 1.0379 | 0.8194 | 1.1426 | 1.0251 |
| AllenaiLongformerBase | 1 | 0.9182 | 0.7217 | 0.8726 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| AlbertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BigBird | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 5.2039 | 11.4483 | 35.2016 | nan | 101.723 | 38.8687 |
| DebertaForMaskedLM | 4 | 5.1579 | 11.4277 | 35.7592 | nan | 100.2849 | 37.1809 |
| XGLMForCausalLM | 8 | 2.9804 | 13.7642 | 28.2128 | 224.6673 | 75.1365 | 72.9434 |
| MobileBertForMaskedLM | 32 | 9.6922 | 32.1377 | 58.0526 | 358.1715 | 67.8398 | 65.7983 |
| MobileBertForQuestionAnswering | 64 | 9.4749 | 33.0771 | 57.5565 | 368.1802 | 65.5677 | 64.1723 |
| M2M100ForConditionalGeneration | 8 | 3.6966 | 17.0294 | 28.7001 | 320.4171 | 63.899 | 60.1142 |
| PegasusForConditionalGeneration | 16 | 3.5571 | 17.0473 | 27.6315 | 335.4263 | 52.9764 | 49.7124 |
| BartForConditionalGeneration | 2 | 3.6404 | 17.1168 | nan | 315.4639 | 50.2562 | 48.8086 |
| MBartForConditionalGeneration | 16 | 3.7255 | 17.6216 | 28.9212 | 328.4802 | 50.1819 | 49.3859 |
| BigBird | 1 | 8.3524 | 15.8594 | 30.8296 | 121.1051 | 44.2144 | 29.3041 |
| YituTechConvBert | 1 | 2.6519 | 11.0476 | 17.991 | 160.4812 | 43.8513 | 41.3341 |
| MegatronBertForCausalLM | 16 | 3.7322 | 14.3931 | 22.976 | 239.0209 | 41.8029 | 39.9144 |
| MegatronBertForQuestionAnswering | 16 | 3.8774 | 14.5135 | 22.1991 | 241.3159 | 40.2522 | 39.0569 |
| MT5ForConditionalGeneration | 8 | 3.7663 | 13.0484 | 21.1825 | 128.9855 | 37.5153 | 35.8479 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.3584 | 11.5031 | 19.4796 | 195.1605 | 34.3312 | 33.6868 |
| T5Small | 1 | 2.6791 | 8.9082 | 13.3945 | 82.6197 | 32.0447 | 31.1183 |
| T5ForConditionalGeneration | 4 | 2.5893 | 8.9683 | 13.9937 | 82.2253 | 30.2131 | 29.091 |
| PLBartForConditionalGeneration | 16 | 1.9463 | 9.101 | 14.0493 | 152.8782 | 29.7226 | 29.1216 |
| LayoutLMForSequenceClassification | 16 | 2.1697 | 7.7221 | 11.9189 | 103.2928 | 28.4985 | 27.6303 |
| ElectraForCausalLM | 32 | 1.8243 | 7.2322 | 11.8398 | 107.5144 | 27.8612 | 25.0741 |
| PegasusForCausalLM | 32 | 1.4486 | 7.0286 | 10.3468 | 99.049 | 24.841 | 23.4955 |
| MBartForCausalLM | 32 | 1.4718 | 6.6836 | 10.1942 | 99.2287 | 23.7045 | 23.0471 |
| LayoutLMForMaskedLM | 16 | 2.2277 | 7.5966 | 11.5761 | 104.2483 | 23.5468 | 21.8664 |
| TrOCRForCausalLM | 32 | 1.4226 | 6.6435 | 10.0138 | 99.8449 | 22.9179 | 22.1228 |
| BartForCausalLM | 4 | 1.4472 | 6.606 | 9.7927 | 95.9715 | 22.4559 | 21.8446 |
| OPTForCausalLM | 32 | 1.4425 | 6.8953 | 11.8794 | 94.4011 | 21.9951 | 20.993 |
| BertForMaskedLM | 64 | 1.7172 | 7.1971 | 10.5499 | 97.8209 | 21.8849 | 21.5014 |
| RobertaForCausalLM | 64 | 1.7633 | 7.3588 | 10.8718 | 99.7743 | 21.863 | 21.1439 |
| ElectraForQuestionAnswering | 64 | 1.8054 | 7.1984 | 10.5826 | 107.8436 | 21.6272 | 20.7925 |
| BertForQuestionAnswering | 128 | 1.7614 | 7.2223 | 10.5569 | 99.972 | 21.1397 | 20.0655 |
| CamemBert | 1 | 1.8165 | 7.0644 | 10.2778 | 109.1577 | 20.1693 | 19.5514 |
| RobertaForQuestionAnswering | 128 | 1.7758 | 7.2068 | 10.9353 | 96.3265 | 19.9386 | 19.4211 |
| GPT2ForSequenceClassification | 4 | 1.6786 | 6.4372 | nan | 68.4903 | 19.5413 | 18.7792 |
| AlbertForMaskedLM | 4 | 1.629 | 6.7641 | nan | 98.0684 | 19.2728 | 18.0176 |
| AlbertForQuestionAnswering | 4 | 1.607 | 6.8705 | nan | 96.6561 | 18.8036 | 17.6876 |
| BlenderbotSmallForCausalLM | 64 | 0.9523 | 4.321 | 6.5971 | 67.2544 | 16.9157 | 16.4531 |
| Speech2Text2ForCausalLM | 128 | 0.8355 | 3.452 | 5.9104 | 51.6587 | 15.9755 | 14.4025 |
| PLBartForCausalLM | 32 | 0.7769 | 3.479 | 5.3725 | 57.5212 | 15.1674 | 14.9375 |
| DistillGPT2 | 1 | 0.9088 | 3.379 | 4.6513 | 48.2957 | 14.0362 | 13.0891 |
| DistilBertForMaskedLM | 64 | 0.7473 | 3.5784 | 5.8452 | 49.3421 | 13.3026 | 12.7282 |
| DistilBertForQuestionAnswering | 64 | 0.7733 | 3.6197 | 5.9815 | 45.6609 | 12.3835 | 11.9014 |
| AllenaiLongformerBase | 1 | 7.0427 | 15.7405 | 59.0974 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | 0.5409 | 1.1305 | 1.559 |
| AlbertForMaskedLM | 4 | 0.9998 | 0.7431 | nan | 0.5373 | 1.0992 | 1.5169 |
| GPT2ForSequenceClassification | 4 | 0.9675 | 0.9162 | nan | 0.8823 | 1.0776 | 1.163 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | 0.8047 | 1.0568 | 1.1144 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | 0.9311 | 1.017 | 1.0704 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | 0.8741 | 1.0109 | 1.0722 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | 0.8741 | 1.0109 | 1.0722 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | 0.915 | 1.0045 | 1.0277 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | 0.8978 | 0.9837 | 1.1976 |
| PegasusForCausalLM | 32 | 0.975 | 0.9115 | 0.4176 | 0.7451 | 0.9709 | 1.0364 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9527 | 0.3625 | 0.8385 | 0.9662 | 1.1856 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8919 | 0.396 | 0.8919 | 0.9593 | 1.1105 |
| T5Small | 1 | 1.0 | 0.8865 | 0.3606 | 0.863 | 0.9567 | 1.1277 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3661 | 0.7765 | 0.9481 | 0.9848 |
| MBartForCausalLM | 32 | 1.0001 | 0.8924 | 0.3997 | 0.7337 | 0.9418 | 1.0115 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3786 | 0.719 | 0.9293 | 0.9793 |
| RobertaForCausalLM | 64 | 0.999 | 0.8994 | 0.3787 | 0.718 | 0.9289 | 0.9788 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3467 | 0.81 | 0.9267 | 1.0655 |
| OPTForCausalLM | 32 | 0.9999 | 0.868 | 0.3727 | 0.6771 | 0.9249 | 1.0061 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | 0.4002 | 0.8538 | 0.9218 | 1.0986 |
| TrOCRForCausalLM | 32 | 1.0001 | 0.8922 | 0.3997 | 0.7331 | 0.9211 | 0.9878 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9628 | 0.4377 | 0.9635 | 0.9159 | 1.0993 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | 0.8399 | 0.893 | 1.0179 |
| MegatronBertForCausalLM | 16 | 0.9997 | 0.8597 | 0.4044 | 0.8476 | 0.8918 | 1.0275 |
| PLBartForConditionalGeneration | 16 | 0.9984 | 0.9002 | 0.4147 | 0.8751 | 0.8848 | 1.028 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8599 | 0.3635 | 0.6373 | 0.8803 | 0.948 |
| MT5ForConditionalGeneration | 8 | 0.9197 | 0.8304 | 0.4067 | 0.6968 | 0.8756 | 0.9197 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.3532 | 0.6046 | 0.8691 | 0.9801 |
| ElectraForCausalLM | 32 | 0.9974 | 0.848 | 0.3928 | 0.6141 | 0.856 | 0.9327 |
| PLBartForCausalLM | 32 | 1.0 | 0.844 | 0.3977 | 0.6218 | 0.8546 | 0.9358 |
| BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8173 | 0.3687 | 0.5749 | 0.846 | 0.9426 |
| BigBird | 1 | 1.0009 | 0.9567 | 0.4478 | 0.8512 | 0.8172 | 1.0883 |
| CamemBert | 1 | 0.9998 | 0.815 | 0.4163 | 0.7874 | 0.8062 | 0.9318 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9164 | 0.4336 | 0.7757 | 0.8055 | 0.9902 |
| DistillGPT2 | 1 | 0.9964 | 0.7982 | 0.4006 | 0.7466 | 0.7997 | 1.0175 |
| M2M100ForConditionalGeneration | 8 | 0.9892 | 0.9567 | 0.4257 | 0.8536 | 0.7882 | 1.0174 |
| YituTechConvBert | 1 | 0.9711 | 0.8645 | 0.4304 | 0.8314 | 0.7879 | 0.9269 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | 0.3466 | 0.6905 | 0.6698 | 0.9454 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | 0.3107 | 0.9859 | 0.6085 | 0.8221 |
| DebertaForMaskedLM | 4 | 0.9983 | 0.9808 | 0.3623 | nan | 0.4088 | 1.0667 |
| DebertaForQuestionAnswering | 8 | 0.9753 | 1.0735 | 0.3251 | nan | 0.307 | 1.1932 |
| AllenaiLongformerBase | 1 | 0.9995 | 0.9481 | 0.3847 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
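
To summarize a speedup column, one option is a geometric mean over the models that actually produced a number. The sketch below is only an illustration: `geomean_speedup` is a hypothetical helper, and both the use of a geometric mean and the choice to skip `0.0` / `nan` cells (which in these tables mostly line up with `fail_to_run` entries) are assumptions, not a description of how the dashboard aggregates.

```python
import math


def geomean_speedup(values):
    """Geometric mean of positive speedups, plus how many values were used.

    Assumption: cells equal to 0.0 or nan mean "no measurement" and are
    skipped rather than counted as a real slowdown.
    """
    valid = [v for v in values if not math.isnan(v) and v > 0]
    if not valid:
        return math.nan, 0
    log_sum = sum(math.log(v) for v in valid)
    return math.exp(log_sum / len(valid)), len(valid)


if __name__ == "__main__":
    # A few inductor speedups copied from the huggingface table above,
    # including the 0.0 reported for AllenaiLongformerBase.
    inductor = [5.2367, 4.7995, 3.7958, 1.1426, 0.0]
    mean, used = geomean_speedup(inductor)
    print(f"geomean over {used} of {len(inductor)} models: {mean:.4f}")
```

Reporting the count next to the mean keeps a backend from looking better simply because its failing models were dropped.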

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| xcit_large_24_p8_224 | 5 | 1.0012 | 0.0 | 0.0 | 0.2652 | 2.4181 | 1.7876 |
| tnt_s_patch16_224 | 128 | 0.9998 | 0.9965 | 0.0 | 0.0 | 2.1299 | 2.0906 |
| regnety_002 | 128 | 0.9785 | 0.9426 | 1.1142 | 0.0 | 2.1255 | 1.4337 |
| ghostnet_100 | 128 | 1.002 | 0.9789 | 0.8858 | 0.0 | 2.1026 | 1.7902 |
| lcnet_050 | 128 | 0.9658 | 0.9465 | 0.8494 | 0.0 | 2.0278 | 1.6248 |
| twins_pcpvt_base | 64 | 1.0076 | 0.922 | 0.9105 | 0.0 | 1.9781 | 1.7211 |
| coat_lite_mini | 128 | 0.9997 | 0.9707 | 0.8291 | 1.1688 | 1.7557 | 1.7283 |
| res2net101_26w_4s | 64 | 1.0037 | 1.0068 | 0.9495 | 0.0 | 1.6105 | 1.3238 |
| hrnet_w18 | 128 | 1.004 | 1.0645 | 0.8614 | 0.0 | 1.6043 | 1.4671 |
| volo_d1_224 | 64 | 0.9997 | 0.9945 | 0.8451 | 0.0 | 1.5991 | 1.565 |
| dla102 | 128 | 1.0006 | 0.9963 | 0.8392 | 0.0 | 1.5826 | 1.55 |
| swin_base_patch4_window7_224 | 64 | 0.9998 | 0.953 | 0.0 | 0.0 | 1.5604 | 1.5107 |
| nfnet_l0 | 128 | 1.0001 | 0.8094 | 0.7133 | 0.9842 | 1.5499 | 1.4672 |
| gmlp_s16_224 | 128 | 1.0001 | 0.9421 | 0.7513 | 1.0842 | 1.5406 | 1.518 |
| resnest101e | 64 | 1.0042 | 0.9904 | 0.8144 | 0.0 | 1.5179 | 1.4319 |
| adv_inception_v3 | 128 | 0.9999 | 0.9962 | 0.8548 | 1.1611 | 1.5067 | 1.4706 |
| gluon_inception_v3 | 128 | 1.0001 | 0.993 | 0.8449 | 1.1611 | 1.5056 | 1.4701 |
| dm_nfnet_f0 | 128 | 0.9991 | 0.9997 | 0.8801 | 1.0982 | 1.5044 | 1.4311 |
| inception_v3 | 128 | 0.9999 | 0.9961 | 0.8544 | 1.163 | 1.5012 | 1.4666 |
| gmixer_24_224 | 128 | 1.0001 | 0.8418 | 0.6964 | 0.988 | 1.4862 | 1.4987 |
| res2net50_14w_8s | 128 | 1.0 | 1.0086 | 0.813 | 0.0 | 1.4778 | 1.4498 |
| cait_m36_384 | 4 | 0.9997 | 1.0087 | 0.0 | 0.7949 | 1.4655 | 1.4584 |
| mobilenetv3_large_100 | 128 | 0.9545 | 0.9437 | 0.7836 | 0.0 | 1.4509 | 1.4017 |
| crossvit_9_240 | 128 | 0.9995 | 0.9897 | 0.8385 | 0.9641 | 1.4488 | 1.4082 |
| selecsls42b | 128 | 0.9997 | 0.9958 | 0.8422 | 0.0 | 1.4418 | 1.4105 |
| mnasnet_100 | 128 | 0.9529 | 0.9444 | 0.7896 | 0.0 | 1.4267 | 1.4447 |
| res2next50 | 128 | 0.9995 | 0.9956 | 0.8316 | 0.0 | 1.4146 | 1.3456 |
| fbnetv3_b | 128 | 0.9515 | 0.9466 | 0.774 | 0.0 | 1.4063 | 1.3934 |
| mobilenetv2_100 | 128 | 0.9516 | 0.94 | 0.7164 | 0.0 | 1.3985 | 1.4278 |
| convit_base | 64 | 0.9998 | 0.996 | 0.8332 | 1.3037 | 1.3862 | 1.3277 |
| jx_nest_base | 32 | 0.9997 | 0.992 | 0.7991 | 0.0 | 1.3861 | 1.3552 |
| ese_vovnet19b_dw | 128 | 0.9705 | 0.9646 | 0.7676 | 0.0 | 1.3746 | 1.3786 |
| mobilevit_s | 64 | 0.9735 | 0.8147 | 0.657 | 0.0 | 1.3688 | 1.3559 |
| spnasnet_100 | 128 | 0.9459 | 0.9369 | 0.7746 | 0.0 | 1.3668 | 1.3442 |
| fbnetc_100 | 128 | 0.9529 | 0.9427 | 0.7943 | 0.0 | 1.3503 | 1.3746 |
| pnasnet5large | 16 | 1.0066 | 1.0346 | 0.8568 | 0.0 | 1.347 | 1.2887 |
| tf_efficientnet_b0 | 128 | 0.9651 | 0.8043 | 0.6674 | 0.0 | 1.3465 | 1.3505 |
| cspdarknet53 | 64 | 0.942 | 0.93 | 0.757 | 1.1217 | 1.3292 | 1.3398 |
| poolformer_m36 | 64 | 1.0 | 0.9974 | 0.8073 | 0.0 | 1.3272 | 1.2939 |
| pit_b_224 | 64 | 0.9997 | 0.9948 | 0.8213 | 1.0001 | 1.3194 | 1.3126 |
| botnet26t_256 | 128 | 0.9792 | 0.9703 | 0.8093 | 0.0 | 1.2883 | 1.2916 |
| mixer_b16_224 | 128 | 1.0002 | 0.9583 | 0.7769 | 0.9337 | 1.283 | 1.265 |
| eca_botnext26ts_256 | 128 | 0.9804 | 0.8099 | 0.6698 | 0.0 | 1.2799 | 1.2808 |
| deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9912 | 0.7967 | 1.0007 | 1.2796 | 1.265 |
| beit_base_patch16_224 | 64 | 0.9999 | 0.9773 | 0.0 | 0.9578 | 1.2774 | 1.2719 |
| rexnet_100 | 128 | 0.9638 | 0.8489 | 0.6908 | 0.0 | 1.2734 | 1.286 |
| tinynet_a | 128 | 0.9649 | 0.803 | 0.6659 | 0.0 | 1.2606 | 1.2596 |
| visformer_small | 128 | 0.9996 | 1.0004 | 0.841 | 0.0 | 1.2321 | 1.182 |
| sebotnet33ts_256 | 64 | 0.9666 | 0.8322 | 0.6745 | 0.0 | 1.1963 | 1.2121 |
| vit_base_patch16_224 | 64 | 0.9999 | 0.9931 | 0.8312 | 0.9547 | 1.1951 | 1.1787 |
| tf_mixnet_l | 128 | 0.9805 | 0.9089 | 0.7958 | 0.0 | 1.1744 | 1.1699 |
| mixnet_l | 128 | 0.9796 | 0.9052 | 0.7945 | 0.0 | 1.1589 | 1.1486 |
| gluon_xception65 | 32 | 1.0 | 0.9878 | 0.7553 | 0.0 | 1.1583 | 1.1239 |
| dpn107 | 32 | 0.9445 | 0.9283 | 0.7535 | 0.7887 | 1.1569 | 1.1749 |
| repvgg_a2 | 128 | 0.9431 | 0.9318 | 0.7762 | 0.0 | 1.1391 | 1.1558 |
| swsl_resnext101_32x16d | 32 | 0.9992 | 0.9821 | 0.8061 | 0.0 | 1.1301 | 1.054 |
| resmlp_12_224 | 128 | 1.0003 | 1.0083 | 0.7901 | 1.3438 | 1.1042 | 1.0724 |
| gernet_l | 128 | 0.9466 | 0.9375 | 0.7702 | 0.0 | 1.0653 | 1.076 |
| convmixer_768_32 | 32 | 0.9999 | 0.9981 | 0.9234 | 0.0 | 1.0554 | 1.0505 |
| convnext_base | 64 | 0.9993 | 0.9957 | 0.801 | 0.6522 | 0.6583 | 0.6514 |
| eca_halonext26ts | 128 | 0.9807 | 0.8164 | 0.6788 | 0.0 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | fail_to_run | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_accuracy | fail_accuracy | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| convit_base | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | pass | fail_to_run | fail_accuracy |
| gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
~~~
Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 6.3103 | 29.6666 | 57.0128 | nan | 122.489 | 114.4404 |
| twins_pcpvt_base | 64 | 2.7506 | 14.9663 | 25.7769 | nan | 97.6825 | 97.1842 |
| pnasnet5large | 16 | 4.8831 | 22.2153 | 41.6387 | nan | 84.3566 | 80.2797 |
| xcit_large_24_p8_224 | 5 | 3.2752 | nan | nan | 450.7919 | 80.0919 | 76.5741 |
| swin_base_patch4_window7_224 | 64 | 3.1176 | 13.1833 | nan | nan | 79.2128 | 76.4142 |
| mobilevit_s | 64 | 1.821 | 7.4257 | 15.6294 | nan | 76.3534 | 72.2325 |
| convnext_base | 64 | 1.524 | 7.5008 | 11.3131 | 136.5031 | 71.3014 | 71.0576 |
| cait_m36_384 | 4 | 3.421 | 19.6687 | nan | 429.6244 | 70.6074 | 68.5862 |
| resnest101e | 64 | 3.5596 | 16.2849 | 27.2567 | nan | 68.8331 | 67.5903 |
| res2net101_26w_4s | 64 | 3.3898 | 16.1956 | 27.6793 | nan | 60.7655 | 58.3188 |
| jx_nest_base | 32 | 1.8766 | 9.5908 | 15.5267 | nan | 58.492 | 56.7978 |
| res2net50_14w_8s | 128 | 2.9026 | 14.9827 | 24.4624 | nan | 56.2849 | 52.8816 |
| coat_lite_mini | 128 | 1.218 | 5.2912 | 8.314 | 98.6037 | 52.28 | 51.9564 |
| sebotnet33ts_256 | 64 | 1.7054 | 6.1821 | 14.0072 | nan | 52.0462 | 50.8585 |
| poolformer_m36 | 64 | 2.0257 | 8.166 | 13.4198 | nan | 48.1441 | 45.7648 |
| gmlp_s16_224 | 128 | 1.294 | 7.5006 | 12.8811 | 148.5605 | 44.0688 | 42.3026 |
| dpn107 | 32 | 4.0785 | 13.4191 | 39.3515 | 207.551 | 43.2395 | 40.9895 |
| eca_botnext26ts_256 | 128 | 1.4617 | 4.8786 | 10.9828 | nan | 43.0897 | 40.6337 |
| fbnetv3_b | 128 | 3.332 | 11.6642 | 27.7302 | nan | 41.1548 | 40.2662 |
| crossvit_9_240 | 128 | 1.6629 | 8.8462 | 13.3339 | 182.2862 | 40.925 | 40.1967 |
| botnet26t_256 | 128 | 1.3964 | 4.2949 | 9.1949 | nan | 40.7898 | 39.0634 |
| volo_d1_224 | 64 | 1.446 | 7.5621 | 12.8705 | nan | 40.749 | 37.9669 |
| gluon_xception65 | 32 | 2.1006 | 10.874 | 17.7463 | nan | 40.3546 | 38.6455 |
| tnt_s_patch16_224 | 128 | 1.8005 | 10.9681 | nan | nan | 40.1535 | 38.0001 |
| inception_v3 | 128 | 1.6731 | 8.3068 | 13.2641 | 134.0862 | 37.5695 | 35.4856 |
| gluon_inception_v3 | 128 | 1.6699 | 8.5979 | 13.2998 | 135.7813 | 37.5 | 35.1782 |
| tf_mixnet_l | 128 | 5.8425 | 13.2375 | 26.6579 | nan | 36.9662 | 34.6119 |
| adv_inception_v3 | 128 | 1.682 | 8.5695 | 13.2803 | 134.1434 | 36.861 | 35.2915 |
| ghostnet_100 | 128 | 3.0394 | 9.6312 | 14.3899 | nan | 36.1116 | 35.1564 |
| mixnet_l | 128 | 5.6136 | 12.5583 | 26.4426 | nan | 35.5466 | 34.6084 |
| dla102 | 128 | 1.9026 | 9.3988 | 15.2898 | nan | 35.3706 | 33.3237 |
| swsl_resnext101_32x16d | 32 | 1.9118 | 9.1597 | 14.6754 | nan | 34.7434 | 32.8643 |
| gmixer_24_224 | 128 | 1.4638 | 8.2367 | 13.5491 | 146.1946 | 34.6762 | 33.104 |
| dm_nfnet_f0 | 128 | 2.2015 | 7.3133 | 11.1546 | 143.0822 | 32.2726 | 30.6277 |
| res2next50 | 128 | 1.8268 | 8.3196 | 12.8851 | nan | 31.9904 | 30.3178 |
| rexnet_100 | 128 | 1.9747 | 7.6181 | 16.9778 | nan | 30.4796 | 28.7636 |
| convit_base | 64 | 1.2557 | 6.2245 | 9.8883 | 131.7055 | 30.4181 | 29.8106 |
| tinynet_a | 128 | 2.1556 | 8.0614 | 20.4583 | nan | 29.4132 | 28.0511 |
| tf_efficientnet_b0 | 128 | 1.9346 | 7.1382 | 16.2109 | nan | 26.4734 | 24.3745 |
| cspdarknet53 | 64 | 2.4326 | 7.6703 | 18.609 | 123.8398 | 25.9637 | 24.4766 |
| fbnetc_100 | 128 | 2.1222 | 6.7103 | 17.0744 | nan | 24.8049 | 23.2788 |
| mixer_b16_224 | 128 | 0.8835 | 3.875 | 5.9847 | 64.4646 | 24.7164 | 23.5622 |
| visformer_small | 128 | 0.99 | 4.257 | 6.2854 | nan | 24.6834 | 23.0038 |
| resmlp_12_224 | 128 | 0.7767 | 3.2545 | 5.0451 | 46.8154 | 24.6357 | 23.6778 |
| convmixer_768_32 | 32 | 1.3528 | 6.253 | 9.8935 | nan | 24.4131 | 23.5473 |
| spnasnet_100 | 128 | 2.0543 | 6.5157 | 17.091 | nan | 24.3068 | 23.4702 |
| vit_base_patch16_224 | 64 | 0.9951 | 4.6233 | 7.7478 | 70.1628 | 24.296 | 22.4854 |
| deit_base_distilled_patch16_224 | 64 | 1.0089 | 4.8373 | 7.3033 | 68.532 | 24.2484 | 23.1612 |
| nfnet_l0 | 128 | 2.0342 | 7.3569 | 10.8326 | 126.2492 | 23.396 | 22.437 |
| mobilenetv3_large_100 | 128 | 1.6341 | 5.6079 | 13.2695 | nan | 22.9939 | 22.079 |
| beit_base_patch16_224 | 64 | 1.3591 | 5.5932 | nan | 95.7524 | 22.962 | 21.4329 |
| pit_b_224 | 64 | 1.1212 | 5.4341 | 8.6623 | 89.1015 | 21.724 | 20.731 |
| mobilenetv2_100 | 128 | 1.6469 | 5.4603 | 13.8637 | nan | 21.2525 | 20.1633 |
| mnasnet_100 | 128 | 1.6509 | 5.2148 | 13.101 | nan | 20.3948 | 19.9418 |
| regnety_002 | 128 | 1.6671 | 5.6295 | 12.9382 | nan | 20.3762 | 19.2443 |
| gernet_l | 128 | 2.0437 | 6.0468 | 15.2207 | nan | 20.1203 | 19.1367 |
| repvgg_a2 | 128 | 2.0466 | 6.1994 | 16.2721 | nan | 20.0213 | 19.1666 |
| selecsls42b | 128 | 0.8581 | 3.8242 | 6.0042 | nan | 18.1995 | 17.3969 |
| lcnet_050 | 128 | 1.0867 | 3.3344 | 7.5722 | nan | 14.862 | 14.2457 |
| ese_vovnet19b_dw | 128 | 1.0772 | 3.2014 | 6.6896 | nan | 14.2209 | 13.48 |
| eca_halonext26ts | 128 | 1.5068 | 5.0933 | 11.2144 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| tinynet_a | 128 | 0.9889 | 0.7884 | 0.2766 | nan | 1.3706 | 1.5063 |
| gmixer_24_224 | 128 | 0.9923 | 0.9245 | 0.3005 | 0.7341 | 1.3099 | 1.3731 |
| gmlp_s16_224 | 128 | 0.9937 | 0.9496 | 0.3532 | 0.9046 | 1.2842 | 1.2998 |
| tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | nan | 1.1886 | 1.3558 |
| mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.1834 | 1.3102 |
| pnasnet5large | 16 | 1.057 | 0.9912 | 0.3633 | nan | 1.1602 | 1.2928 |
| rexnet_100 | 128 | 0.9885 | 0.785 | 0.2852 | nan | 1.1474 | 1.3179 |
| eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2669 | nan | 1.1068 | 1.2643 |
| poolformer_m36 | 64 | 0.998 | 0.9431 | 0.3412 | nan | 1.1021 | 1.1166 |
| tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0828 | 1.1492 |
| resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 1.0592 | 1.1461 |
| mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | nan | 1.0587 | 1.152 |
| convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 0.8278 | 1.0528 | 1.1534 |
| volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.0378 | 1.1389 |
| dm_nfnet_f0 | 128 | 0.9694 | 0.8983 | 0.3557 | 0.8024 | 1.0335 | 1.1297 |
| nfnet_l0 | 128 | 0.9885 | 0.8173 | 0.2681 | 0.5287 | 1.0331 | 1.1819 |
| beit_base_patch16_224 | 64 | 0.9953 | 0.9328 | nan | 0.8763 | 1.0004 | 1.0448 |
| pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 0.7317 | 0.9907 | 1.228 |
| fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.9862 | 1.0421 |
| convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9746 | 0.9788 |
| twins_pcpvt_base | 64 | 0.9945 | 0.9233 | 0.3401 | nan | 0.9743 | 1.0803 |
| visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9622 | 1.0523 |
| dla102 | 128 | 0.9694 | 0.9121 | 0.3363 | nan | 0.9556 | 1.0313 |
| ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | nan | 0.9489 | 1.0708 |
| tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2875 | nan | 0.9366 | 1.0876 |
| xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | 0.8717 | 0.932 | 0.9932 |
| mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | nan | 0.9307 | 1.0268 |
| cait_m36_384 | 4 | 0.9999 | 0.9142 | nan | 0.9121 | 0.9291 | 0.979 |
| ese_vovnet19b_dw | 128 | 0.9857 | 0.8565 | 0.3272 | nan | 0.9179 | 1.0682 |
| swsl_resnext101_32x16d | 32 | 0.9989 | 0.8786 | 0.3676 | nan | 0.9112 | 0.981 |
| mixer_b16_224 | 128 |
0.992 | 0.9362 | 0.3444 | 0.8116 | 0.9073 | 0.9799 | | dpn107 | 32 | 0.9969 | 0.9096 | 0.3531 | 0.817 | 0.9071 | 0.9971 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8977 | 0.973 | | gluon_xception65 | 32 | 0.9954 | 0.8854 | 0.3349 | nan | 0.8973 | 0.9761 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | nan | 0.8973 | 0.9876 | | inception_v3 | 128 | 0.9823 | 0.8619 | 0.3342 | 0.6586 | 0.8973 | 1.0246 | | gluon_inception_v3 | 128 | 0.9823 | 0.8619 | 0.3342 | 0.6586 | 0.8973 | 1.0246 | | adv_inception_v3 | 128 | 0.9823 | 0.8619 | 0.3342 | 0.6586 | 0.8973 | 1.0246 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.8969 | 1.0032 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | nan | 0.8926 | 0.9897 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 0.8681 | 0.8916 | 0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 0.8683 | 0.8911 | 0.8962 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | nan | 0.8795 | 0.9819 | | res2net50_14w_8s | 128 | 0.9907 | 0.907 | 0.3231 | nan | 0.8769 | 0.9736 | | convnext_base | 64 | 1.0029 | 0.926 | 0.3509 | 0.7633 | 0.8761 | 0.9863 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | nan | 0.8719 | 0.9671 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | nan | 0.871 | 0.9804 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2716 | nan | 0.8703 | 1.0094 | | gernet_l | 128 | 0.979 | 0.8501 | 0.3443 | nan | 0.8617 | 0.9854 | | cspdarknet53 | 64 | 0.9913 | 0.8405 | 0.3241 | 0.6525 | 0.8606 | 1.0104 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | nan | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | nan | 0.8449 | 0.9432 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | nan | 0.8371 | 1.0078 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 0.8252 | 0.8174 | 1.0976 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 0.5942 | 0.8032 | 1.0344 | | repvgg_a2 | 128 | 0.9768 | 0.7822 | 0.3407 | nan | 0.7908 | 0.9916 | | resmlp_12_224 | 128 | 0.9827 | 0.687 | 0.2373 | 0.621 | 
0.7876 | 0.8011 | | swin_base_patch4_window7_224 | 64 | 0.9969 | 0.9208 | nan | nan | 0.7569 | 0.9259 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3213 | nan | 0.745 | 0.8292 | | jx_nest_base | 32 | 0.9983 | 0.8928 | 0.3399 | nan | 0.6707 | 0.8617 | | eca_halonext26ts | 128 | 0.9886 | 0.7748 | 0.2673 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/0G5oG6e.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/1NqA7Gk.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/kK7Xlxs.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs, and each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats:

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When computing speedup, compilation latency, and memory footprint reduction, we exclude the models that fail the accuracy checks.
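
As a rough illustration of that filtering step, the sketch below drops failing models before any aggregation. It is not the dashboard's actual code: the record layout and the `passes_accuracy` helper are hypothetical, and treating every status that starts with `pass` (including `pass_due_to_skip`) as passing is an assumption. The status strings themselves are the ones that appear in the accuracy tables.

```python
# Hypothetical per-model records; the field names are illustrative only.
results = [
    {"name": "model_a", "accuracy": "pass", "speedup": 1.50},
    {"name": "model_b", "accuracy": "fail_accuracy", "speedup": 1.40},
    {"name": "model_c", "accuracy": "fail_to_run", "speedup": 0.0},
    {"name": "model_d", "accuracy": "pass_due_to_skip", "speedup": 1.10},
]


def passes_accuracy(record):
    # Assumption: any status beginning with "pass" counts as passing.
    return record["accuracy"].startswith("pass")


# Keep only the models that passed; aggregate metrics are computed on these.
kept = [r for r in results if passes_accuracy(r)]
kept_names = [r["name"] for r in kept]
print(kept_names)
```

Filtering first keeps the `0.0` placeholder speedups of failed runs out of the aggregate numbers.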

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 54/56 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 96%, 54/56 | 100%, 43/43 | 97%, 59/61  |
|     aot_cudagraphs     | 82%, 46/56 | 77%, 33/43  | 44%, 27/61  |
|    nvprims_nvfuser     | 82%, 46/56 | 60%, 26/43  | 67%, 41/61  |
|        inductor        | 86%, 48/56 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 93%, 52/56 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
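
Each cell above pairs a rounded percentage with the raw pass count. A minimal sketch of that formatting, assuming the percentage is rounded to the nearest integer (which is consistent with the values shown, but the helper name `format_passrate` is made up for illustration):

```python
def format_passrate(passed, total):
    """Return a 'NN%, passed/total' string shaped like the cells above."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: round to the nearest whole percent.
    percent = round(100 * passed / total)
    return f"{percent}%, {passed}/{total}"


print(format_passrate(54, 56))
print(format_passrate(27, 61))
```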

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.02x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.11x    |    1.05x    |    1.00x    |
|    nvprims_nvfuser     |   1.05x    |    1.03x    |    1.14x    |
|        inductor        |   1.50x    |    1.29x    |    1.24x    |
| inductor_no_cudagraphs |   1.24x    |    1.22x    |    1.23x    |
+------------------------+------------+-------------+-------------+
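
For readers unfamiliar with the metric, here is a small sketch of a geometric mean over per-model speedups. It is an illustration rather than the dashboard's implementation, and the sample values are made up. Models that failed (shown as `0.0` in the per-model tables) would have to be removed beforehand, since the geometric mean is only defined for positive values.

```python
import math


def geomean(values):
    """Geometric mean of positive values, computed in log space."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up speedups: a 2x win and a 2x loss cancel out.
speedups = [2.0, 0.5, 1.0]
print(f"{geomean(speedups):.2f}x")
```

Unlike the arithmetic mean, this treats a 2x speedup and a 2x slowdown symmetrically. Python's standard library offers the same computation as `statistics.geometric_mean`.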

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.15    |    2.42     |    1.92     |
|       aot_eager        |    5.82    |    7.70     |    7.04     |
|     aot_cudagraphs     |    8.66    |    15.96    |    13.09    |
|    nvprims_nvfuser     |   79.49    |   131.82    |   152.15    |
|        inductor        |   29.49    |    29.75    |    34.69    |
| inductor_no_cudagraphs |   28.94    |    25.44    |    33.11    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.92x    |    1.00x    |    0.93x    |
|        inductor        |   0.87x    |    0.72x    |    0.98x    |
| inductor_no_cudagraphs |   1.01x    |    0.96x    |    1.09x    |
+------------------------+------------+-------------+-------------+
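
The ratio can be read as follows, assuming it is defined as the eager peak memory divided by the backend's peak memory. That definition is an assumption consistent with "higher is better", not something stated in this comment, and the function name and byte counts below are hypothetical.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Assumed definition: eager peak memory / backend peak memory."""
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return eager_peak_bytes / backend_peak_bytes


GIB = 1024 ** 3
# Backend uses less memory than eager: ratio above 1 (good).
print(f"{compression_ratio(10 * GIB, 8 * GIB):.2f}x")
# Backend uses more memory than eager: ratio below 1 (bad).
print(f"{compression_ratio(10 * GIB, 25 * GIB):.2f}x")
```

Under this reading, a value such as 0.39x means the backend's peak memory was roughly 2.5 times that of eager.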

Warnings

Performance speedup warnings
~~~
+-------------+------------------------+----------+------------------------+
| suite       | name                   | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench  | soft_actor_critic      | 1.4193   | 0.9377                 |
| torchbench  | nvidia_deeprecommender | 0.9042   | 0.9637                 |
| torchbench  | dlrm                   | 0.0      | 1.0822                 |
| torchbench  | hf_GPT2_large          | 0.0      | 1.4736                 |
| torchbench  | hf_T5                  | 0.0      | 1.5675                 |
| torchbench  | tacotron2              | 0.0      | 0.9038                 |
| torchbench  | hf_Longformer          | 0.0      | 0.0                    |
| torchbench  | moco                   | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase  | 0.0      | 0.0                    |
| timm_models | resmlp_12_224          | 0.8188   | 0.8155                 |
| timm_models | tnt_s_patch16_224      | 0.0      | 1.5389                 |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-----------------------+----------+------------------------+
| suite       | name                  | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  | yolov3                | 371.3674 | 370.5636               |
| torchbench  | timm_efficientdet     | 122.9013 | 119.0748               |
| torchbench  | tacotron2             | nan      | 63.2354                |
| torchbench  | hf_GPT2_large         | nan      | 41.4188                |
| torchbench  | hf_T5                 | nan      | 26.1883                |
| torchbench  | dlrm                  | nan      | 2.8855                 |
| torchbench  | hf_Longformer         | nan      | nan                    |
| torchbench  | moco                  | nan      | nan                    |
| huggingface | AllenaiLongformerBase | nan      | nan                    |
| timm_models | tnt_s_patch16_224     | nan      | 31.3326                |
+-------------+-----------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0018                 |
| torchbench  | hf_Albert                               | 0.8836   | 1.2212                 |
| torchbench  | mobilenet_v3_large                      | 0.8829   | 0.896                  |
| torchbench  | hf_T5_large                             | 0.8737   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8622   | 1.0312                 |
| torchbench  | densenet121                             | 0.857    | 1.0006                 |
| torchbench  | resnet50                                | 0.8566   | 0.9343                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.1229                 |
| torchbench  | hf_Bart                                 | 0.8318   | 1.1278                 |
| torchbench  | resnext50_32x4d                         | 0.8302   | 0.8356                 |
| torchbench  | BERT_pytorch                            | 0.826    | 1.0815                 |
| torchbench  | hf_BigBird                              | 0.8211   | 1.0391                 |
| torchbench  | dcgan                                   | 0.767    | 0.8875                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7609   | 0.9526                 |
| torchbench  | timm_vision_transformer                 | 0.7517   | 0.8216                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8335                 |
| torchbench  | hf_Bert                                 | 0.7062   | 1.0016                 |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | LearningToPaint                         | 0.6889   | 0.9399                 |
| torchbench  | vgg16                                   | 0.6637   | 0.9553                 |
| torchbench  | hf_DistilBert                           | 0.6595   | 0.9466                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | hf_Reformer                             | 0.5232   | 0.9892                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4867   | 0.6781                 |
| torchbench  | pytorch_struct                          | 0.4222   | 0.4335                 |
| torchbench  | functorch_dp_cifar10                    | 0.4056   | 0.4214                 |
| torchbench  | tacotron2                               | nan      | 1.1623                 |
| torchbench  | hf_T5                                   | nan      | 1.1507                 |
| torchbench  | hf_GPT2_large                           | nan      | 1.1258                 |
| torchbench  | dlrm                                    | nan      | 0.7306                 |
| torchbench  | hf_Longformer                           | nan      | nan                    |
| torchbench  | moco                                    | nan      | nan                    |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4039                 |
| huggingface | T5Small                                 | 0.8453   | 1.0606                 |
| huggingface | PegasusForConditionalGeneration         | 0.8436   | 1.0204                 |
| huggingface | AlbertForMaskedLM                       | 0.842    | 1.3737                 |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.1049                 |
| huggingface | BigBird                                 | 0.821    | 1.0085                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | DistillGPT2                             | 0.8057   | 0.9257                 |
| huggingface | M2M100ForConditionalGeneration          | 0.8055   | 1.0093                 |
| huggingface | ElectraForCausalLM                      | 0.7929   | 0.9036                 |
| huggingface | YituTechConvBert                        | 0.7898   | 0.8725                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.931                  |
| huggingface | BartForConditionalGeneration            | 0.7734   | 0.9515                 |
| huggingface | GoogleFnet                              | 0.7698   | 0.9372                 |
| huggingface | MT5ForConditionalGeneration             | 0.763    | 0.9406                 |
| huggingface | MegatronBertForQuestionAnswering        | 0.7528   | 0.9646                 |
| huggingface | CamemBert                               | 0.7487   | 0.9186                 |
| huggingface | PLBartForCausalLM                       | 0.7381   | 0.9055                 |
| huggingface | PLBartForConditionalGeneration          | 0.7238   | 0.9373                 |
| huggingface | MBartForConditionalGeneration           | 0.7209   | 0.9059                 |
| huggingface | LayoutLMForSequenceClassification       | 0.7189   | 1.0294                 |
| huggingface | MegatronBertForCausalLM                 | 0.7161   | 0.9247                 |
| huggingface | BartForCausalLM                         | 0.7149   | 0.9466                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.7147   | 0.8647                 |
| huggingface | ElectraForQuestionAnswering             | 0.7054   | 1.0298                 |
| huggingface | DistilBertForQuestionAnswering          | 0.6981   | 0.9303                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977   | 0.946                  |
| huggingface | LayoutLMForMaskedLM                     | 0.695    | 0.9772                 |
| huggingface | MBartForCausalLM                        | 0.6836   | 0.8978                 |
| huggingface | TrOCRForCausalLM                        | 0.6827   | 0.8876                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.6775   | 0.9179                 |
| huggingface | OPTForCausalLM                          | 0.6763   | 0.8849                 |
| huggingface | DistilBertForMaskedLM                   | 0.6531   | 0.9124                 |
| huggingface | BertForMaskedLM                         | 0.6385   | 0.8992                 |
| huggingface | RobertaForCausalLM                      | 0.6375   | 0.8974                 |
| huggingface | BertForQuestionAnswering                | 0.6329   | 0.8939                 |
| huggingface | RobertaForQuestionAnswering             | 0.6329   | 0.8939                 |
| huggingface | MobileBertForMaskedLM                   | 0.5256   | 0.7111                 |
| huggingface | MobileBertForQuestionAnswering          | 0.4536   | 0.5968                 |
| huggingface | DebertaForMaskedLM                      | 0.386    | 1.0347                 |
| huggingface | DebertaForQuestionAnswering             | 0.2902   | 1.1588                 |
| huggingface | AllenaiLongformerBase                   | nan      | nan                    |
| timm_models | selecsls42b                             | 0.899    | 1.0046                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8932   | 0.9946                 |
| timm_models | res2net50_14w_8s                        | 0.8822   | 1.0206                 |
| timm_models | regnety_002                             | 0.8617   | 1.0396                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.9622                 |
| timm_models | pit_b_224                               | 0.8563   | 1.0753                 |
| timm_models | sebotnet33ts_256                        | 0.841    | 0.9709                 |
| timm_models | coat_lite_mini                          | 0.821    | 1.0246                 |
| timm_models | gernet_l                                | 0.7928   | 0.9926                 |
| timm_models | resmlp_12_224                           | 0.7899   | 0.7979                 |
| timm_models | repvgg_a2                               | 0.7684   | 0.9902                 |
| timm_models | convit_base                             | 0.7462   | 0.9008                 |
| timm_models | crossvit_9_240                          | 0.6584   | 0.8853                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.8623                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~

Metrics over time

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/4JMJzio.png)

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/sfCpKmX.png)

Accuracy Regressions

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.001 | 0.9988 | 2.395 | 0.7665 | 5.2607 | 1.2692 | | timm_efficientdet | 1 | 0.9837 | 0.8861 | 1.8874 | 0.7648 | 4.2817 | 1.5167 | | functorch_dp_cifar10 | 64 | 1.0012 | 1.0262 | 2.1529 | 0.0 | 3.6546 | 1.2492 | | timm_vision_transformer | 8 | 1.0069 | 0.9168 | 1.652 | 0.6511 | 2.573 | 1.4003 | | drq | 1 | 1.0113 | 0.8496 | 1.6761 | 0.7622 | 2.475 | 1.143 | | BERT_pytorch | 16 | 1.0073 | 0.8952 | 1.0891 | 0.9843 | 2.0976 | 2.0509 | | resnext50_32x4d | 8 | 1.0008 | 1.1113 | 1.2327 | 0.8145 | 2.0211 | 1.2046 | | mobilenet_v3_large | 32 | 1.0073 | 1.1131 | 1.0951 | 0.8637 | 1.9681 | 1.2848 | | hf_T5_large | 2 | 1.0237 | 0.888 | 0.0 | 0.0 | 1.9215 | 1.6497 | | pytorch_struct | 200 | 0.9956 | 0.7482 | 0.8979 | 0.7933 | 1.8974 | 1.16 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9988 | 1.0103 | 1.3327 | 0.851 | 1.8353 | 1.5085 | | resnet18 | 16 | 1.006 | 1.1108 | 1.1505 | 0.8903 | 1.8088 | 1.2081 | | lennard_jones | 1000 | 0.9626 | 0.8544 | 1.0743 | 0.6859 | 1.7757 | 0.9521 | | squeezenet1_1 | 32 | 1.0008 | 1.0114 | 1.0489 | 0.8733 | 1.7454 | 1.2702 | | dcgan | 32 | 0.9721 | 1.0377 | 1.2568 | 0.8093 | 1.6629 | 1.0769 | | hf_Albert | 8 | 1.0017 | 0.996 | 0.7511 | 1.5572 | 1.6559 | 1.6378 | | shufflenet_v2_x1_0 | 128 | 1.0 | 1.0457 | 0.8112 | 0.9026 | 1.5469 | 1.3808 | | speech_transformer | 32 | 1.0063 | 0.8953 | 1.5171 | 0.8256 | 1.5374 | 1.5634 | | timm_resnest | 32 | 0.999 | 1.0016 | 0.8045 | 1.1652 | 1.5176 | 1.4521 | | hf_GPT2 | 4 | 1.0192 | 0.9791 | 0.7424 | 0.3845 | 1.5 | 1.4963 | | timm_nfnet | 128 | 0.9994 | 1.0002 | 0.0 | 1.1336 | 1.4784 | 1.4235 | | 
mnasnet1_0 | 32 | 0.9984 | 1.0959 | 0.8627 | 0.9187 | 1.4696 | 1.2788 | | mobilenet_v2_quantized_qat | 96 | 1.0004 | 0.9806 | 0.0 | 1.4595 | 1.4295 | 1.4283 | | mobilenet_v2 | 96 | 0.9999 | 0.9985 | 0.7318 | 1.337 | 1.4271 | 1.4087 | | fastNLP_Bert | 6 | 0.9988 | 0.9763 | 0.7527 | 1.1757 | 1.4213 | 1.3928 | | soft_actor_critic | 256 | 0.9819 | 0.7987 | 1.0847 | 0.7097 | 1.4193 | 0.9377 | | resnet50_quantized_qat | 32 | 1.0015 | 0.9735 | 0.0 | 1.1563 | 1.3891 | 1.3825 | | timm_efficientnet | 32 | 0.958 | 0.8111 | 0.7042 | 0.8165 | 1.3331 | 1.203 | | pytorch_stargan | 16 | 0.9999 | 1.0751 | 0.9321 | 0.0 | 1.3015 | 1.2279 | | LearningToPaint | 96 | 1.0013 | 1.0572 | 0.8701 | 0.9721 | 1.274 | 1.2071 | | resnet152 | 32 | 0.9991 | 1.0639 | 0.8084 | 0.899 | 1.2362 | 1.1909 | | resnet50 | 32 | 0.9996 | 0.9926 | 0.7596 | 0.988 | 1.206 | 1.1809 | | hf_Bart | 4 | 1.0116 | 0.9758 | 0.7435 | 0.9077 | 1.2028 | 1.1954 | | pytorch_unet | 1 | 0.9998 | 0.9976 | 0.8464 | 1.0897 | 1.2022 | 1.1886 | | hf_Bert | 4 | 1.0228 | 0.9956 | 0.7337 | 0.9029 | 1.1973 | 1.1913 | | vgg16 | 64 | 0.9998 | 0.9989 | 0.8591 | 0.9977 | 1.1734 | 1.1685 | | hf_DistilBert | 8 | 1.0012 | 0.9564 | 0.6912 | 0.5249 | 1.1728 | 1.1783 | | alexnet | 128 | 1.0 | 0.9982 | 0.8037 | 1.0035 | 1.1616 | 1.1635 | | Super_SloMo | 6 | 0.9996 | 0.9979 | 0.8684 | 0.992 | 1.1371 | 1.1203 | | hf_Reformer | 4 | 0.9981 | 1.002 | 0.9885 | 0.7364 | 1.132 | 1.1408 | | timm_regnet | 32 | 0.9656 | 0.9623 | 0.7794 | 1.0931 | 1.1309 | 1.092 | | Background_Matting | 4 | 1.0003 | 1.0217 | 0.8692 | 1.0792 | 1.1228 | 1.1142 | | yolov3 | 16 | 1.0 | 0.9941 | 0.7909 | 1.1515 | 1.0949 | 1.0786 | | hf_BigBird | 2 | 0.9898 | 0.9386 | 0.9484 | 0.9081 | 1.0917 | 0.9987 | | attention_is_all_you_need_pytorch | 256 | 1.0 | 0.966 | 0.7553 | 0.9536 | 1.0649 | 1.0484 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9927 | 0.0 | 0.0 | 1.0456 | 1.0336 | | tts_angular | 64 | 0.9847 | 0.962 | 0.989 | 0.9705 | 1.0114 | 1.0034 | | timm_vovnet | 32 | 0.9111 
| 0.9046 | 0.7153 | 0.9012 | 1.006 | 1.0174 | | demucs | 4 | 1.0005 | 1.0 | 0.9997 | 1.0004 | 1.0003 | 0.9994 | | nvidia_deeprecommender | 256 | 0.9991 | 0.9636 | 0.585 | 0.9769 | 0.9042 | 0.9637 | | dlrm | 2048 | 0.0 | 1.1379 | 0.0 | 1.0523 | 0.0 | 1.0822 | | hf_GPT2_large | 4 | 0.9991 | 0.9797 | 0.0 | 0.0 | 0.0 | 1.4736 | | hf_T5 | 8 | 1.0008 | 0.9523 | 0.0 | 1.2243 | 0.0 | 1.5675 | | tacotron2 | 64 | 0.9814 | 0.8481 | 0.0 | 0.7419 | 0.0 | 0.9038 | | hf_Longformer | 2 | 0.9529 | 0.8811 | 0.8129 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet152 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | 
pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | 
pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | hf_Albert | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.8229 | 7.348 | 10.0403 | 112.9184 | 371.3674 | 370.5636 | | timm_efficientdet | 1 | 19.3112 | 33.3551 | 66.5144 | 
489.8015 | 122.9013 | 119.0748 | | hf_T5_large | 2 | 13.5835 | 35.0856 | nan | nan | 103.1211 | 100.4682 | | timm_vision_transformer_large | 8 | 2.2996 | 11.0844 | nan | nan | 49.8259 | 50.0915 | | attention_is_all_you_need_pytorch | 256 | 1.1059 | 5.7244 | 8.7516 | 131.9068 | 45.0715 | 44.5644 | | densenet121 | 4 | 2.0314 | 9.7001 | 15.6137 | 166.601 | 41.627 | 40.4934 | | resnet152 | 32 | 2.2853 | 10.5785 | 17.4044 | 187.9947 | 39.6123 | 38.0203 | | timm_resnest | 32 | 0.5311 | 1.9764 | 3.078 | 59.6083 | 39.3072 | 39.6006 | | hf_BigBird | 2 | 7.6367 | 12.591 | 25.3012 | 97.7114 | 37.7909 | 25.6859 | | timm_vision_transformer | 8 | 0.772 | 3.3586 | 4.9667 | 75.7931 | 31.7662 | 30.9689 | | hf_Bart | 4 | 1.571 | 6.6304 | 10.1957 | 152.7631 | 28.2025 | 27.0672 | | timm_nfnet | 128 | 1.9392 | 6.1979 | nan | 154.905 | 27.7914 | 27.2483 | | BERT_pytorch | 16 | 1.4998 | 6.004 | 8.8078 | 100.7325 | 26.5481 | 26.0481 | | resnet50_quantized_qat | 32 | 1.0909 | 6.9368 | nan | 168.9774 | 26.4876 | 26.5798 | | mobilenet_v2_quantized_qat | 96 | 1.2166 | 7.2914 | nan | 182.1821 | 26.2448 | 26.2181 | | pytorch_stargan | 16 | 0.3913 | 1.6659 | 2.5967 | nan | 26.1729 | 26.4173 | | fastNLP_Bert | 6 | 1.4482 | 5.3 | 8.6573 | 102.4501 | 25.5138 | 24.3378 | | speech_transformer | 32 | 1.5676 | 6.5691 | 25.3587 | 151.5973 | 25.4907 | 25.4424 | | mobilenet_v3_large | 32 | 0.8315 | 3.8434 | 5.8631 | 99.4439 | 23.148 | 21.917 | | timm_regnet | 32 | 2.2276 | 6.5145 | 17.1193 | 111.4363 | 23.0573 | 22.1113 | | timm_efficientnet | 32 | 1.6787 | 5.5672 | 13.5517 | 107.689 | 22.6889 | 22.1896 | | pytorch_struct | 200 | 0.243 | 0.6319 | 1.1542 | 4.4928 | 22.332 | 19.0892 | | hf_Reformer | 4 | 1.6905 | 2.8654 | 5.4598 | 15.8557 | 19.6342 | 15.7578 | | hf_Bert | 4 | 1.4977 | 5.2418 | 7.7241 | 105.5213 | 18.2205 | 17.7942 | | mnasnet1_0 | 32 | 0.7551 | 3.4332 | 5.1257 | 72.6196 | 17.9304 | 17.388 | | resnet50 | 32 | 0.8156 | 3.5951 | 5.5086 | 77.8733 | 17.4999 | 17.3047 | | timm_vovnet | 32 | 
1.4584 | 3.7635 | 8.7057 | 56.9466 | 17.43 | 17.3119 |
| hf_Albert | 8 | 1.1838 | 4.6267 | 7.3866 | 115.8684 | 17.2064 | 16.4424 |
| shufflenet_v2_x1_0 | 128 | 0.859 | 3.9835 | 6.1295 | 82.8984 | 17.204 | 16.6087 |
| hf_GPT2 | 4 | 1.5359 | 4.9631 | 7.7576 | 83.3684 | 16.7975 | 15.9777 |
| resnext50_32x4d | 8 | 0.834 | 3.6165 | 5.4626 | 65.6794 | 16.4017 | 16.4698 |
| mobilenet_v2 | 96 | 0.7441 | 3.6858 | 5.8002 | 98.2439 | 16.1192 | 15.6424 |
| Background_Matting | 4 | 0.6827 | 3.3115 | 5.1857 | 69.09 | 15.4452 | 14.6275 |
| Super_SloMo | 6 | 0.8317 | 3.264 | 4.7267 | 27.4705 | 14.097 | 13.8203 |
| functorch_dp_cifar10 | 64 | 0.345 | 1.3499 | 2.0103 | nan | 12.7286 | 12.3125 |
| hf_DistilBert | 8 | 0.6346 | 2.5953 | 4.9238 | 47.6289 | 11.6879 | 11.2023 |
| resnet18 | 16 | 0.3877 | 1.4445 | 2.1025 | 29.624 | 10.8056 | 10.0768 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3638 | 1.5586 | 2.3068 | 33.2275 | 8.039 | 7.7886 |
| pytorch_unet | 1 | 0.3651 | 1.3748 | 2.1491 | 30.2876 | 7.1133 | 6.8579 |
| LearningToPaint | 96 | 0.4119 | 1.5082 | 2.3356 | 39.1172 | 6.889 | 6.5069 |
| squeezenet1_1 | 32 | 0.2235 | 0.64 | 1.0011 | 4.6323 | 3.8498 | 3.6406 |
| drq | 1 | 0.2955 | 0.5005 | 0.838 | 4.5372 | 3.7382 | 3.405 |
| vgg16 | 64 | 0.1751 | 0.4608 | 0.9011 | 3.1157 | 3.4269 | 3.2041 |
| soft_actor_critic | 256 | 0.1986 | 0.2929 | 0.4936 | 1.6238 | 3.3744 | 2.7604 |
| nvidia_deeprecommender | 256 | 0.1906 | 0.4015 | 0.6357 | 4.8685 | 3.3046 | 3.0138 |
| alexnet | 128 | 0.1538 | 0.3142 | 0.5821 | 3.1265 | 2.8742 | 2.7099 |
| dcgan | 32 | 0.1673 | 0.3595 | 0.5578 | 4.4151 | 2.6327 | 2.4154 |
| lennard_jones | 1000 | 0.1398 | 0.2405 | 0.3952 | 1.3833 | 1.9405 | 1.7779 |
| tts_angular | 64 | 0.2092 | 0.2462 | 0.3707 | 1.094 | 1.8955 | 1.769 |
| demucs | 4 | 0.302 | 0.2974 | 0.2996 | 0.3081 | 0.2284 | 0.2137 |
| tacotron2 | 64 | 17.1437 | 28.4388 | nan | 63.4764 | nan | 63.2354 |
| hf_GPT2_large | 4 | 5.0617 | 16.0219 | nan | nan | nan | 41.4188 |
| hf_T5 | 8 | 2.3428 | 7.7269 | nan | 89.2411 | nan | 26.1883 |
| dlrm | 2048 | nan | 0.7252 | nan | 2.8631 | nan | 2.8855 |
| hf_Longformer | 2 | 6.1322 | 13.0505 | 56.5726 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.5819 | 1.5819 |
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.2258 | 1.4874 | 1.4877 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2635 | 0.988 | 1.3107 | 1.3923 |
| Super_SloMo | 6 | 1.0024 | 0.9527 | 0.3631 | 0.9891 | 1.2026 | 1.4001 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1743 | 1.2832 |
| timm_efficientdet | 1 | 1.011 | 0.823 | 0.289 | 1.1336 | 1.1162 | 1.1442 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 1.0823 | 1.1864 |
| speech_transformer | 32 | 0.9977 | 0.9148 | 0.2707 | 1.021 | 1.0388 | 1.0455 |
| timm_nfnet | 128 | 0.936 | 0.8937 | nan | 0.7594 | 1.0219 | 1.0963 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | 0.9682 | 0.9832 | 1.0394 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 0.9814 | 1.0418 |
| hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3702 | 1.1241 | 0.9703 | 1.1374 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9405 | 1.0831 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.3341 | 0.8549 | 0.9238 | 1.1052 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9981 | 0.9166 | 0.3917 | 0.8952 | 0.9183 | 0.9986 |
| pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8497 | 0.9118 | 1.105 |
| resnet152 | 32 | 0.9975 | 0.9157 | 0.3424 | 0.8735 | 0.9067 | 0.9672 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 |
| timm_resnest | 32 | 0.9931 | 0.8807 | 0.3236 | 0.7927 | 0.8982 | 1.0018 |
| hf_Albert | 8 | 0.9332 | 0.9332 | 0.2846 | 1.0621 | 0.8836 | 1.2212 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8098 | 0.8829 | 0.896 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8737 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9998 | 0.8416 | nan | nan | 0.8622 | 1.0312 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.857 | 1.0006 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8566 | 0.9343 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8531 | 0.8659 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 1.2131 | 0.8354 | 1.1229 |
| hf_Bart | 4 | 0.9617 | 0.8774 | 0.3385 | 1.0863 | 0.8318 | 1.1278 |
| resnext50_32x4d | 8 | 0.9952 | 0.8668 | 0.3592 | 0.8203 | 0.8302 | 0.8356 |
| BERT_pytorch | 16 | 1.0 | 0.898 | 0.3502 | 1.1265 | 0.826 | 1.0815 |
| hf_BigBird | 2 | 0.9608 | 0.9608 | 0.4297 | 1.1745 | 0.8211 | 1.0391 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.8875 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7609 | 0.9526 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3308 | 1.0642 | 0.7517 | 0.8216 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9636 | 0.75 | 0.9991 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8335 |
| hf_Bert | 4 | 0.9683 | 0.9018 | 0.3526 | 1.0011 | 0.7062 | 1.0016 |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6902 | 0.7049 |
| LearningToPaint | 96 | 0.9452 | 0.6912 | 0.3387 | 0.6275 | 0.6889 | 0.9399 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9553 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3214 | 1.0216 | 0.6595 | 0.9466 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| hf_Reformer | 4 | 0.9872 | 0.9865 | 0.5793 | 0.9862 | 0.5232 | 0.9892 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9678 | 0.4867 | 0.6781 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4222 | 0.4335 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4056 | 0.4214 |
| tacotron2 | 64 | 0.9906 | 1.0301 | nan | 1.0227 | nan | 1.1623 |
| hf_T5 | 8 | 0.9527 | 0.9415 | nan | 0.9326 | nan | 1.1507 |
| hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 |
| dlrm | 2048 | nan | 0.7306 | nan | 0.7306 | nan | 0.7306 |
| hf_Longformer | 2 | 0.9603 | 0.9604 | 0.2944 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0277 | 0.9001 | 2.1059 | 0.7564 | 3.235 | 1.4156 |
| CamemBert | 1 | 1.0496 | 0.9235 | 1.3322 | 0.739 | 2.3743 | 1.5155 |
| MT5ForConditionalGeneration | 8 | 1.0283 | 0.9225 | 1.4011 | 0.9986 | 2.332 | 1.9867 |
| DistillGPT2 | 1 | 1.0354 | 0.92 | 1.0519 | 0.0 | 2.16 | 1.8036 |
| MobileBertForMaskedLM | 32 | 1.0235 | 0.9279 | 1.3123 | 0.0 | 2.0445 | 1.5427 |
| GoogleFnet | 1 | 0.9836 | 0.8082 | 0.9719 | 0.0 | 1.9892 | 1.1563 |
| GPT2ForSequenceClassification | 4 | 1.0 | 0.9747 | 0.0 | 0.7099 | 1.7962 | 1.7804 |
| M2M100ForConditionalGeneration | 8 | 1.0381 | 0.9399 | 0.9734 | 0.7983 | 1.6009 | 1.3123 |
| T5ForConditionalGeneration | 4 | 0.9997 | 0.9333 | 0.7284 | 1.1521 | 1.4503 | 1.4454 |
| ElectraForQuestionAnswering | 64 | 1.0009 | 0.9745 | 0.0 | 1.2366 | 1.4251 | 1.4065 |
| MobileBertForQuestionAnswering | 64 | 1.0215 | 0.9384 | 1.0093 | 0.0 | 1.4135 | 1.2957 |
| ElectraForCausalLM | 32 | 1.0005 | 0.9324 | 0.0 | 1.035 | 1.4118 | 1.451 |
| LayoutLMForSequenceClassification | 16 | 1.0002 | 0.9893 | 0.7377 | 1.1456 | 1.307 | 1.2908 |
| AlbertForQuestionAnswering | 4 | 1.0002 | 1.0014 | 0.0 | 1.2305 | 1.2563 | 1.2557 |
| T5Small | 1 | 1.0273 | 0.9303 | 0.9714 | 0.9759 | 1.2479 | 1.1306 |
| AlbertForMaskedLM | 4 | 1.0007 | 0.9996 | 0.0 | 1.2297 | 1.2467 | 1.2502 |
| PLBartForConditionalGeneration | 16 | 1.014 | 0.9636 | 0.8126 | 0.8246 | 1.2128 | 1.2736 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9709 | 0.0 | 1.0816 | 1.2102 | 1.2093 |
| OPTForCausalLM | 32 | 1.0035 | 0.9175 | 0.7139 | 0.4562 | 1.1895 | 1.2078 |
| XGLMForCausalLM | 8 | 1.0115 | 0.9459 | 0.7589 | 0.3184 | 1.1772 | 1.1826 |
| DistilBertForQuestionAnswering | 64 | 0.9997 | 0.9859 | 0.7124 | 0.5195 | 1.1713 | 1.1506 |
| RobertaForCausalLM | 64 | 1.0004 | 0.9631 | 0.7458 | 0.974 | 1.1498 | 1.1552 |
| MegatronBertForQuestionAnswering | 16 | 1.0416 | 1.0111 | 0.8118 | 0.7641 | 1.1424 | 1.1187 |
| MegatronBertForCausalLM | 16 | 1.0317 | 1.0045 | 0.7493 | 0.8059 | 1.1369 | 1.1224 |
| Speech2Text2ForCausalLM | 128 | 0.9989 | 0.926 | 0.661 | 0.9237 | 1.1206 | 1.1414 |
| BertForQuestionAnswering | 128 | 1.0 | 0.9931 | 0.0 | 1.0253 | 1.1128 | 1.107 |
| RobertaForQuestionAnswering | 128 | 1.0004 | 0.986 | 0.0 | 1.0271 | 1.1125 | 1.1059 |
| BartForCausalLM | 4 | 1.0007 | 0.9664 | 0.755 | 0.9957 | 1.0991 | 1.1102 |
| BartForConditionalGeneration | 2 | 1.0009 | 0.9876 | 0.0 | 0.4513 | 1.0954 | 1.0892 |
| BigBird | 1 | 0.9883 | 0.9281 | 1.0011 | 0.0 | 1.0909 | 1.0014 |
| MBartForConditionalGeneration | 16 | 1.012 | 0.9863 | 0.7617 | 0.0 | 1.0895 | 1.0922 |
| DebertaForMaskedLM | 4 | 0.9124 | 0.8018 | 0.7286 | 0.6426 | 1.087 | 1.0507 |
| PegasusForConditionalGeneration | 16 | 1.0102 | 0.9818 | 0.765 | 0.8939 | 1.0816 | 1.0844 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0007 | 0.9392 | 0.0 | 0.9446 | 1.0629 | 1.0707 |
| BertForMaskedLM | 64 | 0.9999 | 0.9614 | 0.7297 | 0.9654 | 1.0576 | 1.0581 |
| DistilBertForMaskedLM | 64 | 1.0002 | 0.9516 | 0.7071 | 0.6327 | 1.0516 | 1.0666 |
| DebertaForQuestionAnswering | 8 | 0.9964 | 0.9898 | 0.683 | 0.8056 | 1.0475 | 1.2202 |
| PLBartForCausalLM | 32 | 1.0055 | 0.9353 | 0.711 | 0.9104 | 1.027 | 1.0539 |
| BlenderbotSmallForCausalLM | 64 | 1.0019 | 0.9109 | 0.6825 | 0.9074 | 1.0099 | 1.0416 |
| TrOCRForCausalLM | 32 | 1.0018 | 0.9533 | 0.7335 | 0.9471 | 1.0034 | 1.0144 |
| MBartForCausalLM | 32 | 1.0016 | 0.9526 | 0.7318 | 0.0 | 0.9988 | 1.0072 |
| PegasusForCausalLM | 32 | 0.9997 | 0.9521 | 0.7281 | 0.9455 | 0.9908 | 1.0014 |
| AllenaiLongformerBase | 1 | 0.9339 | 0.85 | 0.7752 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForMaskedLM | 4 | 4.8471 | 10.2157 | 33.4302 | 80.4053 | 93.7265 | 33.0308 |
| DebertaForQuestionAnswering | 8 | 4.7792 | 10.3188 | 33.9168 | 79.0327 | 90.8885 | 33.9877 |
| XGLMForCausalLM | 8 | 2.4818 | 10.1556 | 22.0581 | 245.2264 | 67.3309 | 65.0147 |
| M2M100ForConditionalGeneration | 8 | 2.9979 | 12.0702 | 20.0821 | 317.6287 | 56.3851 | 54.0961 |
| MobileBertForMaskedLM | 32 | 8.1686 | 23.905 | 41.5174 | nan | 48.6315 | 48.1537 |
| MobileBertForQuestionAnswering | 64 | 8.4347 | 23.6359 | 41.7267 | nan | 47.9846 | 47.1236 |
| PegasusForConditionalGeneration | 16 | 2.8373 | 12.1133 | 20.8106 | 355.5235 | 42.3267 | 39.1191 |
| BartForConditionalGeneration | 2 | 3.0585 | 12.3393 | nan | 329.3429 | 42.1796 | 40.1639 |
| MBartForConditionalGeneration | 16 | 3.0151 | 12.453 | 21.6777 | nan | 41.0491 | 40.3873 |
| YituTechConvBert | 1 | 2.2506 | 8.2934 | 12.7453 | 137.8478 | 39.621 | 36.7736 |
| BigBird | 1 | 7.5308 | 12.6636 | 25.1926 | nan | 38.3284 | 24.3887 |
| MegatronBertForCausalLM | 16 | 3.2955 | 10.5543 | 17.2987 | 250.4064 | 32.7008 | 31.3638 |
| MegatronBertForQuestionAnswering | 16 | 3.1445 | 10.5425 | 17.5 | 249.5424 | 31.8934 | 30.5789 |
| MT5ForConditionalGeneration | 8 | 3.7285 | 11.0703 | 18.1745 | 140.9568 | 31.7799 | 30.4261 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.8808 | 8.3771 | nan | 200.3353 | 29.0649 | 27.9055 |
| T5ForConditionalGeneration | 4 | 2.3868 | 7.8539 | 11.9274 | 90.2675 | 28.9521 | 28.1485 |
| T5Small | 1 | 2.3623 | 7.8398 | 11.3304 | 90.8521 | 28.0045 | 27.5485 |
| LayoutLMForSequenceClassification | 16 | 1.9576 | 5.7289 | 9.1147 | 101.1574 | 27.0433 | 25.9872 |
| PLBartForConditionalGeneration | 16 | 1.6125 | 6.6283 | 10.2276 | 140.0065 | 25.8098 | 25.2659 |
| ElectraForCausalLM | 32 | 1.5474 | 5.3342 | nan | 102.5381 | 25.6255 | 23.8514 |
| GoogleFnet | 1 | 0.9509 | 2.8484 | 10.7191 | nan | 21.7553 | 13.8829 |
| PegasusForCausalLM | 32 | 1.1675 | 4.8397 | 8.1969 | 103.9942 | 21.3784 | 19.3866 |
| MBartForCausalLM | 32 | 1.1579 | 4.6156 | 7.5042 | nan | 20.8173 | 20.3006 |
| LayoutLMForMaskedLM | 16 | 1.8608 | 5.792 | nan | 106.3746 | 20.6033 | 19.8601 |
| BertForMaskedLM | 64 | 1.4448 | 5.2977 | 8.0637 | 108.1399 | 19.8694 | 18.8434 |
| ElectraForQuestionAnswering | 64 | 1.5004 | 5.4545 | nan | 102.1465 | 19.3585 | 18.8843 |
| BertForQuestionAnswering | 128 | 1.4983 | 5.2476 | nan | 103.8848 | 19.3224 | 18.3602 |
| TrOCRForCausalLM | 32 | 1.0965 | 4.7812 | 7.3857 | 97.1349 | 19.2529 | 18.5904 |
| BartForCausalLM | 4 | 1.1571 | 4.6797 | 7.3466 | 101.8442 | 19.0784 | 18.336 |
| RobertaForCausalLM | 64 | 1.5411 | 5.486 | 8.4298 | 106.522 | 18.922 | 18.2183 |
| RobertaForQuestionAnswering | 128 | 1.4937 | 5.6677 | nan | 101.7783 | 18.1796 | 17.3282 |
| CamemBert | 1 | 1.5437 | 5.4371 | 7.6584 | 101.9501 | 18.0867 | 17.6337 |
| OPTForCausalLM | 32 | 1.2608 | 4.9045 | 9.1904 | 93.3834 | 16.891 | 16.4702 |
| AlbertForMaskedLM | 4 | 1.0443 | 4.7109 | nan | 113.4743 | 16.4078 | 15.1718 |
| GPT2ForSequenceClassification | 4 | 1.4915 | 5.1097 | nan | 84.5624 | 16.2409 | 15.6826 |
| AlbertForQuestionAnswering | 4 | 1.2817 | 4.7001 | nan | 109.48 | 16.065 | 15.0931 |
| BlenderbotSmallForCausalLM | 64 | 0.7711 | 3.2726 | 5.0345 | 62.393 | 15.0954 | 13.7339 |
| Speech2Text2ForCausalLM | 128 | 0.7109 | 2.6487 | 4.4939 | 44.0811 | 13.9155 | 13.5659 |
| PLBartForCausalLM | 32 | 0.6442 | 2.5492 | 3.9785 | 49.1263 | 13.3668 | 12.9602 |
| DistillGPT2 | 1 | 0.7913 | 2.6323 | 3.8856 | nan | 12.9831 | 11.7076 |
| DistilBertForMaskedLM | 64 | 0.6168 | 2.5731 | 4.704 | 48.8701 | 11.5146 | 10.9195 |
| DistilBertForQuestionAnswering | 64 | 0.6277 | 2.5936 | 4.9447 | 45.1509 | 10.999 | 10.2 |
| AllenaiLongformerBase | 1 | 6.1661 | 13.122 | 56.3507 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 1.1727 | 1.0595 | 1.1224 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9029 | 0.3414 | 0.9118 | 0.8453 | 1.0606 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | 1.0877 | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | 0.842 | 1.3737 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.972 | 0.8215 | 1.1049 |
| BigBird | 1 | 0.9979 | 0.9534 | 0.4201 | nan | 0.821 | 1.0085 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9267 | 0.3971 | 0.9742 | 0.8157 | 0.9642 |
| DistillGPT2 | 1 | 0.9984 | 0.8113 | 0.3765 | nan | 0.8057 | 0.9257 |
| M2M100ForConditionalGeneration | 8 | 0.9866 | 0.9611 | 0.3866 | 1.0078 | 0.8055 | 1.0093 |
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.844 | 0.7929 | 0.9036 |
| YituTechConvBert | 1 | 0.9863 | 0.8581 | 0.3682 | 0.8984 | 0.7898 | 0.8725 |
| PegasusForCausalLM | 32 | 0.9594 | 0.8885 | 0.3909 | 0.9963 | 0.7774 | 0.931 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.7734 | 0.9515 |
| GoogleFnet | 1 | 0.9979 | 0.9451 | 0.3715 | nan | 0.7698 | 0.9372 |
| MT5ForConditionalGeneration | 8 | 1.0037 | 0.8873 | 0.4151 | 0.9335 | 0.763 | 0.9406 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | 0.9908 | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3613 | 0.8613 | 0.7487 | 0.9186 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.9443 | 0.7381 | 0.9055 |
| PLBartForConditionalGeneration | 16 | 0.9998 | 0.8959 | 0.358 | 1.0146 | 0.7238 | 0.9373 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | 0.3438 | nan | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.7189 | 1.0294 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | 0.9984 | 0.7161 | 0.9247 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 1.1607 | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3177 | 1.1317 | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 1.0067 | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3743 | nan | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | 0.3743 | 0.9997 | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.908 | 0.6775 | 0.9179 |
| OPTForCausalLM | 32 | 0.9982 | 0.8656 | 0.3607 | 0.9159 | 0.6763 | 0.8849 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.888 | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.9904 | 0.6385 | 0.8992 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3641 | 0.989 | 0.6375 | 0.8974 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | 0.6329 | 0.8939 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | 0.3242 | nan | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | 0.2587 | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3551 | 0.9719 | 0.386 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.1591 | 0.2902 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9982 | 0.9521 | 0.3207 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
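
For anyone who wants to summarize one of these tables into a single number per backend, here is a minimal sketch. It is not the dashboard's own aggregation code. The helper names `parse_table` and `geomean_speedup` are hypothetical, and the sketch assumes that a speedup of `0.0` or `nan` marks a run that did not complete, so those rows are skipped. The sample is a trimmed excerpt of the huggingface speedup table with only two backend columns kept.

```python
import math


def parse_table(text):
    """Parse a pipe-delimited ASCII table into a list of dict rows."""
    rows = []
    header = None
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ border lines and ~~~ fences
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def geomean_speedup(rows, backend):
    """Geometric mean of `backend` over rows with a positive, finite value.

    Assumption: 0.0 and nan entries are failed runs and are excluded.
    """
    vals = []
    for r in rows:
        try:
            v = float(r[backend])
        except (KeyError, ValueError):
            continue
        if v > 0 and math.isfinite(v):
            vals.append(v)
    if not vals:
        return float("nan")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


# Trimmed excerpt of the huggingface "Performance speedup" table.
sample = """
+-----------------------+----+--------+----------+
| name | bs | eager | inductor |
+-----------------------+----+--------+----------+
| YituTechConvBert | 1 | 1.0277 | 3.235 |
| CamemBert | 1 | 1.0496 | 2.3743 |
| AllenaiLongformerBase | 1 | 0.9339 | 0.0 |
+-----------------------+----+--------+----------+
"""

rows = parse_table(sample)
inductor_geomean = geomean_speedup(rows, "inductor")
print(f"inductor geomean over {len(rows)} rows: {inductor_geomean:.4f}")
```

A geometric mean is used here because the values are ratios. Whether failed runs should be dropped or counted as 1.0 is a policy choice that changes the result, so it is worth stating whichever one is used.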

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100 | 128 | 0.999 | 0.9732 | 0.8285 | 1.2911 | 1.8678 | 1.8255 |
| lcnet_050 | 128 | 0.9553 | 0.9499 | 0.7513 | 1.2877 | 1.6346 | 1.6346 |
| regnety_002 | 128 | 0.9795 | 0.9995 | 0.8606 | 0.9668 | 1.5041 | 1.3293 |
| dm_nfnet_f0 | 128 | 0.9995 | 1.0003 | 0.0 | 1.1378 | 1.4733 | 1.4236 |
| hrnet_w18 | 128 | 0.9999 | 0.9977 | 0.0 | 1.2841 | 1.4154 | 1.3772 |
| dla102 | 128 | 1.0 | 1.0006 | 0.0 | 1.2837 | 1.3845 | 1.3656 |
| volo_d1_224 | 64 | 0.9997 | 0.9944 | 0.8015 | 0.0 | 1.3839 | 1.3595 |
| nfnet_l0 | 128 | 1.0002 | 0.7885 | 0.0 | 1.1046 | 1.3732 | 1.328 |
| res2net50_14w_8s | 128 | 1.0 | 0.9987 | 0.0 | 1.2465 | 1.3575 | 1.3243 |
| xcit_large_24_p8_224 | 5 | 1.0031 | 0.9828 | 0.7753 | 0.0 | 1.3543 | 1.3085 |
| mobilenetv2_100 | 128 | 0.9663 | 0.9636 | 0.7066 | 1.2863 | 1.3405 | 1.3537 |
| mobilenetv3_large_100 | 128 | 0.9653 | 0.9624 | 0.766 | 1.2862 | 1.337 | 1.3483 |
| coat_lite_mini | 128 | 1.0 | 0.9829 | 0.8338 | 1.0761 | 1.3328 | 1.3192 |
| adv_inception_v3 | 128 | 1.0001 | 0.9984 | 0.0 | 1.1297 | 1.3276 | 1.306 |
| gluon_inception_v3 | 128 | 1.0 | 0.9984 | 0.0 | 1.1291 | 1.3271 | 1.3088 |
| inception_v3 | 128 | 1.0 | 0.9986 | 0.0 | 1.1291 | 1.3264 | 1.3084 |
| crossvit_9_240 | 128 | 0.9997 | 0.9942 | 0.7607 | 1.0384 | 1.3253 | 1.301 |
| resnest101e | 64 | 0.9999 | 1.0029 | 0.0 | 1.169 | 1.313 | 1.2705 |
| res2next50 | 128 | 0.9999 | 1.0012 | 0.0 | 1.1811 | 1.3101 | 1.2739 |
| fbnetv3_b | 128 | 0.9648 | 0.961 | 0.7626 | 1.2422 | 1.2832 | 1.2992 |
| jx_nest_base | 32 | 1.0 | 0.9962 | 0.7365 | 0.0 | 1.2773 | 1.2432 |
| mnasnet_100 | 128 | 0.9663 | 0.9636 | 0.7862 | 1.2515 | 1.2666 | 1.2815 |
| selecsls42b | 128 | 0.9997 | 0.999 | 0.8166 | 1.2149 | 1.266 | 1.2518 |
| sebotnet33ts_256 | 64 | 0.9758 | 0.8047 | 0.0 | 0.0 | 1.2652 | 1.2683 |
| eca_botnext26ts_256 | 128 | 0.9863 | 0.7719 | 0.0 | 0.0 | 1.2624 | 1.2527 |
| eca_halonext26ts | 128 | 0.9875 | 0.7779 | 0.0 | 0.0 | 1.2591 | 1.2454 |
| botnet26t_256 | 128 | 0.986 | 0.9817 | 0.7861 | 0.0 | 1.258 | 1.251 |
| tf_efficientnet_b0 | 128 | 0.977 | 0.7835 | 0.0 | 1.1647 | 1.2562 | 1.2693 |
| gmixer_24_224 | 128 | 1.0 | 0.8091 | 0.0 | 1.0442 | 1.2512 | 1.2386 |
| fbnetc_100 | 128 | 0.9661 | 0.9627 | 0.792 | 1.2459 | 1.2485 | 1.2651 |
| ese_vovnet19b_dw | 128 | 0.978 | 0.978 | 0.7452 | 1.1492 | 1.2426 | 1.2456 |
| spnasnet_100 | 128 | 0.9621 | 0.9572 | 0.7746 | 1.2088 | 1.2379 | 1.2543 |
| cspdarknet53 | 64 | 0.9588 | 0.9551 | 0.7379 | 1.1739 | 1.2305 | 1.2394 |
| res2net101_26w_4s | 64 | 0.9997 | 0.9977 | 0.774 | 1.1065 | 1.2278 | 1.1886 |
| rexnet_100 | 128 | 0.9725 | 0.8168 | 0.0 | 1.1608 | 1.2136 | 1.2184 |
| pnasnet5large | 16 | 0.9995 | 0.9982 | 0.0 | 1.0895 | 1.2084 | 1.1933 |
| twins_pcpvt_base | 64 | 0.9995 | 0.9982 | 0.7474 | 0.0 | 1.2058 | 1.176 |
| convit_base | 64 | 0.9997 | 0.9976 | 0.0 | 0.0 | 1.2034 | 1.1958 |
| gmlp_s16_224 | 128 | 0.9996 | 0.9497 | 0.0 | 1.0351 | 1.2023 | 1.1878 |
| dpn107 | 32 | 0.9575 | 0.9507 | 0.778 | 1.0272 | 1.1925 | 1.2011 |
| tinynet_a | 128 | 0.9659 | 0.7747 | 0.6211 | 1.1486 | 1.1912 | 1.2005 |
| pit_b_224 | 64 | 1.0004 | 0.9988 | 0.0 | 1.0321 | 1.1868 | 1.1765 |
| cait_m36_384 | 4 | 1.0003 | 1.0262 | 0.0 | 0.0 | 1.1796 | 1.1565 |
| repvgg_a2 | 128 | 0.9652 | 0.9631 | 0.8241 | 1.1384 | 1.1726 | 1.1691 |
| mobilevit_s | 64 | 0.9794 | 0.7617 | 0.0 | 0.0 | 1.1702 | 1.1651 |
| tf_mixnet_l | 128 | 0.9852 | 0.8893 | 0.0 | 1.0947 | 1.1685 | 1.1662 |
| poolformer_m36 | 64 | 0.9998 | 0.9983 | 0.0 | 0.0 | 1.1644 | 1.1468 |
| mixnet_l | 128 | 0.9851 | 0.8858 | 0.0 | 1.0985 | 1.1511 | 1.149 |
| swin_base_patch4_window7_224 | 64 | 1.0 | 0.9798 | 0.0 | 0.0 | 1.1413 | 1.1297 |
| beit_base_patch16_224 | 64 | 0.9996 | 0.9819 | 0.0 | 0.0 | 1.1155 | 1.1031 |
| swsl_resnext101_32x16d | 32 | 0.9998 | 0.9979 | 0.0 | 1.1086 | 1.1075 | 1.072 |
| deit_base_distilled_patch16_224 | 64 | 0.9996 | 0.9981 | 0.7699 | 0.9808 | 1.0929 | 1.0806 |
| gluon_xception65 | 32 | 0.9998 | 0.9972 | 0.0 | 1.0805 | 1.086 | 1.075 |
| vit_base_patch16_224 | 64 | 1.0 | 0.9979 | 0.769 | 0.9514 | 1.0848 | 1.0735 |
| convmixer_768_32 | 32 | 0.9999 | 0.9997 | 0.0 | 0.0 | 1.0775 | 1.0743 |
| convnext_base | 64 | 1.0003 | 0.9988 | 0.0 | 0.0 | 1.0762 | 1.0779 |
| gernet_l | 128 | 0.974 | 0.9725 | 0.8236 | 1.0983 | 1.0757 | 1.0723 |
| mixer_b16_224 | 128 | 1.0 | 0.9792 | 0.0 | 0.8778 | 1.0687 | 1.0623 |
| visformer_small | 128 | 1.0001 | 1.0016 | 0.7974 | 0.0 | 1.0445 | 1.0102 |
| resmlp_12_224 | 128 | 1.0 | 0.8548 | 0.6124 | 1.0056 | 0.8188 | 0.8155 |
| tnt_s_patch16_224 | 128 | 1.0 | 0.9992 | 0.0 | 0.0 | 0.0 | 1.5389 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass |
| convit_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 5.6461 | 24.0839 | nan | 673.3057 | 99.0519 | 92.7813 |
| swin_base_patch4_window7_224 | 64 | 2.5674 | 10.3212 | nan | nan | 75.5225 | 73.6368 |
| mobilevit_s | 64 | 1.6089 | 6.0445 | nan | nan | 72.1763 | 71.3548 |
| pnasnet5large | 16 | 4.5522 | 17.7938 | nan | 349.4073 | 71.379 | 66.4272 |
| xcit_large_24_p8_224 | 5 | 2.658 | 13.7833 | 26.399 | nan | 71.1165 | 67.8736 |
| twins_pcpvt_base | 64 | 2.137 | 10.6004 | 19.3549 | nan | 61.6084 | 60.2818 |
| cait_m36_384 | 4 | 2.7922 | 14.9433 | nan | nan | 60.2762 | 58.832 |
| convnext_base | 64 | 1.2811 | 5.0361 | nan | nan | 60.0876 | 57.5659 |
| resnest101e | 64 | 2.991 | 12.5814 | nan | 282.3439 | 55.7929 | 54.048 |
| jx_nest_base | 32 | 1.6631 | 7.5264 | 12.4356 | nan | 52.4175 | 51.9998 |
| res2net101_26w_4s | 64 | 2.8719 | 13.2419 | 22.7396 | 247.9385 | 51.2294 | 48.326 |
| sebotnet33ts_256 | 64 | 1.5883 | 5.1369 | nan | nan | 47.8314 | 45.7374 |
| coat_lite_mini | 128 | 1.1402 | 4.1176 | 6.6556 | 93.4558 | 47.5204 | 47.1375 |
| eca_halonext26ts | 128 | 1.3512 | 4.4368 | nan | nan | 46.6614 | 45.6541 |
| res2net50_14w_8s | 128 | 2.5726 | 11.8755 | nan | 261.1243 | 46.5747 | 45.3549 |
| poolformer_m36 | 64 | 1.8247 | 7.1318 | nan | nan | 44.5984 | 42.5445 |
| eca_botnext26ts_256 | 128 | 1.2878 | 4.2465 | nan | nan | 39.3633 | 37.451 |
| gmlp_s16_224 | 128 | 0.9728 | 5.1611 | nan | 154.4669 | 38.4799 | 36.8124 |
| dpn107 | 32 | 3.9402 | 11.741 | 36.3292 | 169.275 | 37.1079 | 34.8475 |
| botnet26t_256 | 128 | 1.3255 | 3.7026 | 8.9379 | nan | 35.8759 | 35.329 |
| crossvit_9_240 | 128 | 1.3961 | 6.6179 | 10.4325 | 176.9135 | 35.7374 | 34.0672 |
| volo_d1_224 | 64 | 1.1897 | 6.0436 | 9.9749 | nan | 35.2074 | 33.0707 |
| fbnetv3_b | 128 | 3.0863 | 9.594 | 24.3095 | 234.0961 | 34.8062 | 32.4641 |
| gluon_xception65 | 32 | 1.771 | 8.8104 | nan | 154.4863 | 34.5784 | 32.0421 |
| gluon_inception_v3 | 128 | 1.5687 | 6.8062 | nan | 146.327 | 32.065 | 29.9128 |
| ghostnet_100 | 128 | 2.6885 | 8.1195 | 11.8343 | 162.4053 | 31.8066 | 29.5013 |
| adv_inception_v3 | 128 | 1.5424 | 6.6947 | nan | 141.0937 | 31.7185 | 30.4618 |
| inception_v3 | 128 | 1.4921 | 6.755 | nan | 146.6623 | 31.6417 | 30.4386 |
| tf_mixnet_l | 128 | 5.6576 | 11.3321 | nan | 155.1872 | 31.189 | 29.4845 |
| mixnet_l | 128 | 5.3092 | 10.7561 | nan | 154.5902 | 29.8625 | 27.8725 |
| gmixer_24_224 | 128 | 1.0528 | 5.8573 | nan | 137.5712 | 29.5316 | 27.9999 |
| dla102 | 128 | 1.6675 | 7.5133 | nan | 177.2127 | 29.4793 | 28.4769 |
| swsl_resnext101_32x16d | 32 | 1.6217 | 7.592 | nan | 125.9041 | 28.9292 | 27.0366 |
| dm_nfnet_f0 | 128 | 2.0532 | 6.1889 | nan | 156.0689 | 28.5008 | 27.3986 |
| convit_base | 64 | 1.0603 | 4.9419 | nan | nan | 28.0864 | 26.2091 |
| res2next50 | 128 | 1.4776 | 6.5715 | nan | 159.8804 | 27.3983 | 25.6061 |
| rexnet_100 | 128 | 1.7944 | 6.186 | nan | 143.3703 | 25.5497 | 24.1839 |
| tinynet_a | 128 | 1.9687 | 6.6642 | 17.5077 | 143.0197 | 24.7893 | 23.3333 |
| resmlp_12_224 | 128 | 0.6192 | 2.2697 | 3.922 | 32.4958 | 22.4314 | 21.3543 |
| cspdarknet53 | 64 | 2.3176 | 6.2027 | 16.6579 | 119.4252 | 21.947 | 20.6265 |
| tf_efficientnet_b0 | 128 | 1.7328 | 6.0425 | nan | 128.2825 | 21.8979 | 20.886 |
| visformer_small | 128 | 0.9096 | 3.4279 | 5.4107 | nan | 21.5852 | 21.2123 |
| mixer_b16_224 | 128 | 0.7324 | 2.7119 | nan | 73.837 | 21.2591 | 20.5871 |
| convmixer_768_32 | 32 | 1.1509 | 4.8568 | nan | nan | 21.1924 | 19.6681 |
| fbnetc_100 | 128 | 1.9999 | 5.4443 | 15.2768 | 107.9298 | 21.1183 | 19.6338 |
| spnasnet_100 | 128 | 1.8961 | 5.4586 | 15.1921 | 112.2471 | 20.9848 | 19.347 |
| nfnet_l0 | 128 | 1.7536 | 6.1228 | nan | 132.412 | 20.8487 | 19.448 |
| beit_base_patch16_224 | 64 | 1.2459 | 4.216 | nan | nan | 19.329 | 18.2012 |
| mobilenetv3_large_100 | 128 | 1.7028 | 4.7755 | 11.5878 | 121.1619 | 19.2069 | 18.2072 |
| deit_base_distilled_patch16_224 | 64 | 0.8463 | 3.6103 | 5.5575 | 74.8667 | 19.0401 | 18.3408 |
| vit_base_patch16_224 | 64 | 0.8281 | 3.4414 | 5.5484 | 73.1389 | 18.41 | 17.9216 |
| mobilenetv2_100 | 128 | 1.6408 | 4.7199 | 11.7187 | 100.6769 | 18.0047 | 17.5493 |
| repvgg_a2 | 128 | 1.9123 | 5.0811 | 13.9779 | 155.151 | 17.7554 | 16.9937 |
| pit_b_224 | 64 | 0.8726 | 3.8313 | nan | 97.6379 | 17.5527 | 16.5443 |
| mnasnet_100 | 128 | 1.6788 | 4.4519 | 11.8814 | 90.6505 | 17.529 | 16.6986 |
| gernet_l | 128 | 1.8828 | 5.1022 | 13.5963 | 90.3277 | 17.3164 | 16.2639 |
| regnety_002 | 128 | 1.5221 | 4.4927 | 11.3401 | 91.4548 | 17.0078 | 16.7753 |
| selecsls42b | 128 | 0.7654 | 3.0297 | 4.8695 | 73.8437 | 15.3673 | 14.7897 |
| lcnet_050 | 128 | 1.053 | 2.8161 | 7.037 | 65.9257 | 13.4091 | 11.9766 |
| ese_vovnet19b_dw | 128 | 1.0985 | 2.5627 | 5.9381 | 54.6904 | 12.3632 | 11.5268 |
| tnt_s_patch16_224 | 128 | 1.5489 | 8.4074 | nan | nan | nan | 31.3326 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | 1.4758 | 1.5552 | 1.6267 |
| tinynet_a | 128 | 0.9942 | 0.7796 | 0.2616 | 0.9898 | 1.351 | 1.5843 |
| nfnet_l0 | 128 | 0.9931 | 0.8274 | nan | 0.7759 | 1.2911 | 1.4945 |
| rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2619 | 1.4738 |
| tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.2059 | 1.3819 |
| mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 
| | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1918 | 1.1771 | 1.3424 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1752 | 1.2828 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7674 | nan | nan | 1.1378 | 1.3608 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1376 | 1.3401 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1184 | 1.1751 | | poolformer_m36 | 64 | 0.9979 | 0.9511 | nan | nan | 1.0527 | 1.0689 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8935 | nan | 0.7593 | 1.0218 | 1.0961 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9962 | 0.9435 | 0.3153 | 1.2305 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.9926 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9963 | 0.9441 | 0.3137 | 1.2337 | 0.9926 | 1.0799 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3132 | nan | 0.9924 | 1.0856 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 0.9853 | 1.1265 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9837 | 1.0658 | | mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | 1.4125 | 0.9827 | 1.0539 | | tf_mixnet_l | 128 | 0.9953 | 0.8572 | nan | 0.8574 | 0.9769 | 1.1451 | | gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | 0.9833 | 0.9766 | 0.9827 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | 0.3269 | nan | 0.9633 | 1.0572 | | dla102 | 128 | 0.9831 | 0.9169 | nan | 0.953 | 0.9632 | 1.0419 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 0.952 | 1.0925 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.942 | 0.9938 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9408 | 1.0412 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9382 | 0.993 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8646 | 0.9379 | 1.0122 | | jx_nest_base | 32 | 
1.0003 | 0.8968 | 0.2864 | nan | 0.9348 | 1.0604 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9325 | 0.9919 | | res2net101_26w_4s | 64 | 0.9967 | 0.9277 | 0.3243 | 0.8933 | 0.9285 | 1.015 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.9152 | 0.9655 | | adv_inception_v3 | 128 | 0.9902 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0634 | | inception_v3 | 128 | 0.9902 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0635 | | gluon_inception_v3 | 128 | 0.9902 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0634 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.9126 | 0.9981 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9078 | 1.0156 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9069 | 1.0516 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9069 | 1.0618 | | dpn107 | 32 | 0.9985 | 0.9272 | 0.3392 | 0.8941 | 0.9059 | 0.9905 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9052 | 1.0666 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9049 | 0.9968 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.994 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.899 | 1.0046 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8973 | nan | 0.8676 | 0.8932 | 0.9946 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8822 | 1.0206 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8617 | 1.0396 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8563 | 1.0753 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7085 | nan | nan | 0.841 | 0.9709 | | coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | 0.9857 | 0.821 | 1.0246 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9926 | | resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | 0.8133 | 0.7899 | 0.7979 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.657 | 0.7684 | 0.9902 | | convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.7462 | 
0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8656 | 0.282 | 1.1496 | 0.6584 | 0.8853 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8623 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/zTzs6Kt.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/FmGxTBN.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/pB0JOmP.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
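As a rough illustration of the accuracy check and the filtering described above, here is a framework-free sketch. Flat lists of floats stand in for tensors, and the function names and tolerances (`rel_tol`, `abs_tol`) are assumptions for illustration, not what the benchmark harness actually uses.

```python
import math


def matches_reference(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Return True if every candidate value is close to its reference value.

    The tolerances are placeholder assumptions, not the harness defaults.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def accuracy_status(ref_outputs, ref_grads, outputs, grads):
    """Compare forward outputs and gradients against the native (eager) run."""
    ok = matches_reference(ref_outputs, outputs) and matches_reference(ref_grads, grads)
    return "pass" if ok else "fail_accuracy"


def keep_passing(metrics, statuses):
    """Drop models that did not pass accuracy before aggregating a metric."""
    return {name: value for name, value in metrics.items() if statuses.get(name) == "pass"}
```

Filtering first means a backend cannot look faster simply because it produced wrong results quickly.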

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 100%, 42/42 | 100%, 61/61 |
|       aot_eager        | 98%, 53/54 | 100%, 42/42 | 95%, 58/61  |
|     aot_cudagraphs     | 89%, 48/54 | 90%, 38/42  | 90%, 55/61  |
|    nvprims_nvfuser     | 59%, 32/54 |  12%, 5/42  | 54%, 33/61  |
|        inductor        | 85%, 46/54 | 93%, 39/42  | 93%, 57/61  |
| inductor_no_cudagraphs | 91%, 49/54 | 93%, 39/42  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
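The `NN%, passed/total` cells above can be produced along these lines. This is a minimal sketch: the function name is hypothetical, and treating any status that starts with `pass` (including `pass_due_to_skip`) as a pass is an assumption, not something the dashboard states.

```python
def format_passrate(statuses):
    """Format a list of per-model accuracy statuses as 'NN%, passed/total'."""
    total = len(statuses)
    if total == 0:
        raise ValueError("no models to summarize")
    # Assumption: every status beginning with "pass" counts as a pass.
    passed = sum(1 for status in statuses if status.startswith("pass"))
    return f"{100 * passed / total:.0f}%, {passed}/{total}"


print(format_passrate(["pass"] * 52 + ["fail_to_run"] * 2))  # 96%, 52/54
```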

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.21x    |    1.11x    |    1.00x    |
|    nvprims_nvfuser     |   1.01x    |    1.01x    |    1.08x    |
|        inductor        |   1.89x    |    1.77x    |    1.41x    |
| inductor_no_cudagraphs |   1.36x    |    1.54x    |    1.36x    |
+------------------------+------------+-------------+-------------+
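For readers who want to reproduce a geometric mean from a per-model speedup column, here is a small sketch. It assumes that `0.0` and `nan` entries mark failed runs and are dropped before aggregating; the dashboard does not spell out that handling, and the function name is hypothetical.

```python
import math
from statistics import geometric_mean


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups, ignoring failed runs.

    Assumption: 0.0 and nan mark models that failed to run and are skipped.
    """
    valid = [s for s in speedups if not math.isnan(s) and s > 0.0]
    return geometric_mean(valid) if valid else float("nan")


# Illustrative input only: three torchbench inductor speedups from this report
# (densenet121, resnet18, alexnet), not the full column.
sample = [5.9353, 2.7582, 1.2103]
print(f"{geomean_speedup(sample):.2f}x")
```

The geometric mean is the natural choice for ratios because a 2x speedup and a 0.5x slowdown average to 1.0x rather than 1.25x.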

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.41    |    2.86     |    2.14     |
|       aot_eager        |    7.01    |    10.27    |    8.57     |
|     aot_cudagraphs     |   11.24    |    19.07    |    15.95    |
|    nvprims_nvfuser     |   70.53    |   158.06    |   147.88    |
|        inductor        |   32.72    |    34.86    |    39.84    |
| inductor_no_cudagraphs |   32.13    |    30.21    |    38.38    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.90x    |    0.87x    |
|     aot_cudagraphs     |   0.41x    |    0.38x    |    0.33x    |
|    nvprims_nvfuser     |   0.84x    |    1.06x    |    0.85x    |
|        inductor        |   0.85x    |    0.88x    |    0.95x    |
| inductor_no_cudagraphs |   0.96x    |    1.05x    |    1.06x    |
+------------------------+------------+-------------+-------------+
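A short sketch of how a compression ratio could be computed, assuming it is defined as native pytorch peak memory divided by the backend's peak memory. That definition is consistent with "higher is better" but is an assumption here, and the function and parameter names are hypothetical.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Peak memory of the native run divided by peak memory of the backend run.

    A value above 1.0 means the backend used less peak memory than native pytorch.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical numbers: native run peaks at 10 GB, backend run at 8 GB.
print(f"{compression_ratio(10e9, 8e9):.2f}x")  # 1.25x
```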

Warnings

Performance speedup warnings
~~~
+-------------+-----------------------+----------+------------------------+
| suite       | name                  | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  | dlrm                  | 0.0      | 1.1582                 |
| torchbench  | hf_GPT2_large         | 0.0      | 1.8635                 |
| torchbench  | tacotron2             | 0.0      | 0.8663                 |
| torchbench  | hf_Longformer         | 0.0      | 0.0                    |
| torchbench  | moco                  | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase | 0.0      | 0.0                    |
| timm_models | convnext_base         | 0.6781   | 0.6551                 |
| timm_models | eca_halonext26ts      | 0.0      | 0.0                    |
+-------------+-----------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-----------------------+----------+------------------------+
| suite       | name                  | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  | yolov3                | 408.5615 | 403.3238               |
| torchbench  | timm_efficientdet     | 136.5363 | 133.8119               |
| torchbench  | tacotron2             | nan      | 64.891                 |
| torchbench  | hf_GPT2_large         | nan      | 51.6864                |
| torchbench  | dlrm                  | nan      | 3.2999                 |
| torchbench  | hf_Longformer         | nan      | nan                    |
| torchbench  | moco                  | nan      | nan                    |
| huggingface | AllenaiLongformerBase | nan      | nan                    |
| timm_models | eca_halonext26ts      | nan      | nan                    |
+-------------+-----------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+----------------------------------+----------+------------------------+
| suite       | name                             | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench  | timm_vision_transformer_large    | 0.879    | 0.9542                 |
| torchbench  | timm_resnest                     | 0.8759   | 0.9953                 |
| torchbench  | densenet121                      | 0.8753   | 1.0051                 |
| torchbench  | hf_Bert                          | 0.8738   | 0.9419                 |
| torchbench  | squeezenet1_1                    | 0.8735   | 1.0608                 |
| torchbench  | shufflenet_v2_x1_0               | 0.869    | 0.993                  |
| torchbench  | fastNLP_Bert                     | 0.8661   | 1.0682                 |
| torchbench  | resnet50                         | 0.8657   | 0.885                  |
| torchbench  | hf_T5_large                      | 0.8541   | 0.8541                 |
| torchbench  | hf_DistilBert                    | 0.8383   | 0.9051                 |
| torchbench  | dcgan                            | 0.8283   | 0.9695                 |
| torchbench  | hf_Bart                          | 0.8231   | 0.9873                 |
| torchbench  | hf_BigBird                       | 0.8109   | 1.0963                 |
| torchbench  | alexnet                          | 0.7973   | 1.0079                 |
| torchbench  | mobilenet_v3_large               | 0.791    | 0.8143                 |
| torchbench  | pytorch_stargan                  | 0.7801   | 0.8859                 |
| torchbench  | timm_vovnet                      | 0.7799   | 0.8875                 |
| torchbench  | resnext50_32x4d                  | 0.7647   | 0.775                  |
| torchbench  | vgg16                            | 0.7633   | 1.0588                 |
| torchbench  | mnasnet1_0                       | 0.7541   | 0.7741                 |
| torchbench  | drq                              | 0.752    | 0.9256                 |
| torchbench  | LearningToPaint                  | 0.7295   | 0.9035                 |
| torchbench  | soft_actor_critic                | 0.7295   | 1.0367                 |
| torchbench  | timm_vision_transformer          | 0.7139   | 0.724                  |
| torchbench  | resnet18                         | 0.6102   | 0.6257                 |
| torchbench  | lennard_jones                    | 0.564    | 0.9991                 |
| torchbench  | nvidia_deeprecommender           | 0.5596   | 0.5596                 |
| torchbench  | hf_Reformer                      | 0.5295   | 0.9885                 |
| torchbench  | functorch_dp_cifar10             | 0.4478   | 0.4806                 |
| torchbench  | pytorch_struct                   | 0.4235   | 0.4353                 |
| torchbench  | hf_GPT2_large                    | nan      | 1.1354                 |
| torchbench  | dlrm                             | nan      | 0.7306                 |
| torchbench  | tacotron2                        | nan      | 0.4113                 |
| torchbench  | hf_Longformer                    | nan      | nan                    |
| torchbench  | moco                             | nan      | nan                    |
| huggingface | MegatronBertForQuestionAnswering | 0.893    | 1.0179                 |
| huggingface | MegatronBertForCausalLM          | 0.8918   | 1.0275                 |
| huggingface | PLBartForConditionalGeneration   | 0.8848   | 1.028                  |
| huggingface | DistilBertForMaskedLM            | 0.8803   | 0.948                  |
| huggingface | MT5ForConditionalGeneration      | 0.8756   | 0.9197                 |
| huggingface | Speech2Text2ForCausalLM          | 0.8691   | 0.9801                 |
| huggingface | ElectraForCausalLM               | 0.856    | 0.9327                 |
| huggingface | PLBartForCausalLM                | 0.8546   | 0.9358                 |
| huggingface | BlenderbotSmallForCausalLM       | 0.846    | 0.9426                 |
| huggingface | BigBird                          | 0.8172   | 1.0883                 |
| huggingface | CamemBert                        | 0.8062   | 0.9318                 |
| huggingface | XGLMForCausalLM                  | 0.8055   | 0.9905                 |
| huggingface | DistillGPT2                      | 0.8      | 1.0172                 |
| huggingface | YituTechConvBert                 | 0.7879   | 0.9269                 |
| huggingface | M2M100ForConditionalGeneration   | 0.7829   | 1.0151                 |
| huggingface | MobileBertForMaskedLM            | 0.6698   | 0.9454                 |
| huggingface | MobileBertForQuestionAnswering   | 0.6085   | 0.8221                 |
| huggingface | DebertaForMaskedLM               | 0.4088   | 1.0667                 |
| huggingface | DebertaForQuestionAnswering      | 0.307    | 1.1932                 |
| huggingface | AllenaiLongformerBase            | nan      | nan                    |
| timm_models | res2net101_26w_4s                | 0.8977   | 0.973                  |
| timm_models | gluon_xception65                 | 0.8973   | 0.9761                 |
| timm_models | fbnetc_100                       | 0.8973   | 0.9876                 |
| timm_models | inception_v3                     | 0.8973   | 1.0246                 |
| timm_models | gluon_inception_v3               | 0.8973   | 1.0246                 |
| timm_models | adv_inception_v3                 | 0.8973   | 1.0246                 |
| timm_models | hrnet_w18                        | 0.8969   | 1.0032                 |
| timm_models | selecsls42b                      | 0.8926   | 0.9897                 |
| timm_models | vit_base_patch16_224             | 0.8916   | 0.8968                 |
| timm_models | deit_base_distilled_patch16_224  | 0.8911   | 0.8962                 |
| timm_models | spnasnet_100                     | 0.8795   | 0.9819                 |
| timm_models | res2net50_14w_8s                 | 0.8769   | 0.9736                 |
| timm_models | res2next50                       | 0.8719   | 0.9671                 |
| timm_models | mnasnet_100                      | 0.871    | 0.9804                 |
| timm_models | mixnet_l                         | 0.8703   | 1.0094                 |
| timm_models | gernet_l                         | 0.8617   | 0.9854                 |
| timm_models | cspdarknet53                     | 0.8606   | 1.0105                 |
| timm_models | botnet26t_256                    | 0.8503   | 0.9434                 |
| timm_models | lcnet_050                        | 0.8449   | 0.9432                 |
| timm_models | regnety_002                      | 0.8371   | 1.0078                 |
| timm_models | crossvit_9_240                   | 0.8174   | 1.0976                 |
| timm_models | coat_lite_mini                   | 0.8032   | 1.0344                 |
| timm_models | repvgg_a2                        | 0.7908   | 0.9916                 |
| timm_models | resmlp_12_224                    | 0.7876   | 0.8011                 |
| timm_models | swin_base_patch4_window7_224     | 0.7569   | 0.9259                 |
| timm_models | sebotnet33ts_256                 | 0.745    | 0.8292                 |
| timm_models | jx_nest_base                     | 0.6707   | 0.8617                 |
| timm_models | eca_halonext26ts                 | nan      | nan                    |
+-------------+----------------------------------+----------+------------------------+
~~~
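The three warning tables above look like simple threshold filters on the inductor column. The sketch below shows one way to express that. The thresholds (speedup below 0.95x, compilation above 120 seconds, compression ratio below 0.9x) are assumptions chosen to be consistent with the rows shown, not values confirmed by the dashboard, and all names are hypothetical.

```python
import math

# Assumed thresholds, inferred from which rows appear in the warning tables.
SPEEDUP_MIN = 0.95
COMPILE_SEC_MAX = 120.0
MEMORY_RATIO_MIN = 0.9


def _missing(value):
    """A metric is missing when the run failed and nothing was recorded."""
    return value is None or math.isnan(value)


def collect_warnings(speedup, compile_sec, memory_ratio):
    """Return the warning categories a single model falls into."""
    flags = []
    if _missing(speedup) or speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if _missing(compile_sec) or compile_sec > COMPILE_SEC_MAX:
        flags.append("compilation_latency")
    if _missing(memory_ratio) or memory_ratio < MEMORY_RATIO_MIN:
        flags.append("memory")
    return flags


# yolov3 inductor values from the torchbench amp tables in this report.
print(collect_warnings(1.0889, 408.5615, 0.9056))  # ['compilation_latency']
```

Treating missing values as warnings keeps failed runs visible instead of letting them drop out of the report silently.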

Metrics over time

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/flpnImg.png)

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/ZFxWq8p.png)

Accuracy Regressions

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0013 | 0.9143 | 2.4613 | 0.7234 | 5.9353 | 1.3193 | | functorch_dp_cifar10 | 64 | 1.0051 | 0.963 | 2.4235 | 0.0 | 4.8427 | 1.3847 | | timm_efficientdet | 1 | 0.9851 | 0.8041 | 2.1172 | 0.0 | 4.7172 | 1.5308 | | resnext50_32x4d | 8 | 1.002 | 0.9661 | 1.8702 | 0.74 | 3.6439 | 1.2729 | | BERT_pytorch | 16 | 1.0108 | 0.8407 | 1.5394 | 0.83 | 3.4892 | 2.3109 | | timm_vision_transformer | 8 | 1.0045 | 0.8417 | 1.7686 | 0.62 | 3.2864 | 1.5294 | | mobilenet_v3_large | 32 | 1.0026 | 0.9258 | 1.5098 | 0.7691 | 3.0206 | 1.4007 | | resnet18 | 16 | 0.9973 | 0.998 | 1.6325 | 0.8022 | 2.7582 | 1.25 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9991 | 0.9915 | 1.4923 | 0.0 | 2.7368 | 1.5823 | | mnasnet1_0 | 32 | 1.002 | 1.0199 | 1.2675 | 0.7681 | 2.5974 | 1.3447 | | hf_T5_large | 2 | 1.0258 | 0.8499 | 0.0 | 0.0 | 2.5841 | 2.1233 | | dcgan | 32 | 0.9886 | 0.916 | 1.6867 | 0.7173 | 2.564 | 1.0696 | | squeezenet1_1 | 32 | 0.9978 | 0.9698 | 1.483 | 0.7396 | 2.4327 | 1.3017 | | hf_Albert | 8 | 1.0007 | 0.9612 | 0.7731 | 0.0 | 2.3758 | 2.3184 | | drq | 1 | 1.0 | 0.8288 | 1.9592 | 0.6096 | 2.2569 | 1.1479 | | timm_efficientnet | 32 | 0.961 | 0.8089 | 1.0916 | 0.6895 | 2.1254 | 1.2684 | | pytorch_struct | 200 | 0.996 | 0.7492 | 1.016 | 0.6075 | 2.1171 | 1.2724 | | resnet152 | 32 | 1.0011 | 1.0054 | 1.2621 | 0.0 | 2.0826 | 1.2914 | | hf_Bert | 4 | 1.035 | 0.8572 | 0.95 | 0.0 | 2.0754 | 1.7975 | | lennard_jones | 1000 | 0.9756 | 0.771 | 1.2886 | 0.4915 | 2.046 | 1.0503 | | timm_resnest | 32 | 1.0032 | 1.0205 | 0.8419 | 0.9613 | 1.9298 | 1.6891 | | hf_GPT2 | 4 | 1.0185 | 0.9831 | 
0.8161 | 0.2913 | 1.9139 | 1.9815 | | hf_T5 | 8 | 1.0019 | 0.9206 | 0.0 | 1.3518 | 1.8685 | 1.8759 | | LearningToPaint | 96 | 0.9998 | 1.0066 | 1.1306 | 0.8345 | 1.8521 | 1.302 | | shufflenet_v2_x1_0 | 128 | 1.0004 | 1.0171 | 0.9812 | 0.8598 | 1.8294 | 1.4109 | | resnet50 | 32 | 1.0018 | 1.0117 | 1.0506 | 0.8074 | 1.7724 | 1.3578 | | hf_Bart | 4 | 1.0143 | 0.8352 | 0.8804 | 0.0 | 1.754 | 1.71 | | soft_actor_critic | 256 | 1.0075 | 0.7479 | 1.3225 | 0.528 | 1.7387 | 1.0293 | | speech_transformer | 32 | 1.0063 | 0.8697 | 1.9104 | 0.0 | 1.6864 | 1.6715 | | pytorch_stargan | 16 | 0.9994 | 1.0995 | 1.0424 | 0.0 | 1.6221 | 1.3969 | | mobilenet_v2 | 96 | 0.9999 | 0.9888 | 0.7584 | 1.0351 | 1.576 | 1.5157 | | attention_is_all_you_need_pytorch | 256 | 1.0087 | 0.8834 | 0.8645 | 0.0 | 1.5687 | 1.4678 | | fastNLP_Bert | 6 | 0.9981 | 0.884 | 0.7665 | 0.0 | 1.5305 | 1.4692 | | hf_DistilBert | 8 | 0.9995 | 0.9529 | 0.742 | 0.368 | 1.5099 | 1.4884 | | timm_nfnet | 128 | 0.9989 | 0.9991 | 0.8807 | 0.9195 | 1.5065 | 1.432 | | pytorch_unet | 1 | 1.0001 | 0.9923 | 0.8631 | 1.0848 | 1.3589 | 1.3296 | | timm_regnet | 32 | 0.9802 | 0.9392 | 0.896 | 0.7883 | 1.334 | 1.2247 | | timm_vovnet | 32 | 0.9209 | 0.8906 | 0.8672 | 0.8047 | 1.3176 | 1.1528 | | vgg16 | 64 | 0.9996 | 0.9969 | 0.8574 | 0.9709 | 1.2722 | 1.2643 | | Background_Matting | 4 | 0.9997 | 1.019 | 0.8971 | 1.0573 | 1.2399 | 1.2257 | | Super_SloMo | 6 | 0.9997 | 0.9948 | 0.8851 | 0.0 | 1.2234 | 1.1967 | | alexnet | 128 | 0.9983 | 0.9966 | 0.8157 | 0.9314 | 1.2103 | 1.2083 | | hf_Reformer | 4 | 0.9985 | 0.9995 | 0.9909 | 0.6498 | 1.1776 | 1.1762 | | timm_vision_transformer_large | 8 | 1.0003 | 0.9902 | 0.0 | 0.0 | 1.1567 | 1.1371 | | hf_BigBird | 2 | 0.9881 | 0.9138 | 1.0396 | 0.8289 | 1.1469 | 1.0278 | | yolov3 | 16 | 0.9994 | 0.9898 | 0.804 | 0.0 | 1.0889 | 1.0652 | | tts_angular | 64 | 0.9662 | 0.9171 | 0.9799 | 0.958 | 1.0178 | 1.0249 | | demucs | 4 | 1.0034 | 1.0025 | 0.9999 | 1.0017 | 0.9964 | 0.9988 | | 
nvidia_deeprecommender | 256 | 0.9988 | 0.9963 | 0.6971 | 1.0078 | 0.9895 | 1.0307 | | dlrm | 2048 | 0.0 | 1.058 | 0.0 | 0.0 | 0.0 | 1.1582 | | hf_GPT2_large | 4 | 1.0003 | 0.9854 | 0.0 | 0.0 | 0.0 | 1.8635 | | tacotron2 | 64 | 0.9807 | 0.7521 | 1.0162 | 0.5893 | 0.0 | 0.8663 | | hf_Longformer | 2 | 0.9428 | 0.861 | 0.8787 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | 
pass | fail_to_run | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 
200 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | pass | fail_to_run | pass | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.9683 | 8.241 | 11.7909 | nan | 408.5615 | 403.3238 | | timm_efficientdet | 1 | 19.9819 | 37.2387 | 77.0195 | nan | 136.5363 | 133.8119 | | hf_T5_large | 2 | 14.3495 | 39.2594 | nan | nan | 114.5587 | 112.1593 | | timm_vision_transformer_large | 8 | 2.8766 | 15.0595 | nan | nan | 69.0787 | 66.936 | | densenet121 | 4 | 2.2607 | 11.7367 | 19.232 | 232.8928 | 47.3546 | 46.6786 | | 
resnet152 | 32 | 2.534 | 13.4485 | 21.7712 | nan | 47.1429 | 45.4634 | | hf_BigBird | 2 | 8.4355 | 15.8405 | 31.2643 | 120.3086 | 44.4102 | 29.4048 | | timm_resnest | 32 | 0.6071 | 2.4878 | 3.8611 | 66.7551 | 37.8958 | 37.9416 | | timm_vision_transformer | 8 | 0.9966 | 4.6303 | 6.7013 | 91.8614 | 35.1861 | 33.5447 | | attention_is_all_you_need_pytorch | 256 | 1.3139 | 7.3237 | 11.5167 | nan | 34.8751 | 33.6024 | | BERT_pytorch | 16 | 1.7164 | 7.6384 | 11.5031 | 140.1572 | 31.497 | 31.3874 | | timm_nfnet | 128 | 2.1033 | 7.1474 | 10.599 | 160.6484 | 31.2368 | 29.2635 | | hf_Bart | 4 | 2.0081 | 8.8603 | 13.7162 | nan | 31.105 | 30.5643 | | hf_T5 | 8 | 2.597 | 8.8275 | nan | 108.578 | 28.7369 | 28.0584 | | fastNLP_Bert | 6 | 1.7696 | 7.16 | 11.6444 | nan | 28.2736 | 26.3085 | | timm_regnet | 32 | 2.4957 | 7.9737 | 19.839 | 142.9489 | 27.1954 | 25.756 | | speech_transformer | 32 | 1.9676 | 9.0147 | 32.9951 | nan | 26.1487 | 25.6195 | | timm_efficientnet | 32 | 1.8433 | 6.6896 | 15.6884 | 152.3246 | 25.9975 | 25.6423 | | mobilenet_v3_large | 32 | 0.9724 | 4.7325 | 7.2469 | 118.8263 | 24.9647 | 25.0993 | | pytorch_stargan | 16 | 0.4243 | 1.9674 | 2.9079 | nan | 23.2632 | 26.7215 | | pytorch_struct | 200 | 0.2802 | 0.8563 | 1.4785 | 7.9045 | 22.397 | 23.4428 | | hf_Bert | 4 | 1.7798 | 7.1746 | 10.3723 | nan | 20.4873 | 19.9215 | | mnasnet1_0 | 32 | 0.8765 | 4.3217 | 6.5472 | 89.7441 | 20.4835 | 19.8927 | | shufflenet_v2_x1_0 | 128 | 1.0041 | 5.0131 | 7.5717 | 106.8545 | 20.1009 | 19.2699 | | resnext50_32x4d | 8 | 0.9734 | 4.4852 | 6.6921 | 82.0702 | 19.8657 | 18.483 | | hf_Albert | 8 | 1.5614 | 6.6491 | 10.2034 | nan | 19.6679 | 18.7336 | | hf_Reformer | 4 | 1.7626 | 3.2147 | 6.1837 | 17.7111 | 19.6241 | 16.0934 | | timm_vovnet | 32 | 1.5127 | 4.3369 | 9.9771 | 70.5539 | 19.5896 | 18.7436 | | hf_GPT2 | 4 | 1.618 | 6.3323 | 9.6397 | 112.1649 | 19.5515 | 19.6399 | | resnet50 | 32 | 0.9217 | 4.4829 | 6.8697 | 100.0485 | 19.5514 | 19.1127 | | mobilenet_v2 | 96 | 0.871 | 
4.5275 | 7.2876 | 114.595 | 19.4609 | 18.1582 |
| Background_Matting                | 4    | 0.8946  | 4.2399    | 6.4472         | 91.2248         | 18.3556  | 16.8696                |
| Super_SloMo                       | 6    | 0.9132  | 4.142     | 5.6664         | nan             | 16.5743  | 16.1601                |
| hf_DistilBert                     | 8    | 0.7923  | 3.757     | 5.671          | 64.1914         | 13.3724  | 12.8186                |
| functorch_dp_cifar10              | 64   | 0.3918  | 1.6664    | 2.4586         | nan             | 13.1711  | 13.0904                |
| resnet18                          | 16   | 0.4449  | 1.7701    | 2.6148         | 40.7007         | 11.2892  | 11.2029                |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.4391  | 2.0291    | 2.8212         | nan             | 8.9555   | 8.6536                 |
| pytorch_unet                      | 1    | 0.4071  | 1.7513    | 2.7372         | 38.6667         | 8.1409   | 7.8657                 |
| LearningToPaint                   | 96   | 0.4627  | 1.8662    | 2.8708         | 46.4804         | 7.8246   | 7.5586                 |
| squeezenet1_1                     | 32   | 0.2674  | 0.9457    | 1.3995         | 7.1057          | 4.6559   | 4.4427                 |
| drq                               | 1    | 0.3149  | 0.6454    | 1.0303         | 6.0259          | 4.3346   | 3.723                  |
| vgg16                             | 64   | 0.2079  | 0.6807    | 1.0946         | 5.7302          | 4.1146   | 3.8305                 |
| nvidia_deeprecommender            | 256  | 0.2169  | 0.5221    | 0.8216         | 5.8378          | 3.5945   | 3.3841                 |
| soft_actor_critic                 | 256  | 0.213   | 0.3696    | 0.5914         | 2.7862          | 3.4835   | 2.8948                 |
| alexnet                           | 128  | 0.1728  | 0.4407    | 0.7233         | 5.0283          | 3.2551   | 3.0862                 |
| dcgan                             | 32   | 0.1772  | 0.4128    | 0.6922         | 5.0272          | 2.87     | 2.6328                 |
| lennard_jones                     | 1000 | 0.1609  | 0.3464    | 0.5304         | 3.0588          | 2.1425   | 1.9813                 |
| tts_angular                       | 64   | 0.2296  | 0.2759    | 0.4074         | 1.5121          | 1.9634   | 1.739                  |
| demucs                            | 4    | 0.3544  | 0.3527    | 0.3528         | 0.3471          | 0.2703   | 0.259                  |
| tacotron2                         | 64   | 18.5059 | 32.515    | 50.014         | 107.9253        | nan      | 64.891                 |
| hf_GPT2_large                     | 4    | 5.7904  | 20.3097   | nan            | nan             | nan      | 51.6864                |
| dlrm                              | 2048 | nan     | 0.85      | nan            | nan             | nan      | 3.2999                 |
| hf_Longformer                     | 2    | 6.6687  | 14.909    | 58.3523        | nan             | nan      | nan                    |
| moco                              | 0    | nan     | nan       | nan            | nan             | nan      | nan                    |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name                              | bs   | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| timm_efficientnet                 | 32   | 0.988  | 0.7698    | 0.2717         | 0.4638          | 1.2042   | 1.2318                 |
| hf_Albert                         | 8    | 0.9813 | 0.9356    | 0.3268         | nan             | 1.1573   | 1.4691                 |
| speech_transformer                | 32   | 1.0001 | 0.9165    | 0.3313         | nan             | 1.1082   | 1.116                  |
| mobilenet_v2                      | 96   | 0.9857 | 0.7639    | 0.3119         | 0.9124          | 1.0606   | 1.1512                 |
| Super_SloMo                       | 6    | 1.0021 | 0.9646    | 0.3844         | nan             | 1.0546   | 1.3033                 |
| timm_nfnet                        | 128  | 0.9695 | 0.8983    | 0.3558         | 0.4818          | 1.0336   | 1.1302                 |
| attention_is_all_you_need_pytorch | 256  | 0.9979 | 0.94      | 0.3514         | nan             | 1.0179   | 1.1759                 |
| Background_Matting                | 4    | 1.0139 | 0.9632    | 0.3723         | 0.9773          | 0.9918   | 1.0432                 |
| tts_angular                       | 64   | 1.0002 | 1.0002    | 0.9853         | 1.0003          | 0.9895   | 1.0002                 |
| demucs                            | 4    | 0.9869 | 0.9869    | 0.9869         | 0.9869          | 0.9869   | 0.9869                 |
| timm_efficientdet                 | 1    | 1.028  | 0.8414    | 0.3082         | nan             | 0.9826   | 1.0135                 |
| BERT_pytorch                      | 16   | 0.9993 | 0.8806    | 0.3994         | 1.1037          | 0.9736   | 1.1219                 |
| hf_GPT2                           | 4    | 0.9707 | 0.8847    | 0.3799         | 1.118           | 0.9648   | 1.1245                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.9935 | 0.878     | 0.421          | nan             | 0.9479   | 1.0517                 |
| timm_regnet                       | 32   | 0.9954 | 0.8452    | 0.3493         | 0.8032          | 0.9348   | 1.0309                 |
| hf_T5                             | 8    | 0.9678 | 0.9331    | nan            | 1.014           | 0.9309   | 1.2521                 |
| resnet152                         | 32   | 0.9929 | 0.8948    | 0.3627         | nan             | 0.9119   | 0.9392                 |
| pytorch_unet                      | 1    | 0.9968 | 0.8653    | 0.3572         | 0.8496          | 0.9111   | 1.0853                 |
| yolov3                            | 16   | 0.9902 | 0.8373    | 0.3534         | nan             | 0.9056   | 1.0455                 |
| timm_vision_transformer_large     | 8    | 0.9975 | 0.8359    | nan            | nan             | 0.879    | 0.9542                 |
| timm_resnest                      | 32   | 0.9868 | 0.8711    | 0.3482         | 0.8451          | 0.8759   | 0.9953                 |
| densenet121                       | 4    | 0.9857 | 0.8678    | 0.3673         | 0.8452          | 0.8753   | 1.0051                 |
| hf_Bert                           | 4    | 0.9844 | 0.8759    | 0.3904         | nan             | 0.8738   | 0.9419                 |
| squeezenet1_1                     | 32   | 0.9604 | 0.7958    | 0.3463         | 0.8714          | 0.8735   | 1.0608                 |
| shufflenet_v2_x1_0                | 128  | 0.956  | 0.8401    | 0.3575         | 0.8489          | 0.869    | 0.993                  |
| fastNLP_Bert                      | 6    | 1.0028 | 0.8977    | 0.3702         | nan             | 0.8661   | 1.0682                 |
| resnet50                          | 32   | 0.9898 | 0.8628    | 0.3563         | 0.7814          | 0.8657   | 0.885                  |
| hf_T5_large                       | 2    | 0.8541 | 0.8541    | nan            | nan             | 0.8541   | 0.8541                 |
| hf_DistilBert                     | 8    | 0.9504 | 0.8808    | 0.3411         | 1.0621          | 0.8383   | 0.9051                 |
| dcgan                             | 32   | 0.9698 | 0.7838    | 0.4994         | 0.7838          | 0.8283   | 0.9695                 |
| hf_Bart                           | 4    | 0.9104 | 0.8318    | 0.3636         | nan             | 0.8231   | 0.9873                 |
| hf_BigBird                        | 2    | 0.984  | 0.9787    | 0.4544         | 1.2209          | 0.8109   | 1.0963                 |
| alexnet                           | 128  | 0.951  | 0.7753    | 0.4792         | 0.775           | 0.7973   | 1.0079                 |
| mobilenet_v3_large                | 32   | 0.9776 | 0.8499    | 0.3448         | 0.7921          | 0.791    | 0.8143                 |
| pytorch_stargan                   | 16   | 0.9955 | 0.9766    | 0.4263         | nan             | 0.7801   | 0.8859                 |
| timm_vovnet                       | 32   | 0.9903 | 0.7678    | 0.341          | 0.7755          | 0.7799   | 0.8875                 |
| resnext50_32x4d                   | 8    | 0.9951 | 0.8553    | 0.3889         | 0.81            | 0.7647   | 0.775                  |
| vgg16                             | 64   | 0.9924 | 0.7339    | 0.3776         | 0.7341          | 0.7633   | 1.0588                 |
| mnasnet1_0                        | 32   | 0.9785 | 0.8621    | 0.3408         | 0.8226          | 0.7541   | 0.7741                 |
| drq                               | 1    | 0.9877 | 0.8312    | 0.4769         | 0.8309          | 0.752    | 0.9256                 |
| LearningToPaint                   | 96   | 0.9252 | 0.7196    | 0.3826         | 0.6701          | 0.7295   | 0.9035                 |
| soft_actor_critic                 | 256  | 0.9998 | 0.9149    | 0.4736         | 0.9302          | 0.7295   | 1.0367                 |
| timm_vision_transformer           | 8    | 0.9938 | 0.8822    | 0.3905         | 1.0823          | 0.7139   | 0.724                  |
| resnet18                          | 16   | 0.9779 | 0.7727    | 0.3943         | 0.7314          | 0.6102   | 0.6257                 |
| lennard_jones                     | 1000 | 0.9995 | 0.9997    | 0.3734         | 0.9996          | 0.564    | 0.9991                 |
| nvidia_deeprecommender            | 256  | 0.5596 | 0.5596    | 0.5124         | 0.5596          | 0.5596   | 0.5596                 |
| hf_Reformer                       | 4    | 0.9861 | 0.9861    | 0.5889         | 0.9861          | 0.5295   | 0.9885                 |
| functorch_dp_cifar10              | 64   | 0.9964 | 0.8107    | 0.4447         | nan             | 0.4478   | 0.4806                 |
| pytorch_struct                    | 200  | 1.0    | 0.5081    | 0.4858         | 0.5099          | 0.4235   | 0.4353                 |
| hf_GPT2_large                     | 4    | 0.9581 | 0.8718    | nan            | nan             | nan      | 1.1354                 |
| dlrm                              | 2048 | nan    | 0.7305    | nan            | nan             | nan      | 0.7306                 |
| tacotron2                         | 64   | 0.9862 | 0.3962    | 0.3142         | 0.3829          | nan      | 0.4113                 |
| hf_Longformer                     | 2    | 0.9731 | 0.9666    | 0.3489         | nan             | nan      | nan                    |
| moco                              | 0    | nan    | nan       | nan            | nan             | nan      | nan                    |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
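
The comment does not define "Peak Memory Compression Ratio". A minimal sketch for reading the column, assuming (this is an assumption, not stated in the thread) that the ratio is eager peak memory divided by the backend's peak memory, so values above 1 mean the backend peaks lower than eager. The helper names `compression_ratio` and `describe` are hypothetical and the byte counts are made up for illustration.

```python
def compression_ratio(eager_peak_bytes: float, backend_peak_bytes: float) -> float:
    """ASSUMED definition: eager peak memory / backend peak memory."""
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return eager_peak_bytes / backend_peak_bytes


def describe(ratio: float) -> str:
    """Turn a ratio into a plain-language reading under the assumed definition."""
    if ratio != ratio:  # NaN: the run did not produce a measurement
        return "no measurement"
    if ratio > 1.0:
        return "backend uses less peak memory than eager"
    if ratio < 1.0:
        return "backend uses more peak memory than eager"
    return "same peak memory as eager"


# Hypothetical byte counts, not taken from the dashboard.
ratio = compression_ratio(eager_peak_bytes=8e9, backend_peak_bytes=1e10)
print(round(ratio, 4), "->", describe(ratio))
```

Under that reading, the aot_cudagraphs column (mostly 0.3 to 0.5) would indicate a peak roughly two to three times the eager one.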

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name                                    | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert                        | 1   | 1.0283 | 0.8209    | 2.2712         | 0.0             | 4.6529   | 1.5853                 |
| MobileBertForMaskedLM                   | 32  | 1.0193 | 0.8168    | 1.7675         | 0.0             | 4.3902   | 1.7169                 |
| MT5ForConditionalGeneration             | 8   | 1.015  | 0.8634    | 1.7972         | 0.9131          | 3.9337   | 2.4465                 |
| MobileBertForQuestionAnswering          | 64  | 1.0182 | 0.8198    | 1.7296         | 0.0             | 3.7831   | 1.8091                 |
| CamemBert                               | 1   | 1.0357 | 0.8351    | 1.735          | 0.0             | 3.5087   | 1.7792                 |
| DistillGPT2                             | 1   | 1.0314 | 0.862     | 1.2596         | 0.0             | 2.6301   | 1.9811                 |
| MegatronBertForQuestionAnswering        | 16  | 1.0354 | 0.85      | 1.0384         | 0.0             | 2.3407   | 1.7773                 |
| GPT2ForSequenceClassification           | 4   | 1.0013 | 0.9756    | 0.0            | 0.5005          | 2.3157   | 2.2892                 |
| M2M100ForConditionalGeneration          | 8   | 1.0945 | 0.899     | 1.2063         | 0.7128          | 2.2516   | 1.7527                 |
| ElectraForQuestionAnswering             | 64  | 1.0011 | 0.9676    | 0.7664         | 0.0             | 2.1232   | 2.0631                 |
| PLBartForConditionalGeneration          | 16  | 1.0109 | 0.8356    | 1.0338         | 0.0             | 1.9493   | 1.7154                 |
| LayoutLMForSequenceClassification       | 16  | 1.0005 | 0.9795    | 0.7753         | 0.0             | 1.854    | 1.81                   |
| MegatronBertForCausalLM                 | 16  | 1.0336 | 0.8538    | 0.9756         | 0.0             | 1.8268   | 1.729                  |
| ElectraForCausalLM                      | 32  | 1.0007 | 0.9395    | 0.7122         | 0.0             | 1.8247   | 1.8264                 |
| XGLMForCausalLM                         | 8   | 1.0119 | 0.8246    | 0.9136         | 0.0             | 1.7687   | 1.5802                 |
| MBartForConditionalGeneration           | 16  | 1.0072 | 0.838     | 0.902          | 0.0             | 1.6765   | 1.6075                 |
| PegasusForConditionalGeneration         | 16  | 1.0118 | 0.825     | 0.9335         | 0.6255          | 1.672    | 1.5591                 |
| AlbertForQuestionAnswering              | 4   | 1.0003 | 0.8853    | 0.0            | 0.0             | 1.6683   | 1.658                  |
| AlbertForMaskedLM                       | 4   | 1.0005 | 0.8848    | 0.0            | 0.0             | 1.6575   | 1.6483                 |
| LayoutLMForMaskedLM                     | 16  | 1.0002 | 0.9703    | 0.7565         | 0.0             | 1.6488   | 1.638                  |
| T5Small                                 | 1   | 1.0223 | 0.8733    | 1.3578         | 0.8383          | 1.6403   | 1.4185                 |
| T5ForConditionalGeneration              | 4   | 0.9988 | 0.9145    | 0.7524         | 1.0801          | 1.6159   | 1.6005                 |
| Speech2Text2ForCausalLM                 | 128 | 1.0014 | 0.9296    | 0.7168         | 0.8068          | 1.5788   | 1.5767                 |
| OPTForCausalLM                          | 32  | 1.0094 | 0.9279    | 0.7732         | 0.3356          | 1.5566   | 1.5338                 |
| BartForConditionalGeneration            | 2   | 1.0048 | 0.9689    | 0.0            | 0.0             | 1.5257   | 1.4214                 |
| RobertaForQuestionAnswering             | 128 | 1.0    | 0.9826    | 0.7789         | 0.0             | 1.5063   | 1.4629                 |
| BertForQuestionAnswering                | 128 | 0.9999 | 0.9835    | 0.7766         | 0.0             | 1.4943   | 1.471                  |
| DistilBertForQuestionAnswering          | 64  | 1.0008 | 0.9667    | 0.741          | 0.3582          | 1.4924   | 1.4447                 |
| RobertaForCausalLM                      | 64  | 1.0005 | 0.9583    | 0.7532         | 0.0             | 1.4343   | 1.4209                 |
| BlenderbotSmallForConditionalGeneration | 64  | 1.0082 | 0.9221    | 0.7289         | 0.0             | 1.431    | 1.427                  |
| BartForCausalLM                         | 4   | 1.0009 | 0.9684    | 0.7571         | 0.0             | 1.4228   | 1.4401                 |
| BertForMaskedLM                         | 64  | 1.0001 | 0.9566    | 0.7404         | 0.0             | 1.3584   | 1.3367                 |
| DebertaForMaskedLM                      | 4   | 0.9093 | 0.7234    | 0.8001         | 0.0             | 1.332    | 1.1314                 |
| PLBartForCausalLM                       | 32  | 1.0071 | 0.9414    | 0.7908         | 0.8366          | 1.2868   | 1.2805                 |
| DistilBertForMaskedLM                   | 64  | 1.0    | 0.9491    | 0.7095         | 0.4648          | 1.2641   | 1.2656                 |
| BlenderbotSmallForCausalLM              | 64  | 1.0049 | 0.9245    | 0.7171         | 0.0             | 1.2604   | 1.2739                 |
| MBartForCausalLM                        | 32  | 1.0039 | 0.9448    | 0.7544         | 0.0             | 1.1942   | 1.1894                 |
| TrOCRForCausalLM                        | 32  | 1.0012 | 0.9463    | 0.7553         | 0.0             | 1.1928   | 1.1921                 |
| PegasusForCausalLM                      | 32  | 1.0031 | 0.9506    | 0.7479         | 0.8503          | 1.1841   | 1.1739                 |
| BigBird                                 | 1   | 0.9867 | 0.9057    | 1.0378         | 0.8228          | 1.1474   | 1.0224                 |
| DebertaForQuestionAnswering             | 8   | 0.985  | 0.8585    | 0.7223         | 0.0             | 1.1412   | 1.2236                 |
| AllenaiLongformerBase                   | 1   | 0.925  | 0.7155    | 0.8529         | 0.0             | 0.0      | 0.0                    |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name                                    | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor    | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| BigBird                                 | 1  | pass  | pass      | pass           | pass            | pass        | pass                   |
| MT5ForConditionalGeneration             | 1  | pass  | pass      | pass           | pass            | pass        | pass                   |
| MBartForCausalLM                        | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| MegatronBertForCausalLM                 | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| MegatronBertForQuestionAnswering        | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| MobileBertForMaskedLM                   | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| MobileBertForQuestionAnswering          | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| PLBartForCausalLM                       | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| PegasusForCausalLM                      | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| PegasusForConditionalGeneration         | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| RobertaForCausalLM                      | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| RobertaForQuestionAnswering             | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| Speech2Text2ForCausalLM                 | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| TrOCRForCausalLM                        | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| XGLMForCausalLM                         | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| XLNetLMHeadModel                        | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| YituTechConvBert                        | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BartForConditionalGeneration            | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| GPT2ForSequenceClassification           | 1  | pass  | pass      | fail_to_run    | fail_to_run     | pass        | pass                   |
| M2M100ForConditionalGeneration          | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| LayoutLMForSequenceClassification       | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| LayoutLMForMaskedLM                     | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BlenderbotSmallForCausalLM              | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| OPTForCausalLM                          | 1  | pass  | pass      | pass           | pass            | pass        | pass                   |
| T5ForConditionalGeneration              | 1  | pass  | pass      | pass           | pass            | pass        | pass                   |
| T5Small                                 | 1  | pass  | pass      | pass           | pass            | pass        | pass                   |
| AlbertForMaskedLM                       | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| AlbertForQuestionAnswering              | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BartForCausalLM                         | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BertForMaskedLM                         | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| ElectraForQuestionAnswering             | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BertForQuestionAnswering                | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| BlenderbotSmallForConditionalGeneration | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| CamemBert                               | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| DebertaForMaskedLM                      | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| DebertaForQuestionAnswering             | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| DistilBertForMaskedLM                   | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| DistilBertForQuestionAnswering          | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| DistillGPT2                             | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| ElectraForCausalLM                      | 1  | pass  | pass      | pass           | fail_to_run     | pass        | pass                   |
| AllenaiLongformerBase                   | 1  | pass  | pass      | pass           | fail_to_run     | fail_to_run | fail_to_run            |
| MBartForConditionalGeneration           | 1  | pass  | pass      | pass           | fail_to_run     | fail_to_run | fail_to_run            |
| PLBartForConditionalGeneration          | 1  | pass  | pass      | pass           | fail_to_run     | fail_to_run | fail_to_run            |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name                                    | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering             | 8   | 5.2088 | 11.437    | 37.6911        | nan             | 102.1539 | 37.8724                |
| DebertaForMaskedLM                      | 4   | 5.1571 | 11.4838   | 36.6857        | nan             | 99.1339  | 36.8513                |
| XGLMForCausalLM                         | 8   | 3.0515 | 13.6823   | 27.8041        | nan             | 74.9418  | 73.3306                |
| M2M100ForConditionalGeneration          | 8   | 3.3688 | 15.9636   | 28.6209        | 419.7935        | 68.7312  | 63.5356                |
| MobileBertForMaskedLM                   | 32  | 9.9427 | 32.9793   | 59.6868        | nan             | 67.2725  | 67.0575                |
| MobileBertForQuestionAnswering          | 64  | 9.7579 | 32.3335   | 59.2234        | nan             | 65.5569  | 66.7489                |
| PegasusForConditionalGeneration         | 16  | 3.5588 | 16.9255   | 27.7711        | 457.2228        | 54.3855  | 50.5781                |
| BartForConditionalGeneration            | 2   | 3.6285 | 16.9109   | nan            | nan             | 51.1627  | 49.176                 |
| MBartForConditionalGeneration           | 16  | 3.7892 | 17.3595   | 29.3931        | nan             | 50.1189  | 48.8046                |
| BigBird                                 | 1   | 8.3395 | 15.5172   | 30.3948        | 126.6826        | 45.9929  | 28.2124                |
| YituTechConvBert                        | 1   | 2.6543 | 10.6431   | 16.5651        | nan             | 42.9189  | 40.2166                |
| MegatronBertForCausalLM                 | 16  | 3.7221 | 14.5387   | 22.952         | nan             | 41.5039  | 39.1658                |
| MegatronBertForQuestionAnswering        | 16  | 3.7053 | 14.3555   | 22.2307        | nan             | 39.829   | 38.3252                |
| MT5ForConditionalGeneration             | 8   | 3.8542 | 13.0701   | 21.6181        | 178.2102        | 37.0391  | 35.3967                |
| BlenderbotSmallForConditionalGeneration | 64  | 2.343  | 11.1922   | 18.446         | nan             | 34.6394  | 32.4857                |
| T5Small                                 | 1   | 2.5585 | 8.8032    | 13.4044        | 111.7065        | 31.3935  | 31.1494                |
| PLBartForConditionalGeneration          | 16  | 1.9289 | 8.8767    | 13.7239        | nan             | 29.7181  | 29.0378                |
| T5ForConditionalGeneration              | 4   | 2.5542 | 8.8951    | 13.4463        | 111.6039        | 29.6634  | 29.0923                |
| LayoutLMForSequenceClassification       | 16  | 2.2323 | 7.6437    | 11.5874        | nan             | 27.7836  | 27.3179                |
| ElectraForCausalLM                      | 32  | 1.8771 | 7.3393    | 11.3227        | nan             | 26.8822  | 24.9467                |
| PegasusForCausalLM                      | 32  | 1.5156 | 6.6333    | 10.5665        | 134.9342        | 24.847   | 23.3933                |
| MBartForCausalLM                        | 32  | 1.4847 | 6.7245    | 9.9521         | nan             | 23.55    | 22.4366                |
| TrOCRForCausalLM                        | 32  | 1.3701 | 6.732     | 9.9929         | nan             | 22.8899  | 22.0094                |
| LayoutLMForMaskedLM                     | 16  | 2.2666 | 7.6621    | 11.9106        | nan             | 22.4475  | 21.4893                |
| BartForCausalLM                         | 4   | 1.4483 | 6.6308    | 9.9037         | nan             | 22.4189  | 21.2815                |
| BertForMaskedLM                         | 64  | 1.7426 | 7.1752    | 10.5381        | nan             | 22.2674  | 20.9198                |
| RobertaForCausalLM                      | 64  | 1.7625 | 7.142     | 10.5143        | nan             | 21.6834  | 21.2939                |
| ElectraForQuestionAnswering             | 64  | 1.8292 | 7.3014    | 10.8215        | nan             | 21.5446  | 20.7054                |
| OPTForCausalLM                          | 32  | 1.4458 | 6.7023    | 11.6706        | 130.2243        | 21.5352  | 20.9027                |
| BertForQuestionAnswering                | 128 | 1.7078 | 7.1073    | 10.7065        | nan             | 20.8987  | 19.7028                |
| CamemBert                               | 1   | 1.8344 | 7.1991    | 10.7245        | nan             | 19.9324  | 19.1592                |
| RobertaForQuestionAnswering             | 128 | 1.8069 | 7.2125    | 10.4773        | nan             | 19.8234  | 19.169                 |
| GPT2ForSequenceClassification           | 4   | 1.6916 | 6.4789    | nan            | 116.5832        | 19.3108  | 18.9882                |
| AlbertForQuestionAnswering              | 4   | 1.7292 | 6.9278    | nan            | nan             | 19.05    | 17.3085                |
| AlbertForMaskedLM                       | 4   | 1.4822 | 6.5828    | nan            | nan             | 18.9285  | 17.9323                |
| BlenderbotSmallForCausalLM              | 64  | 0.9855 | 4.4326    | 6.5234         | nan             | 16.9271  | 16.2415                |
| Speech2Text2ForCausalLM                 | 128 | 0.8421 | 3.5026    | 5.7136         | 61.4215         | 16.0442  | 14.5215                |
| PLBartForCausalLM                       | 32  | 0.7863 | 3.4837    | 5.1875         | 74.536          | 14.9784  | 14.5567                |
| DistillGPT2                             | 1   | 0.93   | 3.3543    | 4.8702         | nan             | 13.8349  | 13.3096                |
| DistilBertForMaskedLM                   | 64  | 0.7548 | 3.5379    | 6.1981         | 63.0451         | 13.1251  | 12.1574                |
| DistilBertForQuestionAnswering          | 64  | 0.7718 | 3.5379    | 6.1404         | 68.8689         | 12.3363  | 11.7455                |
| AllenaiLongformerBase                   | 1   | 6.8211 | 15.482    | 59.5875        | nan             | nan      | nan                    |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name                                    | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForQuestionAnswering              | 4   | 1.0    | 0.754     | nan            | nan             | 1.1305   | 1.559                  |
| AlbertForMaskedLM                       | 4   | 0.9998 | 0.7431    | nan            | nan             | 1.0992   | 1.5169                 |
| GPT2ForSequenceClassification           | 4   | 0.9675 | 0.9165    | nan            | 1.1872          | 1.0776   | 1.163                  |
| BartForCausalLM                         | 4   | 1.0    | 0.8997    | 0.3748         | nan             | 1.0568   | 1.1144                 |
| ElectraForQuestionAnswering             | 64  | 1.0016 | 0.9538    | 0.3384         | nan             | 1.017    | 1.0704                 |
| BertForQuestionAnswering                | 128 | 1.0008 | 0.952     | 0.3554         | nan             | 1.0109   | 1.0722                 |
| RobertaForQuestionAnswering             | 128 | 1.0008 | 0.952     | 0.3554         | nan             | 1.0109   | 1.0722                 |
| LayoutLMForSequenceClassification       | 16  | 1.004  | 0.9325    | 0.3632         | nan             | 1.0045   | 1.0277                 |
| BartForConditionalGeneration            | 2   | 1.0    | 0.9073    | nan            | nan             | 0.9837   | 1.1976                 |
| PegasusForCausalLM                      | 32  | 0.975  | 0.9115    | 0.4176         | 1.1001          | 0.9709   | 1.0364                 |
| T5ForConditionalGeneration              | 4   | 0.9996 | 0.9527    | 0.3625         | 1.0964          | 0.9662   | 1.1856                 |
| BlenderbotSmallForConditionalGeneration | 64  | 0.9999 | 0.8919    | 0.396          | nan             | 0.9593   | 1.1105                 |
| T5Small                                 | 1   | 1.0    | 0.8865    | 0.3606         | 0.9885          | 0.9567   | 1.1277                 |
| LayoutLMForMaskedLM                     | 16  | 0.9999 | 0.9238    | 0.3661         | nan             | 0.9481   | 0.9848                 |
| MBartForCausalLM                        | 32  | 1.0001 | 0.8924    | 0.3996         | nan             | 0.9418   | 1.0115                 |
| BertForMaskedLM                         | 64  | 0.9996 | 0.899     | 0.3786         | nan             | 0.9293   | 0.9792                 |
| RobertaForCausalLM                      | 64  | 0.999  | 0.8994    | 0.3787         | nan             | 0.9289   | 0.9788                 |
| DistilBertForQuestionAnswering          | 64  | 1.0004 | 0.9216    | 0.3469         | 1.0551          | 0.9267   | 1.0655                 |
| OPTForCausalLM                          | 32  | 0.9999 | 0.868     | 0.3725         | 1.0332          | 0.9249   | 1.0061                 |
| MBartForConditionalGeneration           | 16  | 1.0    | 0.8555    | 0.4002         | nan             | 0.9218   | 1.0986                 |
| TrOCRForCausalLM                        | 32  | 1.0001 | 0.8922    | 0.3997         | nan             | 0.9211   | 0.9878                 |
| PegasusForConditionalGeneration         | 16  | 0.9985 | 0.9635    | 0.4377         | 1.1462          | 0.9159   | 1.0993                 |
| MegatronBertForQuestionAnswering        | 16  | 1.0    | 0.8529    | 0.411          | nan             | 0.893    | 1.0179                 |
| MegatronBertForCausalLM                 | 16  | 0.9997 | 0.8597    | 0.4044         | nan             | 0.8918   | 1.0275                 |
| PLBartForConditionalGeneration          | 16  | 0.9984 | 0.9002    | 0.4147         | nan             | 0.8848   | 1.028                  |
| DistilBertForMaskedLM                   | 64  | 0.9998 | 0.8599    | 0.3635         | 1.0791          | 0.8803   | 0.948                  |
| MT5ForConditionalGeneration             | 8   | 0.9197 | 0.8304    | 0.4068         | 0.9197          | 0.8756   | 0.9197                 |
| Speech2Text2ForCausalLM                 | 128 | 0.9676 | 0.8196    | 0.3532         | 1.0437          | 0.8691   | 0.9801                 |
| ElectraForCausalLM                      | 32  | 0.9974 | 0.848     | 0.3928         | nan             | 0.856    | 0.9327                 |
| PLBartForCausalLM                       | 32  | 1.0    | 0.844     | 0.3977         | 0.9947          | 0.8546   | 0.9358                 |
| BlenderbotSmallForCausalLM              | 64  | 0.9996 | 0.8173    | 0.3687         | nan             | 0.846    | 0.9426                 |
| BigBird                                 | 1   | 1.0009 | 0.9523    | 0.4477         | 1.1336          | 0.8172   | 1.0883                 |
| CamemBert                               | 1   | 0.9998 | 0.815     | 0.4163         | nan             | 0.8062   | 0.9318                 |
| XGLMForCausalLM                         | 8   | 0.9918 | 0.9234    | 0.4336         | nan             | 0.8055   | 0.9905                 |
| DistillGPT2                             | 1   | 0.9964 | 0.7985    | 0.4005         | nan             | 0.8      | 1.0172                 |
| YituTechConvBert                        | 1   | 0.9711 | 0.8662    | 0.4304         | nan             | 0.7879   | 0.9269                 |
| M2M100ForConditionalGeneration          | 8   | 1.0018 | 0.9592    | 0.4275         | 1.0442          | 0.7829   | 1.0151                 |
| MobileBertForMaskedLM                   | 32  | 0.9998 | 0.8864    | 0.3466         | nan             | 0.6698   | 0.9454                 |
| MobileBertForQuestionAnswering          | 64  | 1.0153 | 0.9965    | 0.3107         | nan             | 0.6085   | 0.8221                 |
| DebertaForMaskedLM                      | 4   | 0.9983 | 0.9818    | 0.3622         | nan             | 0.4088   | 1.0667                 |
| DebertaForQuestionAnswering             | 8   | 0.9753 | 1.0735    | 0.3251         | nan             | 0.307    | 1.1932                 |
| AllenaiLongformerBase                   | 1   | 0.9995 | 0.9481    | 0.3846         | nan             | nan      | nan                    |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
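
To summarize one of these tables rather than scan it row by row, a rough sketch like the following may help. It parses a tabulate-style ASCII table and takes the geometric mean of one backend's speedup column. The helper names `parse_table` and `geomean_speedup` are hypothetical, and treating `0.0` and `nan` as "did not run" (and excluding them) is an assumption about how the dashboard encodes failures, not something the thread states. The sample rows are copied from the huggingface speedup table.

```python
import math


def parse_table(text: str) -> list[dict[str, str]]:
    """Parse '| a | b |' rows of an ASCII table into dicts keyed by the header row."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip '+----+' border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(records: list[dict[str, str]], backend: str) -> float:
    """Geometric mean of a backend column, skipping 0.0/nan (assumed failures)."""
    values = [float(r[backend]) for r in records]
    values = [v for v in values if v > 0]  # NaN > 0 is False, so NaN is dropped too
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


sample = """
+-----------------------+----+--------+-----------------+----------+
| name                  | bs | eager  | nvprims_nvfuser | inductor |
+-----------------------+----+--------+-----------------+----------+
| YituTechConvBert      | 1  | 1.0283 | 0.0             | 4.6529   |
| CamemBert             | 1  | 1.0357 | 0.0             | 3.5087   |
| AllenaiLongformerBase | 1  | 0.925  | 0.0             | 0.0      |
+-----------------------+----+--------+-----------------+----------+
"""

records = parse_table(sample)
print("inductor geomean:", geomean_speedup(records, "inductor"))
print("nvprims_nvfuser geomean:", geomean_speedup(records, "nvprims_nvfuser"))
```

A geometric mean is used because the column holds ratios. Whether failed models should be dropped or counted as 1.0x is a reporting choice, and it changes the summary noticeably for backends with many failures.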

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100                    | 128 | 1.0037 | 0.9775    | 0.8878         | 1.0107          | 2.1806   | 1.7861                 |
| tnt_s_patch16_224               | 128 | 0.9999 | 0.9983    | 0.0            | 0.0             | 2.1294   | 2.0904                 |
| regnety_002                     | 128 | 0.9766 | 0.9418    | 1.1435         | 0.8558          | 2.092    | 1.4443                 |
| xcit_large_24_p8_224            | 5   | 1.002  | 0.0       | 0.0            | 0.0             | 2.0312   | 1.7805                 |
| lcnet_050                       | 128 | 0.9685 | 0.9486    | 0.8532         | 1.0167          | 2.0302   | 1.6262                 |
| twins_pcpvt_base                | 64  | 1.0037 | 0.9173    | 0.9554         | 0.0             | 1.9568   | 1.7139                 |
| coat_lite_mini                  | 128 | 0.9997 | 0.9699    | 0.8292         | 1.1503          | 1.7563   | 1.7288                 |
| hrnet_w18                       | 128 | 1.004  | 1.0182    | 0.8596         | 0.0             | 1.6629   | 1.4342                 |
| res2net101_26w_4s               | 64  | 1.0029 | 1.0057    | 0.9581         | 0.0             | 1.5964   | 1.3388                 |
| resnest101e                     | 64  | 0.9995 | 0.9905    | 0.8114         | 0.0             | 1.5913   | 1.4268                 |
| volo_d1_224                     | 64  | 1.0002 | 0.9916    | 0.8469         | 0.0             | 1.5903   | 1.562                  |
| dla102                          | 128 | 1.0001 | 0.9964    | 0.8304         | 1.3157          | 1.579    | 1.5475                 |
| nfnet_l0                        | 128 | 0.9993 | 0.8097    | 0.7111         | 0.8501          | 1.5557   | 1.4679                 |
| gmlp_s16_224                    | 128 | 0.9997 | 0.9422    | 0.7525         | 0.9872          | 1.5533   | 1.5175                 |
| swin_base_patch4_window7_224    | 64  | 1.0001 | 0.9546    | 0.0            | 0.0             | 1.5351   | 1.5139                 |
| gluon_inception_v3              | 128 | 0.9999 | 0.9933    | 0.855          | 1.1429          | 1.5067   | 1.4709                 |
| inception_v3                    | 128 | 1.0    | 0.9963    | 0.8546         | 1.1435          | 1.5044   | 1.47                   |
| adv_inception_v3                | 128 | 0.9999 | 0.9962    | 0.8551         | 1.1417          | 1.5038   | 1.4706                 |
| dm_nfnet_f0                     | 128 | 0.9989 | 0.999     | 0.8781         | 0.9218          | 1.4979   | 1.4282                 |
| gmixer_24_224                   | 128 | 0.9998 | 0.842     | 0.6963         | 0.9183          | 1.4973   | 1.4776                 |
| res2net50_14w_8s                | 128 | 1.0013 | 0.9936    | 0.8126         | 1.0024          | 1.4667   | 1.4253                 |
| cait_m36_384                    | 4   | 0.9997 | 1.0088    | 0.0            | 0.0             | 1.4613   | 1.409                  |
| mobilenetv3_large_100           | 128 | 0.9686 | 0.9441    | 0.7833         | 0.9876          | 1.4516   | 1.429                  |
| crossvit_9_240                  | 128 | 1.0001 | 0.994     | 0.8381         | 0.9195          | 1.4467   | 1.4167                 |
| selecsls42b                     | 128 | 0.9997 | 0.9954    | 0.8412         | 1.2827          | 1.4418   | 1.4123                 |
| mnasnet_100                     | 128 | 0.9528 | 0.9447    | 0.7903         | 1.206           | 1.4304   | 1.456                  |
| convit_base                     | 64  | 0.9998 | 0.9962    | 0.8334         | 1.2334          | 1.4184   | 1.3422                 |
| res2next50                      | 128 | 0.9998 | 0.9954    | 0.8332         | 1.1456          | 1.4169   | 1.3472                 |
| fbnetv3_b                       | 128 | 0.9727 | 0.9421    | 0.775          | 0.0             | 1.4058   | 1.4019                 |
| mobilenetv2_100                 | 128 | 0.952  | 0.942     | 0.7233         | 1.1475          | 1.4023   | 1.4173                 |
| jx_nest_base                    | 32  | 0.9996 | 0.9919    | 0.802          | 0.0             | 1.3867   | 1.3548                 |
| ese_vovnet19b_dw                | 128 | 0.9711 | 0.9648    | 0.7693         | 1.1296          | 1.3757   | 1.3782                 |
| spnasnet_100                    | 128 | 0.9473 | 0.9381    | 0.7795         | 1.0973          | 1.3696   | 1.3922                 |
| mobilevit_s                     | 64  | 0.9737 | 0.8145    | 0.6568         | 0.0             | 1.3681   | 1.3584                 |
| fbnetc_100                      | 128 | 0.9529 | 0.9437    | 0.7936         | 1.1604          | 1.3507   | 1.3721                 |
| tf_efficientnet_b0              | 128 | 0.9659 | 0.8074    | 0.6679         | 0.9517          | 1.3404   | 1.3547                 |
| cspdarknet53                    | 64  | 0.9438 | 0.9343    | 0.7508         | 1.1281          | 1.3282   | 1.3416                 |
| poolformer_m36                  | 64  | 0.9998 | 0.9975    | 0.8066         | 0.0             | 1.3243   | 1.2945                 |
| pit_b_224                       | 64  | 0.9996 | 0.995     | 0.821          | 0.967           | 1.3221   | 1.3162                 |
| pnasnet5large                   | 16  | 1.0056 | 1.035     | 0.8516         | 0.0             | 1.3177   | 1.2672                 |
| botnet26t_256                   | 128 | 0.9791 | 0.9709    | 0.8094         | 1.2721          | 1.2963   | 1.293                  |
| beit_base_patch16_224           | 64  | 1.0    | 0.9777    | 0.0            | 0.0             | 1.2796   | 1.2714                 |
| deit_base_distilled_patch16_224 | 64  | 1.0    | 0.9913    | 0.7969         | 0.9758          | 1.2793   | 1.2586                 |
| rexnet_100                      | 128 | 0.9659 | 0.8502    | 0.691          | 0.0             | 1.2784   | 1.2761                 |
| eca_botnext26ts_256             | 128 | 0.9808 | 0.8101    | 0.6709         | 1.0705          | 1.2758   | 1.2812                 |
| mixer_b16_224                   | 128 | 0.9998 | 0.9585    | 0.7771         | 0.8858          | 1.2754   | 1.2647                 |
| tinynet_a                       | 128 | 0.9524 | 0.8006    | 0.6565         | 0.7881          | 1.2752   | 1.3225                 |
| visformer_small                 | 128 | 0.9998 | 1.0015    | 0.8415         | 0.0             | 1.2313   | 1.1784                 |
| sebotnet33ts_256                | 64  | 0.9669 | 0.8331    | 0.6785         | 0.967           | 1.2113   | 1.2031                 |
| vit_base_patch16_224            | 64  | 0.9999 | 0.9934    | 0.8352         | 0.9104          | 1.1959   | 1.1782                 |
| tf_mixnet_l                     | 128 | 0.9807 | 0.9074    | 0.794          | 0.0             | 1.1774   | 1.1693                 |
| gluon_xception65                | 32  | 0.9998 | 0.988     | 0.7559         | 0.0             | 1.1582   | 1.1237                 |
| mixnet_l                        | 128 | 0.9799 | 0.9057    | 0.7958         | 0.0             | 1.1574   | 1.1534                 |
| dpn107                          | 32  | 0.9409 | 0.9293    | 0.7524         | 0.0             | 1.1549   | 1.1729                 |
| repvgg_a2                       | 128 | 0.9439 | 0.9345    | 0.7997         | 1.0702          | 1.1397   | 1.1575                 |
| swsl_resnext101_32x16d          | 32  | 0.9994 | 0.9815    | 0.8            | 0.0             | 1.1346   | 1.0572                 |
| resmlp_12_224                   | 128 | 1.0004 | 1.0092    | 0.7907         | 1.1855          | 1.0991   | 1.0917                 |
| gernet_l                        | 128 | 0.9479 | 0.9368    | 0.7702         | 1.0641          | 1.0677   | 1.0762                 |
| convmixer_768_32                | 32  | 0.9999 | 0.9982    | 0.9233         | 0.0             | 1.0553   | 1.0504                 |
| convnext_base                   | 64  | 0.9995 | 0.9955    | 0.802          | 0.0             | 0.6781   | 0.6551                 |
| eca_halonext26ts                | 128 | 0.9799 | 0.8168    | 0.6796         | 0.0             | 0.0      | 0.0                    |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name                            | bs | eager | aot_eager     | aot_cudagraphs | nvprims_nvfuser | inductor      | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3                | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| poolformer_m36                  | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| selecsls42b                     | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| swsl_resnext101_32x16d          | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| tf_efficientnet_b0              | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| tf_mixnet_l                     | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| tinynet_a                       | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| visformer_small                 | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| vit_base_patch16_224            | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| coat_lite_mini                  | 2  | pass  | fail_accuracy | fail_accuracy  | pass            | pass          | pass                   |
| convnext_base                   | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| dpn107                          | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| jx_nest_base                    | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| mobilevit_s                     | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| res2net101_26w_4s               | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| rexnet_100                      | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| resnest101e                     | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| tnt_s_patch16_224               | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| twins_pcpvt_base                | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| volo_d1_224                     | 2  | pass  | pass          | pass           | fail_to_run     | pass          | pass                   |
| beit_base_patch16_224           | 2  | pass  | pass          | fail_to_run    | fail_to_run     | pass          | pass                   |
| swin_base_patch4_window7_224    | 2  | pass  | pass          | fail_to_run    | fail_to_run     | pass          | pass                   |
| xcit_large_24_p8_224            | 2  | pass  | fail_to_run   | fail_to_run    | fail_to_run     | pass          | pass                   |
| cait_m36_384                    | 2  | pass  | fail_accuracy | fail_accuracy  | fail_to_run     | pass          | pass                   |
| convmixer_768_32                | 2  | pass  | pass          | pass           | fail_accuracy   | pass          | pass                   |
| dm_nfnet_f0                     | 2  | pass  | pass          | pass           | fail_accuracy   | pass          | pass                   |
| hrnet_w18                       | 2  | pass  | pass          | pass           | fail_accuracy   | pass          | pass                   |
| mobilenetv2_100                 | 2  | pass  | pass          | pass           | fail_accuracy   | pass          | pass                   |
| botnet26t_256                   | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| sebotnet33ts_256                | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| resmlp_12_224                   | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| gmlp_s16_224                    | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| convit_base                     | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| crossvit_9_240                  | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| cspdarknet53                    | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| deit_base_distilled_patch16_224 | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| dla102                          | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| eca_botnext26ts_256             | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| ese_vovnet19b_dw                | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| fbnetc_100                      | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| gernet_l                        | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| ghostnet_100                    | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| gluon_inception_v3              | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| res2next50                      | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| gmixer_24_224                   | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| inception_v3                    | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| lcnet_050                       | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| mixer_b16_224                   | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| mixnet_l                        | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| mnasnet_100                     | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| mobilenetv3_large_100           | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| nfnet_l0                        | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| pit_b_224                       | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| pnasnet5large                   | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| regnety_002                     | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| repvgg_a2                       | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| res2net50_14w_8s                | 2  | pass  | pass          | pass           | pass            | pass          | pass                   |
| eca_halonext26ts                | 2  | pass  | pass          | pass           | fail_to_run     | fail_to_run   | fail_to_run            |
| gluon_xception65                | 2  | pass  | pass          | pass           | pass            | fail_accuracy | fail_accuracy          |
| fbnetv3_b                       | 2  | pass  | pass          | pass           | fail_accuracy   | fail_accuracy | fail_accuracy          |
| spnasnet_100                    | 2  | pass  | pass          | pass           | fail_accuracy   | fail_accuracy | fail_accuracy          |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18                       | 128 | 6.3085 | 29.9091   | 56.6585        | nan             | 118.1136 | 116.8053               |
| twins_pcpvt_base                | 64  | 2.7471 | 14.7731   | 26.3841        | nan             | 93.5793  | 95.2427                |
| pnasnet5large                   | 16  | 5.0832 | 22.6857   | 41.9815        | nan             | 83.1385  | 80.1247                |
| xcit_large_24_p8_224            | 5   | 3.2893 | nan       | nan            | nan             | 79.1417  | 75.9183                |
| swin_base_patch4_window7_224    | 64  | 2.965  | 13.1038   | nan            | nan             | 78.8565  | 76.1423                |
| mobilevit_s                     | 64  | 1.9051 | 7.4692    | 15.5064        | nan             | 73.0976  | 73.2342                |
| convnext_base                   | 64  | 1.5206 | 7.0227    | 11.2208        | nan             | 70.8184  | 69.4925                |
| cait_m36_384                    | 4   | 3.5048 | 19.2475   | nan            | nan             | 69.6079  | 66.7283                |
| resnest101e                     | 64  | 3.4406 | 16.099    | 27.1566        | nan             | 67.4582  | 66.1322                |
| res2net101_26w_4s               | 64  | 3.2414 | 16.6914   | 27.8252        | nan             | 60.5387  | 58.1036                |
| jx_nest_base                    | 32  | 1.7805 | 9.2988    | 14.9937        | nan             | 57.8331  | 56.4979                |
| res2net50_14w_8s                | 128 | 2.8973 | 14.2348   | 24.6296        | 342.3313        | 55.4302  | 54.3195                |
| coat_lite_mini                  | 128 | 1.3156 | 5.2696    | 8.5313         | 112.8405        | 52.8856  | 51.1333                |
| sebotnet33ts_256                | 64  | 1.7616 | 6.1788    | 13.3757        | 153.9154        | 51.6583  | 51.3481                |
| poolformer_m36                  | 64  | 1.895  | 8.1513    | 13.3492        | nan             | 47.9356  | 45.4173                |
| gmlp_s16_224                    | 128 | 1.3616 | 7.3953    | 12.1081        | 205.736         | 43.3975  | 42.0514                |
| dpn107                          | 32  | 4.1498 | 13.3923   | 39.0746        | nan             | 43.0693  | 40.9737                |
| eca_botnext26ts_256             | 128 | 1.3287 | 4.7118    | 10.3654        | 119.0512        | 42.0182  | 40.6943                |
| volo_d1_224                     | 64  | 1.3372 | 7.636     | 12.0736        | nan             | 41.161   | 38.3224                |
| fbnetv3_b                       | 128 | 3.4026 | 11.4408   | 27.8941        | nan             | 41.0082  | 39.2                   |
| crossvit_9_240                  | 128 | 1.6936 | 8.6693    | 13.4291        | 198.5554        | 40.8484  | 39.0347                |
| gluon_xception65                | 32  | 2.0439 | 10.577    | 17.6545        | nan             | 40.2657  | 38.4401                |
| botnet26t_256                   | 128 | 1.4904 | 4.3044    | 9.1985         | 93.2735         | 39.5681  | 38.6852                |
| tnt_s_patch16_224               | 128 | 1.7998 | 10.4001   | nan            | nan             | 39.5495  | 37.8547                |
| gluon_inception_v3              | 128 | 1.6598 | 8.5757    | 13.1416        | 186.0002        | 37.4524  | 34.6042                |
| adv_inception_v3                | 128 | 1.6627 | 8.1869    | 13.2846        | 187.5769        | 36.9057  | 34.9023                |
| inception_v3                    | 128 | 1.7051 | 8.291     | 13.3299        | 184.6571        | 36.7813  | 34.8003                |
| ghostnet_100                    | 128 | 3.1209 | 9.6357    | 14.3276        | 194.4838        | 36.6066  | 34.8267                |
| mixnet_l                        | 128 | 5.4612 | 12.381    | 26.419         | nan             | 36.1489  | 33.3651                |
| tf_mixnet_l                     | 128 | 5.8901 | 13.07     | 27.8249        | nan             | 36.0253  | 34.202                 |
| dla102                          | 128 | 1.8722 | 9.2524    | 16.0648        | 245.1995        | 35.223   | 33.3105                |
| swsl_resnext101_32x16d          | 32  | 1.8787 | 8.9689    | 14.956         | nan             | 34.0879  | 32.1512                |
| gmixer_24_224                   | 128 | 1.4747 | 8.4481    | 13.6065        | 190.179         | 33.988   | 32.8599                |
| dm_nfnet_f0                     | 128 | 2.2978 | 7.4435    | 10.7888        | 166.5911        | 31.8371  | 31.3254                |
| res2next50                      | 128 | 1.6758 | 8.2905    | 12.9868        | 202.1018        | 31.7942  | 30.0252                |
| convit_base                     | 64  | 1.2469 | 5.938     | 9.6119         | 144.9398        | 29.8019  | 28.8219                |
| rexnet_100                      | 128 | 1.8978 | 7.2984    | 16.8321        | nan             | 29.6159  | 28.5178                |
| tinynet_a                       | 128 | 2.1239 | 7.9298    | 20.2163        | 192.4172        | 29.1419  | 27.7794                |
| tf_efficientnet_b0              | 128 | 1.8968 | 6.816     | 16.5283        | 179.4793        | 26.9186  | 23.9689                |
| cspdarknet53                    | 64  | 2.3118 | 7.3551    | 19.4569        | 150.6017        | 26.0721  | 24.8043                |
| mixer_b16_224                   | 128 | 1.016  | 3.8245    | 6.0242         | 85.7782         | 25.2516  | 23.2979                |
| fbnetc_100                      | 128 | 2.11   | 6.7918    | 17.171         | 134.511         | 24.7658  | 23.533                 |
| visformer_small                 | 128 | 0.9742 | 4.0534    | 6.3743         | nan             | 24.6006  | 22.593                 |
| deit_base_distilled_patch16_224 | 64  | 1.1422 | 4.8951    | 7.3798         | 82.6982         | 24.2731  | 23.9654                |
| spnasnet_100                    | 128 | 2.0621 | 6.6075    | 16.6671        | 133.7901        | 24.218   | 22.5765                |
| convmixer_768_32                | 32  | 1.2253 | 6.2529    | 9.8642         | nan             | 24.088   | 23.0292                |
| resmlp_12_224                   | 128 | 0.7777 | 3.0816    | 4.8951         | 51.3346         | 23.4097  | 23.9559                |
| vit_base_patch16_224            | 64  | 1.0502 | 4.5915    | 7.251          | 90.6647         | 23.171   | 22.4461                |
| nfnet_l0                        | 128 | 1.8696 | 7.2681    | 10.8133        | 147.7768        | 23.0655  | 22.0264                |
| mobilenetv3_large_100           | 128 | 1.8361 | 5.6512    | 13.2961        | 143.4016        | 22.6009  | 21.3248                |
| beit_base_patch16_224           | 64  | 1.3508 | 5.7657    | nan            | nan             | 22.1452  | 20.9086                |
| pit_b_224                       | 64  | 1.0393 | 5.2047    | 8.7011         | 115.7929        | 21.3138  | 20.6412                |
| mobilenetv2_100                 | 128 | 1.7597 | 5.5474    | 13.2139        | 118.8492        | 20.8844  | 20.4956                |
| regnety_002                     | 128 | 1.6529 | 5.5566    | 13.4781        | 114.7787        | 20.2066  | 19.0844                |
| mnasnet_100                     | 128 | 1.7623 | 5.3051    | 13.2946        | 107.9969        | 20.1568  | 19.0956                |
| repvgg_a2                       | 128 | 2.0184 | 6.14      | 15.786         | 194.4974        | 20.056   | 18.7391                |
| gernet_l                        | 128 | 2.0144 | 6.1042    | 15.1918        | 112.7905        | 19.9967  | 18.8629                |
| selecsls42b                     | 128 | 0.8598 | 3.7625    | 6.1409         | 89.4945         | 17.986   | 16.8717                |
| lcnet_050                       | 128 | 1.0622 | 3.3422    | 7.5048         | 82.8372         | 14.701   | 14.0009                |
| ese_vovnet19b_dw                | 128 | 1.0545 | 3.1117    | 6.7053         | 66.845          | 13.9907  | 13.2202                |
| eca_halonext26ts                | 128 | 1.4918 | 4.9843    | 10.6228        | nan             | nan      | nan                    |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| tinynet_a                       | 128 | 0.9889 | 0.7884    | 0.2764         | 0.4726          | 1.3706   | 1.5063                 |
| gmixer_24_224                   | 128 | 0.9923 | 0.9245    | 0.3005         | 0.5803          | 1.3099   | 1.3731                 |
| gmlp_s16_224                    | 128 | 0.9938 | 0.9497    |
0.3532 | 1.3134 | 1.2842 | 1.2998 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.548 | 1.1886 | 1.3558 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.1834 | 1.3102 | | pnasnet5large | 16 | 1.057 | 0.9912 | 0.3632 | nan | 1.1603 | 1.2928 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.1474 | 1.3179 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2669 | 0.4758 | 1.1068 | 1.2643 | | poolformer_m36 | 64 | 0.998 | 0.9431 | 0.3412 | nan | 1.1021 | 1.1166 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0828 | 1.1492 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 1.0592 | 1.1461 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 1.0587 | 1.152 | | convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0528 | 1.1534 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.0378 | 1.1389 | | dm_nfnet_f0 | 128 | 0.9694 | 0.8983 | 0.3557 | 0.4816 | 1.0335 | 1.1297 | | nfnet_l0 | 128 | 0.9885 | 0.8173 | 0.2681 | 0.3766 | 1.0331 | 1.1819 | | beit_base_patch16_224 | 64 | 0.9953 | 0.9328 | nan | nan | 1.0004 | 1.0448 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9907 | 1.228 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.315 | nan | 0.9862 | 1.0421 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9746 | 0.9788 | | twins_pcpvt_base | 64 | 0.9945 | 0.9233 | 0.3401 | nan | 0.9743 | 1.0803 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9621 | 1.0523 | | dla102 | 128 | 0.9694 | 0.9121 | 0.3363 | 0.9313 | 0.9556 | 1.0313 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.9489 | 1.0708 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2877 | nan | 0.9366 | 1.0876 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9932 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.9307 | 1.0268 | | cait_m36_384 | 4 | 0.9999 | 0.9142 | nan | nan | 0.9291 | 0.979 | | convnext_base | 64 | 1.0029 | 0.926 | 0.3509 | nan | 0.9246 | 0.9863 | | ese_vovnet19b_dw | 
128 | 0.9857 | 0.8565 | 0.3272 | 0.8369 | 0.9179 | 1.0682 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.8786 | 0.3675 | nan | 0.9111 | 0.981 | | mixer_b16_224 | 128 | 0.992 | 0.9362 | 0.3444 | 1.1962 | 0.9073 | 0.9799 | | dpn107 | 32 | 0.9969 | 0.9096 | 0.353 | nan | 0.9071 | 0.9971 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8977 | 0.973 | | gluon_xception65 | 32 | 0.9954 | 0.8854 | 0.3348 | nan | 0.8973 | 0.9761 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.8973 | 0.9876 | | inception_v3 | 128 | 0.9823 | 0.8619 | 0.3342 | 0.8577 | 0.8973 | 1.0246 | | gluon_inception_v3 | 128 | 0.9823 | 0.8619 | 0.3342 | 0.8577 | 0.8973 | 1.0246 | | adv_inception_v3 | 128 | 0.9823 | 0.8619 | 0.3342 | 0.8577 | 0.8973 | 1.0246 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.8969 | 1.0032 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8765 | 0.8926 | 0.9897 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 1.222 | 0.8916 | 0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8911 | 0.8962 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8795 | 0.9819 | | res2net50_14w_8s | 128 | 0.9907 | 0.907 | 0.3231 | 0.813 | 0.8769 | 0.9736 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.8719 | 0.9671 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.871 | 0.9804 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2717 | nan | 0.8703 | 1.0094 | | gernet_l | 128 | 0.979 | 0.8501 | 0.3443 | 0.816 | 0.8617 | 0.9854 | | cspdarknet53 | 64 | 0.9913 | 0.8405 | 0.3242 | 0.8382 | 0.8606 | 1.0105 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7522 | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | 0.8188 | 0.8449 | 0.9432 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.8371 | 1.0078 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.8174 | 1.0976 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.159 | 0.8032 | 1.0344 | | repvgg_a2 | 128 | 
0.9768 | 0.7822 | 0.3408 | 0.679 | 0.7908 | 0.9916 | | resmlp_12_224 | 128 | 0.9827 | 0.687 | 0.2373 | 0.7255 | 0.7876 | 0.8011 | | swin_base_patch4_window7_224 | 64 | 0.9969 | 0.9208 | nan | nan | 0.7569 | 0.9259 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5493 | 0.745 | 0.8292 | | jx_nest_base | 32 | 0.9983 | 0.8928 | 0.3398 | nan | 0.6707 | 0.8617 | | eca_halonext26ts | 128 | 0.9886 | 0.7748 | 0.267 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/nA2X5XI.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/Ox4y1OS.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/Gs0phMn.png)

elvinagam commented 1 year ago

@williamwen42 does all of this mean that torchdynamo (with almost all of its backends) does not give you any significant speedup? From the benchmarks, it looks like even nvFuser does not help much in terms of speedup.

williamwen42 commented 1 year ago

We mainly focus on speedups from the inductor/inductor_no_cudagraphs backends. The geometric mean speedup summary tables show significant speedups for these backends.

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) Experimental setup does not have an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
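A minimal sketch of how a per-suite geometric mean speedup could be computed after dropping models that fail accuracy. The function name and the dict layout are hypothetical, and treating any status starting with `pass` (including `pass_due_to_skip`) as passing, and skipping 0.0 speedups, are assumptions rather than the dashboard's confirmed logic.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups over models that passed accuracy.

    speedups: model name -> speedup relative to native pytorch
    accuracy: model name -> status string such as "pass" or "fail_to_run"
    Assumption: statuses starting with "pass" count as passing, and a
    speedup of 0.0 (model did not run) is skipped.
    """
    kept = [
        s
        for name, s in speedups.items()
        if accuracy.get(name, "").startswith("pass") and s > 0
    ]
    if not kept:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow on long lists
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


# A few inductor speedups copied from the torchbench float32 table in this thread
speedups = {"densenet121": 5.2929, "resnet18": 1.8432, "dlrm": 0.9112, "tacotron2": 0.0}
accuracy = {"densenet121": "pass", "resnet18": "pass", "dlrm": "pass", "tacotron2": "fail_to_run"}
print(f"{geomean_speedup(speedups, accuracy):.2f}x")
```

The geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 0.5x slowdown cancel out to 1.0x.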

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 95%, 53/56 | 98%, 42/43  | 98%, 60/61  |
|       aot_eager        | 91%, 51/56 | 95%, 41/43  | 97%, 59/61  |
|     aot_cudagraphs     | 79%, 44/56 | 72%, 31/43  | 46%, 28/61  |
|    nvprims_nvfuser     | 80%, 45/56 | 60%, 26/43  | 67%, 41/61  |
|        inductor        | 86%, 48/56 | 77%, 33/43  | 93%, 57/61  |
| inductor_no_cudagraphs | 91%, 51/56 | 91%, 39/43  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
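As a rough illustration, each cell above can be read as "rounded percentage, passed/total". The helper below is a hypothetical sketch that reproduces that format from a list of accuracy statuses; counting every status that starts with `pass` as a pass is an assumption.

```python
def passrate(statuses):
    """Format a pass rate cell such as '95%, 53/56' from accuracy statuses.

    Assumption: any status starting with "pass" (e.g. "pass_due_to_skip")
    counts as a pass; everything else counts as a failure.
    """
    total = len(statuses)
    if total == 0:
        return "n/a"
    passed = sum(1 for s in statuses if s.startswith("pass"))
    return f"{100 * passed / total:.0f}%, {passed}/{total}"


# 53 passing and 3 failing models, matching the eager/torchbench cell
print(passrate(["pass"] * 53 + ["fail_to_run"] * 3))  # 95%, 53/56
```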

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.02x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.11x    |    1.04x    |    1.00x    |
|    nvprims_nvfuser     |   1.05x    |    1.03x    |    1.14x    |
|        inductor        |   1.49x    |    1.29x    |    1.23x    |
| inductor_no_cudagraphs |   1.22x    |    1.20x    |    1.23x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.85    |    2.33     |    2.06     |
|       aot_eager        |    5.34    |    7.45     |    7.02     |
|     aot_cudagraphs     |    7.29    |    14.25    |    13.27    |
|    nvprims_nvfuser     |   65.48    |   106.37    |   149.42    |
|        inductor        |   30.42    |    34.17    |    37.33    |
| inductor_no_cudagraphs |   30.20    |    27.71    |    35.70    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.90x    |    1.00x    |    0.95x    |
|        inductor        |   0.81x    |    0.66x    |    0.97x    |
| inductor_no_cudagraphs |   0.96x    |    0.88x    |    1.09x    |
+------------------------+------------+-------------+-------------+
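To make "higher is better" concrete, here is a small sketch assuming the ratio is defined as native pytorch peak memory divided by the compiled backend's peak memory. That definition and both function names are assumptions for illustration; the thread itself only states the direction.

```python
def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Assumed definition: eager peak memory / compiled peak memory.

    A value above 1.0 means the compiled run used less peak memory than eager.
    """
    return eager_peak_bytes / compiled_peak_bytes


def memory_multiple(ratio):
    """How many times more peak memory the compiled run used than eager."""
    return 1.0 / ratio


# Under this definition, the 0.39x reported for aot_cudagraphs on torchbench
# corresponds to roughly 2.6 times the eager peak memory.
print(f"{memory_multiple(0.39):.1f}x")
```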

Warnings

We flag models where:

- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings

~~~
+-------------+-----------------------------+----------+------------------------+
| suite       | name                        | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------+----------+------------------------+
| torchbench  | lennard_jones               | 1.7937   | 0.9464                 |
| torchbench  | soft_actor_critic           | 1.4417   | 0.933                  |
| torchbench  | dlrm                        | 0.9112   | 1.1254                 |
| torchbench  | nvidia_deeprecommender      | 0.9034   | 0.9636                 |
| torchbench  | hf_GPT2_large               | 0.0      | 1.4724                 |
| torchbench  | hf_T5                       | 0.0      | 1.5457                 |
| torchbench  | tacotron2                   | 0.0      | 0.9282                 |
| torchbench  | hf_BigBird                  | 0.0      | 0.0                    |
| torchbench  | hf_Longformer               | 0.0      | 0.0                    |
| torchbench  | moco                        | 0.0      | 0.0                    |
| huggingface | PegasusForCausalLM          | 0.9478   | 0.9574                 |
| huggingface | BlenderbotSmallForCausalLM  | 0.932    | 0.9614                 |
| huggingface | ElectraForCausalLM          | 0.0      | 1.3826                 |
| huggingface | BigBird                     | 0.0      | 0.0                    |
| huggingface | AlbertForMaskedLM           | 0.0      | 1.24                   |
| huggingface | AlbertForQuestionAnswering  | 0.0      | 1.2429                 |
| huggingface | BertForQuestionAnswering    | 0.0      | 1.0605                 |
| huggingface | RobertaForQuestionAnswering | 0.0      | 1.0682                 |
| huggingface | LayoutLMForMaskedLM         | 0.0      | 1.1597                 |
| huggingface | AllenaiLongformerBase       | 0.0      | 0.0                    |
| timm_models | resmlp_12_224               | 0.9947   | 0.9434                 |
| timm_models | tnt_s_patch16_224           | 0.0      | 1.4899                 |
+-------------+-----------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+------------+-------------------+----------+------------------------+
| suite      | name              | inductor | inductor_no_cudagraphs |
+------------+-------------------+----------+------------------------+
| torchbench | yolov3            | 364.3458 | 362.9467               |
| torchbench | timm_efficientdet | 127.9162 | 127.4528               |
| torchbench | hf_T5_large       | 121.362  | 117.5599               |
+------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0022                 |
| torchbench  | mobilenet_v3_large                      | 0.8829   | 0.896                  |
| torchbench  | speech_transformer                      | 0.8747   | 0.8779                 |
| torchbench  | hf_T5_large                             | 0.8736   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8603   | 1.031                  |
| torchbench  | resnet50                                | 0.8564   | 0.9343                 |
| torchbench  | densenet121                             | 0.8562   | 1.0006                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | resnext50_32x4d                         | 0.8303   | 0.8352                 |
| torchbench  | hf_Albert                               | 0.7812   | 1.2214                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7609   | 0.9526                 |
| torchbench  | hf_Bart                                 | 0.7549   | 1.0072                 |
| torchbench  | timm_vision_transformer                 | 0.7507   | 0.8214                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8335                 |
| torchbench  | fastNLP_Bert                            | 0.7406   | 1.1229                 |
| torchbench  | BERT_pytorch                            | 0.7067   | 0.9033                 |
| torchbench  | dlrm                                    | 0.7035   | 0.7307                 |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | LearningToPaint                         | 0.6899   | 0.913                  |
| torchbench  | vgg16                                   | 0.6637   | 0.9553                 |
| torchbench  | hf_Bert                                 | 0.6432   | 0.8995                 |
| torchbench  | hf_DistilBert                           | 0.613    | 0.8537                 |
| torchbench  | hf_Reformer                             | 0.577    | 1.0027                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4429   | 0.5961                 |
| torchbench  | pytorch_struct                          | 0.4222   | 0.4335                 |
| torchbench  | functorch_dp_cifar10                    | 0.4061   | 0.4214                 |
| torchbench  | dcgan                                   | 0.2564   | 0.2576                 |
| huggingface | T5Small                                 | 0.8564   | 1.042                  |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.0502                 |
| huggingface | DistillGPT2                             | 0.8171   | 0.9378                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.8962                 |
| huggingface | YituTechConvBert                        | 0.7972   | 0.8792                 |
| huggingface | PegasusForConditionalGeneration         | 0.7893   | 0.9466                 |
| huggingface | M2M100ForConditionalGeneration          | 0.7658   | 0.9595                 |
| huggingface | MT5ForConditionalGeneration             | 0.7622   | 0.8488                 |
| huggingface | GoogleFnet                              | 0.7568   | 0.9682                 |
| huggingface | CamemBert                               | 0.7149   | 0.8698                 |
| huggingface | BartForConditionalGeneration            | 0.6979   | 0.8969                 |
| huggingface | PLBartForCausalLM                       | 0.6852   | 0.806                  |
| huggingface | PegasusForCausalLM                      | 0.6791   | 0.8947                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.6618   | 0.7576                 |
| huggingface | PLBartForConditionalGeneration          | 0.6555   | 0.8258                 |
| huggingface | MegatronBertForQuestionAnswering        | 0.6467   | 0.797                  |
| huggingface | OPTForCausalLM                          | 0.6404   | 0.8245                 |
| huggingface | BartForCausalLM                         | 0.6359   | 0.8919                 |
| huggingface | MBartForConditionalGeneration           | 0.63     | 0.7668                 |
| huggingface | MegatronBertForCausalLM                 | 0.6276   | 0.7821                 |
| huggingface | LayoutLMForSequenceClassification       | 0.6247   | 0.9889                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6148   | 0.8546                 |
| huggingface | TrOCRForCausalLM                        | 0.6094   | 0.7677                 |
| huggingface | MBartForCausalLM                        | 0.6078   | 0.7715                 |
| huggingface | ElectraForQuestionAnswering             | 0.6054   | 0.9848                 |
| huggingface | DistilBertForMaskedLM                   | 0.6017   | 0.8152                 |
| huggingface | DistilBertForQuestionAnswering          | 0.595    | 0.7558                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.5787   | 0.8128                 |
| huggingface | BertForMaskedLM                         | 0.5613   | 0.7534                 |
| huggingface | RobertaForCausalLM                      | 0.5604   | 0.7519                 |
| huggingface | MobileBertForMaskedLM                   | 0.4624   | 0.6                    |
| huggingface | DebertaForMaskedLM                      | 0.3862   | 0.9713                 |
| huggingface | MobileBertForQuestionAnswering          | 0.3725   | 0.4638                 |
| huggingface | DebertaForQuestionAnswering             | 0.2902   | 1.1339                 |
| huggingface | ElectraForCausalLM                      | nan      | 0.8074                 |
| huggingface | BertForQuestionAnswering                | nan      | 0.6814                 |
| huggingface | RobertaForQuestionAnswering             | nan      | 0.6814                 |
| timm_models | selecsls42b                             | 0.899    | 1.0046                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8931   | 0.9946                 |
| timm_models | res2net50_14w_8s                        | 0.8821   | 1.0206                 |
| timm_models | regnety_002                             | 0.8617   | 1.0396                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.9609                 |
| timm_models | swin_base_patch4_window7_224            | 0.8514   | 1.0359                 |
| timm_models | sebotnet33ts_256                        | 0.8365   | 0.965                  |
| timm_models | pit_b_224                               | 0.8169   | 1.0652                 |
| timm_models | resmlp_12_224                           | 0.8029   | 0.811                  |
| timm_models | gernet_l                                | 0.7928   | 0.9926                 |
| timm_models | coat_lite_mini                          | 0.737    | 1.0402                 |
| timm_models | convit_base                             | 0.6848   | 0.8081                 |
| timm_models | crossvit_9_240                          | 0.5717   | 0.7352                 |
| timm_models | repvgg_a2                               | 0.5319   | 0.8171                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.7096                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
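The three warning thresholds can be sketched as a small filter. This is an illustration, not the dashboard's own code: the function names and dict layout are hypothetical, and two behaviours are inferred from the tables above rather than stated, namely that a model is flagged when either inductor variant crosses a threshold and that `nan` entries are ignored.

```python
import math

SPEEDUP_MIN = 0.95        # flag when speedup < 0.95x
COMPILE_SEC_MAX = 120.0   # flag when compilation latency > 120 sec
COMPRESSION_MIN = 0.9     # flag when compression ratio < 0.9


def violates(values, limit, higher_is_better=True):
    """True if any backend's value crosses the limit; NaN (no measurement) is skipped."""
    for v in values:
        if math.isnan(v):
            continue
        if (v < limit) if higher_is_better else (v > limit):
            return True
    return False


def warnings_for(model):
    """Return the list of warning categories a model would appear under."""
    flags = []
    if violates(model["speedup"], SPEEDUP_MIN):
        flags.append("speedup")
    if violates(model["compile_sec"], COMPILE_SEC_MAX, higher_is_better=False):
        flags.append("compilation latency")
    if violates(model["compression"], COMPRESSION_MIN):
        flags.append("compression ratio")
    return flags


# hf_T5_large values (inductor, inductor_no_cudagraphs) as reported in this thread
hf_t5_large = {
    "speedup": [1.643, 1.6124],
    "compile_sec": [121.362, 117.5599],
    "compression": [0.8736, 0.922],
}
print(warnings_for(hf_t5_large))  # ['compilation latency', 'compression ratio']
```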

Metrics over time

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/ksYrDOK.png)

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/UxldbZ8.png)

Accuracy Regressions

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0013 | 1.0229 | 2.3665 | 0.7769 | 5.2929 | 1.2788 | | timm_efficientdet | 1 | 0.9852 | 0.8896 | 1.8285 | 0.7671 | 4.3854 | 1.5184 | | functorch_dp_cifar10 | 64 | 1.0049 | 1.0293 | 2.0102 | 0.0 | 3.6303 | 1.2085 | | timm_vision_transformer | 8 | 1.005 | 0.9485 | 1.5296 | 0.6897 | 2.4966 | 1.4119 | | drq | 1 | 1.0176 | 0.8783 | 1.6845 | 0.7569 | 2.4869 | 1.0684 | | BERT_pytorch | 16 | 1.0084 | 0.8972 | 1.0972 | 0.9861 | 2.0181 | 1.9841 | | resnext50_32x4d | 8 | 1.0019 | 1.116 | 1.2105 | 0.8001 | 2.0025 | 1.2014 | | mobilenet_v3_large | 32 | 1.0056 | 1.1124 | 1.0424 | 0.8675 | 1.9771 | 1.3395 | | dcgan | 32 | 0.988 | 1.0341 | 1.2635 | 0.7967 | 1.8967 | 1.0245 | | resnet18 | 16 | 1.0046 | 1.1171 | 1.1809 | 0.8901 | 1.8432 | 1.1878 | | pytorch_struct | 200 | 0.9942 | 0.7533 | 0.8829 | 0.8138 | 1.8121 | 1.1594 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9975 | 1.0235 | 1.3148 | 0.8614 | 1.8024 | 1.3505 | | lennard_jones | 1000 | 0.9686 | 0.8464 | 1.0504 | 0.7018 | 1.7937 | 0.9464 | | squeezenet1_1 | 32 | 0.9967 | 1.0118 | 1.0354 | 0.8694 | 1.7038 | 1.2515 | | hf_T5_large | 2 | 1.0265 | 0.907 | 0.0 | 0.0 | 1.643 | 1.6124 | | hf_Albert | 8 | 1.0007 | 0.9978 | 0.7535 | 1.562 | 1.6055 | 1.6009 | | shufflenet_v2_x1_0 | 128 | 1.0023 | 1.0611 | 0.8096 | 0.9061 | 1.5271 | 1.3738 | | timm_resnest | 32 | 0.9993 | 1.0016 | 0.8064 | 1.1665 | 1.5211 | 1.4516 | | hf_GPT2 | 4 | 1.0108 | 0.9819 | 0.741 | 0.4043 | 1.4935 | 1.5066 | | timm_nfnet | 128 | 0.9996 | 0.9997 | 0.0 | 1.135 | 1.4748 | 1.421 | | mnasnet1_0 | 32 | 0.9998 | 1.0947 | 0.8538 | 0.9189 | 1.4586 | 1.2742 | 
| soft_actor_critic | 256 | 0.9969 | 0.7979 | 1.0741 | 0.6969 | 1.4417 | 0.933 | | mobilenet_v2 | 96 | 0.9998 | 0.9994 | 0.73 | 1.3372 | 1.427 | 1.4073 | | speech_transformer | 32 | 1.0027 | 0.89 | 1.4164 | 0.774 | 1.4138 | 1.4075 | | mobilenet_v2_quantized_qat | 96 | 1.001 | 0.9798 | 0.0 | 1.4555 | 1.4064 | 1.4115 | | fastNLP_Bert | 6 | 0.9994 | 0.978 | 0.7524 | 1.1969 | 1.3862 | 1.3625 | | resnet50_quantized_qat | 32 | 1.0013 | 0.9715 | 0.0 | 1.1655 | 1.3608 | 1.3644 | | timm_efficientnet | 32 | 0.9533 | 0.8101 | 0.7 | 0.8248 | 1.344 | 1.1845 | | LearningToPaint | 96 | 1.0024 | 1.062 | 0.8941 | 0.968 | 1.2822 | 1.2085 | | pytorch_stargan | 16 | 0.9995 | 1.0768 | 0.9338 | 0.0 | 1.2669 | 1.2297 | | resnet152 | 32 | 1.0009 | 1.0511 | 0.8038 | 0.908 | 1.2482 | 1.2037 | | pytorch_unet | 1 | 0.9998 | 0.9979 | 0.8455 | 1.0899 | 1.2043 | 1.1903 | | resnet50 | 32 | 0.9991 | 0.9926 | 0.761 | 1.0044 | 1.2038 | 1.1689 | | vgg16 | 64 | 0.9999 | 0.999 | 0.8588 | 0.9974 | 1.1709 | 1.1677 | | hf_Bert | 4 | 1.0294 | 1.0023 | 0.7389 | 0.9074 | 1.1626 | 1.1496 | | hf_Bart | 4 | 1.0135 | 0.9783 | 0.7387 | 0.9043 | 1.1623 | 1.1437 | | alexnet | 128 | 0.999 | 0.9984 | 0.8026 | 1.0053 | 1.1612 | 1.1626 | | hf_DistilBert | 8 | 1.0007 | 0.9574 | 0.6886 | 0.5305 | 1.1539 | 1.1597 | | Super_SloMo | 6 | 0.9999 | 0.9979 | 0.8684 | 0.9924 | 1.1363 | 1.1225 | | timm_regnet | 32 | 0.964 | 0.9634 | 0.7799 | 1.0937 | 1.1314 | 1.093 | | hf_Reformer | 4 | 0.9986 | 1.0023 | 0.9891 | 0.7348 | 1.1284 | 1.1377 | | Background_Matting | 4 | 1.0002 | 1.0211 | 0.8662 | 1.0777 | 1.1167 | 1.109 | | yolov3 | 16 | 1.0 | 0.9947 | 0.7901 | 1.1516 | 1.0911 | 1.0778 | | timm_vision_transformer_large | 8 | 0.9998 | 0.9929 | 0.0 | 0.0 | 1.0273 | 1.0142 | | attention_is_all_you_need_pytorch | 256 | 1.0001 | 0.971 | 0.7579 | 0.9563 | 1.0271 | 1.0135 | | timm_vovnet | 32 | 0.9097 | 0.9022 | 0.7141 | 0.9039 | 1.0082 | 1.0183 | | tts_angular | 64 | 0.9901 | 0.9661 | 0.9947 | 0.9794 | 1.0022 | 1.0151 | | demucs | 4 | 1.0 
| 1.0001 | 1.0005 | 0.9997 | 0.9996 | 0.9995 | | dlrm | 2048 | 0.0 | 0.0 | 0.0 | 1.0879 | 0.9112 | 1.1254 | | nvidia_deeprecommender | 256 | 0.999 | 0.9632 | 0.5849 | 0.9767 | 0.9034 | 0.9636 | | hf_GPT2_large | 4 | 0.9999 | 0.9807 | 0.0 | 0.0 | 0.0 | 1.4724 | | hf_T5 | 8 | 0.9997 | 0.9522 | 0.0 | 1.2246 | 0.0 | 1.5457 | | tacotron2 | 64 | 0.9808 | 0.8694 | 0.0 | 0.7616 | 0.0 | 0.9282 | | hf_BigBird | 2 | 0.977 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | resnet152 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass 
| pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | hf_Albert | 2 | pass | 
pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | pass | pass | 0.0000 | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | 0.0000 | fail_accuracy | fail_accuracy | fail_accuracy | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.9107 | 7.0982 | 10.0608 | 112.9253 | 364.3458 | 362.9467 | | timm_efficientdet | 1 | 19.6621 | 33.6804 | 
65.9623 | 488.9412 | 127.9162 | 127.4528 | | hf_T5_large | 2 | 14.297 | 33.9291 | nan | nan | 121.362 | 117.5599 | | timm_vision_transformer_large | 8 | 2.4858 | 11.1322 | nan | nan | 55.0654 | 52.1236 | | attention_is_all_you_need_pytorch | 256 | 1.1807 | 5.6949 | 8.9278 | 132.6141 | 49.2562 | 48.2599 | | resnet152 | 32 | 2.4246 | 10.6802 | 17.7724 | 194.7625 | 44.234 | 41.911 | | densenet121 | 4 | 2.2018 | 9.8707 | 15.7462 | 169.8299 | 43.6037 | 42.7673 | | timm_resnest | 32 | 0.5686 | 2.031 | 3.058 | 60.0291 | 39.8309 | 39.8228 | | timm_vision_transformer | 8 | 0.83 | 3.3531 | 4.8586 | 76.4548 | 32.2226 | 31.7842 | | speech_transformer | 32 | 1.5975 | 6.4661 | 29.0835 | 148.9667 | 32.0724 | 30.8203 | | hf_Bart | 4 | 1.7029 | 6.5345 | 10.2751 | 146.3309 | 31.3254 | 29.7679 | | BERT_pytorch | 16 | 1.5252 | 5.8675 | 8.847 | 98.9777 | 30.0625 | 29.7369 | | fastNLP_Bert | 6 | 1.5563 | 5.3224 | 8.6086 | 93.5337 | 29.4406 | 27.6988 | | timm_nfnet | 128 | 2.0516 | 6.225 | nan | 153.4985 | 29.0839 | 28.2777 | | mobilenet_v2_quantized_qat | 96 | 1.345 | 7.3008 | nan | 186.062 | 28.227 | 28.2695 | | resnet50_quantized_qat | 32 | 1.2022 | 7.0314 | nan | 169.4407 | 27.9002 | 27.7713 | | pytorch_stargan | 16 | 0.4552 | 1.7186 | 2.4703 | nan | 27.3035 | 26.8587 | | timm_regnet | 32 | 2.3365 | 6.5912 | 17.2226 | 113.2603 | 24.164 | 23.0101 | | pytorch_struct | 200 | 0.2488 | 0.631 | 1.1589 | 4.818 | 23.8439 | 23.0116 | | timm_efficientnet | 32 | 1.781 | 5.579 | 13.8295 | 110.1757 | 23.7159 | 22.9086 | | mobilenet_v3_large | 32 | 0.8962 | 3.978 | 5.8706 | 100.7809 | 23.1453 | 22.5435 | | hf_Bert | 4 | 1.5723 | 5.2252 | 7.9668 | 95.9169 | 21.1819 | 20.4505 | | hf_Albert | 8 | 1.2706 | 4.5813 | 7.3559 | 111.4669 | 19.3276 | 18.5039 | | mnasnet1_0 | 32 | 0.8401 | 3.4697 | 5.1942 | 72.7411 | 18.7208 | 18.0467 | | timm_vovnet | 32 | 1.5091 | 3.7934 | 8.9323 | 56.707 | 18.0628 | 17.431 | | hf_Reformer | 4 | 1.5712 | 2.6061 | 4.7763 | 15.8125 | 18.0557 | 15.7056 | | resnet50 | 32 | 
0.8783 | 3.7428 | 5.5954 | 74.8353 | 17.9479 | 17.4912 | | shufflenet_v2_x1_0 | 128 | 0.9526 | 4.1408 | 6.378 | 90.4917 | 17.708 | 17.7362 | | resnext50_32x4d | 8 | 0.879 | 3.722 | 5.5656 | 65.7325 | 17.4539 | 16.7989 | | hf_GPT2 | 4 | 1.5243 | 4.906 | 7.634 | 83.5822 | 17.1253 | 16.4462 | | mobilenet_v2 | 96 | 0.8261 | 3.6839 | 5.9281 | 98.9977 | 16.925 | 16.2656 | | Background_Matting | 4 | 0.7007 | 3.3434 | 5.3298 | 69.4498 | 15.8155 | 14.7694 | | functorch_dp_cifar10 | 64 | 0.3041 | 1.1895 | 1.6906 | nan | 15.3347 | 14.7258 | | Super_SloMo | 6 | 0.8856 | 3.3405 | 4.9467 | 27.6731 | 14.3634 | 13.7667 | | hf_DistilBert | 8 | 0.6724 | 2.6137 | 4.6578 | 45.749 | 13.1865 | 12.9086 | | resnet18 | 16 | 0.4168 | 1.4561 | 2.1759 | 29.2935 | 10.4038 | 10.4693 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4015 | 1.5559 | 2.3456 | 32.3495 | 8.035 | 7.8126 | | pytorch_unet | 1 | 0.3955 | 1.3947 | 2.2157 | 30.9367 | 7.2895 | 7.0841 | | LearningToPaint | 96 | 0.4359 | 1.5517 | 2.5022 | 39.6257 | 7.0144 | 6.647 | | dcgan | 32 | 0.1712 | 0.3604 | 0.572 | 4.4684 | 6.0439 | 5.6897 | | drq | 1 | 0.2963 | 0.4868 | 0.83 | 4.5176 | 3.9997 | 3.2826 | | squeezenet1_1 | 32 | 0.2238 | 0.6457 | 1.0131 | 4.6541 | 3.8587 | 3.6618 | | vgg16 | 64 | 0.1918 | 0.4683 | 0.8375 | 3.1801 | 3.5014 | 3.2011 | | dlrm | 2048 | nan | nan | nan | 2.897 | 3.2927 | 2.8449 | | nvidia_deeprecommender | 256 | 0.2002 | 0.3864 | 0.668 | 4.935 | 3.256 | 3.0751 | | soft_actor_critic | 256 | 0.1966 | 0.293 | 0.4837 | 1.6081 | 3.2468 | 2.7704 | | alexnet | 128 | 0.1472 | 0.3178 | 0.5655 | 3.1009 | 2.94 | 2.6242 | | lennard_jones | 1000 | 0.1394 | 0.2454 | 0.3905 | 1.4321 | 1.969 | 1.8 | | tts_angular | 64 | 0.1743 | 0.2164 | 0.3434 | 1.0886 | 1.883 | 1.6765 | | demucs | 4 | 0.3007 | 0.2986 | 0.3005 | 0.3014 | 0.2076 | 0.2093 | | hf_GPT2_large | 4 | 5.2534 | 15.8371 | nan | nan | nan | 44.8464 | | tacotron2 | 64 | 5.7153 | 15.3677 | nan | 45.9436 | nan | 43.7474 | | hf_T5 | 8 | 2.4786 | 7.8242 | nan | 89.6134 | nan | 
28.3921 | | hf_BigBird | 2 | 3.268 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.582 | 1.582 | | resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.2258 | 1.4877 | 1.4877 | | timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2634 | 0.988 | 1.3107 | 1.3923 | | Super_SloMo | 6 | 1.0023 | 0.9527 | 0.363 | 0.9891 | 1.2025 | 1.4001 | | mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1743 | 1.2832 | | timm_efficientdet | 1 | 1.0111 | 0.823 | 0.2892 | 1.1374 | 1.1162 | 1.1442 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 1.0823 | 1.1864 | | timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.7594 | 1.0219 | 1.0958 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | 0.9682 | 0.9833 | 1.0395 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 0.9821 | 1.0223 | | hf_GPT2 | 4 | 1.0 | 0.906 | 0.3702 | 1.1243 | 0.9703 | 1.1698 | | timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9406 | 1.0831 | | yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8549 | 0.9237 | 1.1052 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9162 | 0.392 | 0.8945 | 0.9183 | 0.9986 | | pytorch_unet | 1 | 0.9985 | 0.8521 | 
0.3441 | 0.8497 | 0.9118 | 1.105 | | resnet152 | 32 | 0.9975 | 0.9153 | 0.3424 | 0.8736 | 0.9068 | 0.9672 | | pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 | | timm_resnest | 32 | 0.9935 | 0.88 | 0.3235 | 0.7926 | 0.8982 | 1.0022 | | mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8098 | 0.8829 | 0.896 | | speech_transformer | 32 | 0.9982 | 0.9772 | 0.2737 | 1.1201 | 0.8747 | 0.8779 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8736 | 0.922 | | timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8603 | 1.031 | | resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8564 | 0.9343 | | densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.8562 | 1.0006 | | mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8531 | 0.8659 | | resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8196 | 0.8303 | 0.8352 | | hf_Albert | 8 | 1.0 | 0.949 | 0.2846 | 1.062 | 0.7812 | 1.2214 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 | | timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7609 | 0.9526 | | hf_Bart | 4 | 1.0 | 0.8777 | 0.3386 | 1.0858 | 0.7549 | 1.0072 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3305 | 1.0652 | 0.7507 | 0.8214 | | soft_actor_critic | 256 | 0.9998 | 0.9638 | 0.4356 | 0.9637 | 0.75 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8335 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 1.2131 | 0.7406 | 1.1229 | | BERT_pytorch | 16 | 1.0 | 0.8995 | 0.3504 | 1.1272 | 0.7067 | 0.9033 | | dlrm | 2048 | nan | nan | nan | 0.7306 | 0.7035 | 0.7307 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6902 | 0.7049 | | LearningToPaint | 96 | 0.9453 | 0.6896 | 0.3385 | 0.6507 | 0.6899 | 0.913 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9553 | | hf_Bert | 4 | 1.0 | 0.9011 | 0.3525 | 1.0014 | 0.6432 | 0.8995 | | hf_DistilBert | 8 | 1.0 | 0.9042 | 0.3212 | 1.0239 | 0.613 | 0.8537 | | hf_Reformer | 4 | 0.9996 | 0.9996 | 0.5934 | 0.9999 
| 0.577 | 1.0027 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9678 | 0.4429 | 0.5961 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4222 | 0.4335 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4061 | 0.4214 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.2564 | 0.2576 | | hf_GPT2_large | 4 | 1.0 | 0.8833 | nan | nan | nan | 1.1831 | | tacotron2 | 64 | 0.9903 | 1.0926 | nan | 1.0841 | nan | 1.1613 | | hf_T5 | 8 | 1.0 | 0.9415 | nan | 0.9326 | nan | 1.1462 | | hf_BigBird | 2 | 0.907 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | dlrm | 2048 | nan | nan | nan | 486.4839 | 551.0126 | 503.0938 | | timm_vision_transformer_large | 8 | 196.2774 | 199.3173 | nan | nan | 192.1 | 194.3049 | | Background_Matting | 4 | 186.4067 | 182.6159 | 215.2651 | 172.9209 | 166.9128 | 168.1331 | | timm_nfnet | 128 | 206.066 | 205.6844 | nan | 181.3978 | 139.6883 | 144.9258 | | hf_T5_large | 2 | 190.7897 | 213.0947 | nan | nan | 120.75 | 123.1824 | | mobilenet_v2_quantized_qat | 96 | 146.9127 | 150.1894 | nan | 101.32 | 104.8786 | 104.8604 | | Super_SloMo | 6 | 117.5904 | 117.7192 | 135.5528 | 118.3334 | 103.4078 | 104.7123 
| | yolov3 | 16 | 102.5481 | 102.7462 | 129.3718 | 88.9994 | 93.7598 | 94.7122 | | vgg16 | 64 | 106.1617 | 106.4114 | 123.8387 | 106.6202 | 90.6781 | 91.0754 | | timm_regnet | 32 | 101.2241 | 101.156 | 124.9717 | 89.4718 | 86.7435 | 89.229 | | demucs | 4 | 77.0879 | 77.863 | 77.689 | 77.6528 | 77.5546 | 77.744 | | hf_Reformer | 4 | 83.3121 | 82.9382 | 83.9987 | 113.0771 | 73.6591 | 73.1421 | | resnet152 | 32 | 89.5283 | 85.4059 | 113.0749 | 100.517 | 73.5558 | 76.2466 | | attention_is_all_you_need_pytorch | 256 | 72.0218 | 74.0025 | 95.382 | 75.1056 | 70.3952 | 71.1974 | | resnet50_quantized_qat | 32 | 92.3715 | 95.9321 | nan | 79.9167 | 68.3471 | 68.897 | | mobilenet_v2 | 96 | 71.2605 | 71.384 | 97.6972 | 53.3249 | 49.9591 | 50.6691 | | pytorch_unet | 1 | 58.4382 | 58.6043 | 69.1647 | 53.6624 | 48.6381 | 49.1705 | | hf_Bart | 4 | 54.6438 | 56.0997 | 74.6911 | 60.3237 | 47.7825 | 47.8275 | | hf_Albert | 8 | 75.0631 | 75.225 | 99.9165 | 48.0695 | 46.8226 | 46.8022 | | fastNLP_Bert | 6 | 59.6468 | 60.7416 | 79.3066 | 49.6037 | 43.0752 | 43.7109 | | timm_vovnet | 32 | 42.335 | 42.702 | 53.9808 | 42.5729 | 38.1823 | 37.7631 | | speech_transformer | 32 | 50.4164 | 54.1633 | 37.9344 | 62.0 | 35.6432 | 35.763 | | timm_efficientdet | 1 | 138.4806 | 164.288 | 75.7453 | 176.9796 | 34.2793 | 92.1136 | | hf_DistilBert | 8 | 38.8879 | 40.6227 | 56.5827 | 73.1965 | 33.8148 | 33.7716 | | hf_GPT2 | 4 | 49.8131 | 51.0628 | 68.1047 | 124.4498 | 33.696 | 33.4731 | | hf_Bert | 4 | 37.9753 | 39.0395 | 53.3307 | 42.6042 | 33.5386 | 34.0222 | | resnet50 | 32 | 38.6505 | 38.9276 | 50.803 | 38.523 | 32.1214 | 33.0118 | | timm_efficientnet | 32 | 43.9782 | 51.7429 | 60.7669 | 51.6423 | 32.0986 | 36.6316 | | shufflenet_v2_x1_0 | 128 | 36.7626 | 35.1158 | 45.8507 | 40.8413 | 24.428 | 28.5348 | | BERT_pytorch | 16 | 45.0172 | 51.4908 | 42.0631 | 46.2652 | 23.984 | 24.0239 | | timm_resnest | 32 | 31.5654 | 31.4648 | 39.3393 | 27.1078 | 20.8108 | 21.7128 | | mnasnet1_0 | 32 | 28.4737 | 25.7451 | 
33.1146 | 31.1779 | 19.4488 | 22.5438 | | pytorch_stargan | 16 | 24.2382 | 22.4637 | 25.9314 | nan | 19.0792 | 19.6769 | | mobilenet_v3_large | 32 | 30.7183 | 28.3707 | 30.4859 | 36.3985 | 16.1187 | 23.82 | | resnext50_32x4d | 8 | 26.4681 | 23.9435 | 22.1402 | 33.1498 | 13.4098 | 22.8509 | | densenet121 | 4 | 63.6023 | 63.5041 | 27.7115 | 83.8714 | 12.7718 | 53.2348 | | LearningToPaint | 96 | 15.5896 | 14.6791 | 18.4243 | 17.1167 | 12.2364 | 12.9973 | | alexnet | 128 | 12.413 | 12.4129 | 15.474 | 12.3492 | 10.6724 | 10.6672 | | timm_vision_transformer | 8 | 23.5317 | 25.0985 | 15.6413 | 38.0778 | 9.7784 | 17.4842 | | nvidia_deeprecommender | 256 | 8.5338 | 8.846 | 14.5675 | 8.7335 | 9.4152 | 8.8409 | | tts_angular | 64 | 9.2489 | 9.6623 | 9.522 | 9.7759 | 9.2478 | 9.11 | | pytorch_CycleGAN_and_pix2pix | 1 | 16.1503 | 16.2215 | 12.6314 | 19.2179 | 9.2082 | 12.1592 | | squeezenet1_1 | 32 | 12.4125 | 12.51 | 12.0563 | 14.821 | 7.3277 | 10.3422 | | resnet18 | 16 | 11.7399 | 10.71 | 10.2757 | 13.6939 | 6.5972 | 10.9408 | | functorch_dp_cifar10 | 64 | 11.2591 | 11.8989 | 5.6965 | nan | 3.2191 | 9.4904 | | pytorch_struct | 200 | 3.8939 | 4.9704 | 4.2492 | 5.5757 | 2.0903 | 3.3302 | | dcgan | 32 | 2.6428 | 2.5698 | 2.0953 | 3.6495 | 1.3576 | 2.5433 | | drq | 1 | 2.8392 | 3.3885 | 1.7247 | 5.0033 | 1.1989 | 2.8944 | | soft_actor_critic | 256 | 1.003 | 1.2688 | 0.9462 | 1.4955 | 0.7325 | 1.131 | | lennard_jones | 1000 | 1.0644 | 1.2146 | 1.0495 | 1.6231 | 0.6087 | 1.226 | | tacotron2 | 64 | 2710.6353 | 3098.9023 | nan | 3555.5295 | nan | 3015.1642 | | hf_GPT2_large | 4 | 240.6295 | 245.4091 | nan | nan | nan | 163.4475 | | hf_T5 | 8 | 183.1172 | 191.9905 | nan | 149.1899 | nan | 118.2933 | | hf_BigBird | 2 | 185.7023 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name                                    | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert                        | 1   | 1.0276 | 0.9587    | 1.7767         | 0.7725          | 3.2938   | 1.4518                 |
| CamemBert                               | 1   | 1.0511 | 0.9681    | 1.339          | 0.7631          | 2.4047   | 1.5211                 |
| DistillGPT2                             | 1   | 1.0259 | 0.96      | 1.0666         | 0.0             | 2.1658   | 1.8109                 |
| MT5ForConditionalGeneration             | 8   | 1.0239 | 0.9368    | 1.2197         | 1.0061          | 2.1609   | 2.2343                 |
| MobileBertForMaskedLM                   | 32  | 1.0253 | 0.9501    | 1.1343         | 0.0             | 2.0247   | 1.5586                 |
| GoogleFnet                              | 1   | 0.9774 | 0.808     | 0.9879         | 0.0             | 1.877    | 1.1454                 |
| GPT2ForSequenceClassification           | 4   | 1.0004 | 0.978     | 0.0            | 0.7154          | 1.8027   | 1.7798                 |
| T5ForConditionalGeneration              | 4   | 1.0007 | 0.937     | 0.7267         | 1.1498          | 1.4134   | 1.4085                 |
| MobileBertForQuestionAnswering          | 64  | 1.0249 | 0.9674    | 0.8769         | 0.0             | 1.3976   | 1.4481                 |
| M2M100ForConditionalGeneration          | 8   | 1.0079 | 0.952     | 0.8322         | 0.7854          | 1.3725   | 1.271                  |
| ElectraForQuestionAnswering             | 64  | 1.0005 | 0.986     | 0.0            | 1.2445          | 1.3654   | 1.3423                 |
| LayoutLMForSequenceClassification       | 16  | 0.9999 | 0.9907    | 0.738          | 1.1488          | 1.2639   | 1.2448                 |
| T5Small                                 | 1   | 1.0212 | 0.9251    | 1.0353         | 1.0124          | 1.2525   | 1.1614                 |
| PLBartForConditionalGeneration          | 16  | 1.0152 | 0.9644    | 0.8151         | 0.825           | 1.2494   | 1.1428                 |
| MegatronBertForQuestionAnswering        | 16  | 1.038  | 1.0211    | 0.7651         | 0.8774          | 1.1612   | 1.085                  |
| OPTForCausalLM                          | 32  | 1.0016 | 0.933     | 0.7163         | 0.4585          | 1.1485   | 1.1812                 |
| XGLMForCausalLM                         | 8   | 1.0126 | 0.9457    | 0.739          | 0.3226          | 1.1372   | 1.2174                 |
| DistilBertForQuestionAnswering          | 64  | 0.9996 | 0.9859    | 0.7131         | 0.5098          | 1.1269   | 1.105                  |
| RobertaForCausalLM                      | 64  | 1.0006 | 0.9542    | 0.7461         | 0.9759          | 1.1026   | 1.1075                 |
| MegatronBertForCausalLM                 | 16  | 1.0345 | 1.0075    | 0.7449         | 0.9615          | 1.1022   | 1.1695                 |
| DebertaForMaskedLM                      | 4   | 0.9131 | 0.8101    | 0.7294         | 0.6364          | 1.0618   | 1.0299                 |
| MBartForConditionalGeneration           | 16  | 1.0108 | 0.9849    | 0.7595         | 0.0             | 1.0558   | 1.0533                 |
| BartForConditionalGeneration            | 2   | 1.0007 | 0.9883    | 0.0            | 0.449           | 1.0528   | 1.0451                 |
| PegasusForConditionalGeneration         | 16  | 1.011  | 0.9858    | 0.8113         | 0.8938          | 1.0493   | 1.0382                 |
| DebertaForQuestionAnswering             | 8   | 0.9969 | 0.9775    | 0.6835         | 0.8673          | 1.0417   | 1.2069                 |
| Speech2Text2ForCausalLM                 | 128 | 0.9986 | 0.9281    | 0.6619         | 0.9346          | 1.0372   | 1.0618                 |
| BartForCausalLM                         | 4   | 1.0002 | 0.9674    | 0.7552         | 0.9953          | 1.0214   | 1.0304                 |
| BertForMaskedLM                         | 64  | 1.0002 | 0.9632    | 0.7307         | 0.972           | 1.0109   | 1.0108                 |
| DistilBertForMaskedLM                   | 64  | 1.0003 | 0.9525    | 0.7076         | 0.6318          | 1.0009   | 1.0141                 |
| BlenderbotSmallForConditionalGeneration | 64  | 1.0009 | 0.9233    | 0.0            | 0.9488          | 0.9982   | 1.0021                 |
| PLBartForCausalLM                       | 32  | 1.0059 | 0.9351    | 0.7155         | 0.9148          | 0.963    | 1.0239                 |
| TrOCRForCausalLM                        | 32  | 1.0013 | 0.9476    | 0.7356         | 0.9473          | 0.9561   | 0.966                  |
| MBartForCausalLM                        | 32  | 1.0013 | 0.9512    | 0.7335         | 0.0             | 0.9543   | 0.9628                 |
| PegasusForCausalLM                      | 32  | 0.9997 | 0.9547    | 0.7327         | 0.9435          | 0.9478   | 0.9574                 |
| BlenderbotSmallForCausalLM              | 64  | 1.0012 | 0.9114    | 0.6832         | 0.9151          | 0.932    | 0.9614                 |
| ElectraForCausalLM                      | 32  | 1.0005 | 0.9335    | 0.0            | 1.0371          | 0.0      | 1.3826                 |
| BigBird                                 | 1   | 0.9792 | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
| AlbertForMaskedLM                       | 4   | 1.0004 | 1.0001    | 0.0            | 1.2253          | 0.0      | 1.24                   |
| AlbertForQuestionAnswering              | 4   | 1.0    | 1.0006    | 0.0            | 1.2328          | 0.0      | 1.2429                 |
| BertForQuestionAnswering                | 128 | 1.0004 | 0.9949    | 0.0            | 1.026           | 0.0      | 1.0605                 |
| RobertaForQuestionAnswering             | 128 | 1.0002 | 0.9923    | 0.0            | 1.027           | 0.0      | 1.0682                 |
| LayoutLMForMaskedLM                     | 16  | 1.0004 | 0.972     | 0.0            | 1.0852          | 0.0      | 1.1597                 |
| AllenaiLongformerBase                   | 0   | 0.0    | 0.0       | 0.0            | 0.0             | 0.0      | 0.0                    |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass | | GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | 
GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | T5Small | 1 | pass | pass | pass | pass | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | CamemBert | 1 | pass | pass | pass | pass | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | MBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run | | BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | 
fail_to_run | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaForQuestionAnswering | 8 | 4.6919 | 9.8473 | 34.1024 | 76.9959 | 96.7185 | 36.5651 | | DebertaForMaskedLM | 4 | 4.6241 | 9.6283 | 34.1195 | 79.9897 | 96.4397 | 35.8467 | | XGLMForCausalLM | 8 | 2.5712 | 10.0732 | 21.9631 | 235.0217 | 71.405 | 68.2585 | | M2M100ForConditionalGeneration | 8 | 3.4756 | 12.1962 | 21.7824 | 356.6119 | 59.934 | 59.323 | | MobileBertForMaskedLM | 32 | 8.7264 | 23.4332 | 40.5337 | nan | 58.3241 | 56.9646 | | MobileBertForQuestionAnswering | 64 | 8.6221 | 24.173 | 40.9278 | nan | 57.2579 | 55.5437 | | BartForConditionalGeneration | 2 | 3.2328 | 12.5294 | nan | 333.4192 | 49.5492 | 48.061 | | PegasusForConditionalGeneration | 16 | 3.0325 | 12.0537 | 20.9775 | 352.5315 | 49.2027 | 44.381 | | MBartForConditionalGeneration | 16 | 3.2015 | 12.6769 | 21.2127 | nan | 48.4798 | 47.2045 | | YituTechConvBert | 1 | 2.4416 | 8.2143 | 12.5291 | 145.8498 | 42.1774 | 38.3249 | | MegatronBertForCausalLM | 16 | 3.2544 | 10.9636 | 17.0283 | 241.6357 | 37.8445 | 37.3099 | | MegatronBertForQuestionAnswering | 16 | 3.2977 | 11.0842 | 16.9136 | 236.0639 | 37.3836 | 36.4092 | | MT5ForConditionalGeneration | 8 | 3.7853 | 10.9973 | 17.8775 | 141.8349 | 37.2377 | 35.7628 | | BlenderbotSmallForConditionalGeneration | 64 | 2.0509 | 8.3814 | nan | 203.3923 | 33.1136 | 32.2474 | | T5ForConditionalGeneration | 4 | 2.4583 | 8.2286 | 11.5841 | 90.0984 | 31.897 | 30.6294 | | 
PLBartForConditionalGeneration | 16 | 1.6325 | 6.6014 | 9.9247 | 137.2841 | 29.3772 | 27.4926 | | LayoutLMForSequenceClassification | 16 | 1.9719 | 5.6764 | 9.2329 | 104.4775 | 29.2151 | 27.7282 | | T5Small | 1 | 2.4776 | 7.8343 | 11.3346 | 91.7968 | 29.1963 | 28.1769 | | ElectraForQuestionAnswering | 64 | 1.6288 | 5.3211 | nan | 97.9668 | 22.5828 | 21.0286 | | BertForMaskedLM | 64 | 1.5753 | 5.2502 | 8.0628 | 106.3193 | 22.3645 | 21.3926 | | GoogleFnet | 1 | 0.9597 | 2.8124 | 10.6021 | nan | 22.1689 | 16.7235 | | PegasusForCausalLM | 32 | 1.2587 | 4.8561 | 7.8155 | 96.3347 | 22.1675 | 20.73 | | MBartForCausalLM | 32 | 1.198 | 4.7525 | 7.22 | nan | 22.0675 | 21.0022 | | RobertaForCausalLM | 64 | 1.6515 | 5.5613 | 8.2212 | 103.4405 | 21.8041 | 20.7044 | | TrOCRForCausalLM | 32 | 1.1817 | 4.6279 | 7.6706 | 97.3523 | 20.8181 | 19.7006 | | BartForCausalLM | 4 | 1.2317 | 4.7478 | 7.4671 | 98.7987 | 19.9214 | 19.2078 | | OPTForCausalLM | 32 | 1.2678 | 4.776 | 8.8539 | 96.3646 | 19.8811 | 18.4488 | | CamemBert | 1 | 1.6418 | 5.3422 | 7.5725 | 106.0235 | 18.9871 | 18.5212 | | GPT2ForSequenceClassification | 4 | 1.601 | 5.0188 | nan | 82.8827 | 17.0815 | 15.8423 | | BlenderbotSmallForCausalLM | 64 | 0.832 | 3.2572 | 4.986 | 61.9287 | 14.6022 | 14.1533 | | Speech2Text2ForCausalLM | 128 | 0.7469 | 2.6703 | 4.3772 | 44.5615 | 14.2606 | 13.2097 | | PLBartForCausalLM | 32 | 0.6892 | 2.6349 | 3.8528 | 52.058 | 13.6728 | 12.9897 | | DistilBertForMaskedLM | 64 | 0.6656 | 2.6066 | 4.6481 | 48.4498 | 13.1084 | 12.4306 | | DistillGPT2 | 1 | 0.8425 | 2.778 | 3.7276 | nan | 12.9354 | 12.1011 | | DistilBertForQuestionAnswering | 64 | 0.6644 | 2.7418 | 4.5614 | 44.0749 | 12.3811 | 11.7256 | | ElectraForCausalLM | 32 | 1.6006 | 5.3387 | nan | 98.8619 | nan | 25.6533 | | LayoutLMForMaskedLM | 16 | 1.8661 | 5.8042 | nan | 102.2133 | nan | 21.4648 | | BertForQuestionAnswering | 128 | 1.615 | 5.3028 | nan | 101.4052 | nan | 20.1982 | | RobertaForQuestionAnswering | 128 | 1.61 | 5.3934 | nan | 
100.7544 | nan | 19.5396 | | AlbertForMaskedLM | 4 | 1.2564 | 4.5753 | nan | 116.3919 | nan | 16.3381 | | AlbertForQuestionAnswering | 4 | 1.362 | 4.7062 | nan | 112.5085 | nan | 16.145 | | BigBird | 1 | 3.3752 | nan | nan | nan | nan | nan | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 1.0 | 0.9092 | nan | 1.1726 | 1.0595 | 1.1588 | | T5Small | 1 | 1.0 | 0.9155 | 0.3432 | 0.9247 | 0.8564 | 1.042 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.972 | 0.8215 | 1.0502 | | DistillGPT2 | 1 | 0.9986 | 0.8218 | 0.3793 | nan | 0.8171 | 0.9378 | | XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | 0.9891 | 0.8157 | 0.8962 | | YituTechConvBert | 1 | 0.9858 | 0.8616 | 0.3686 | 0.9032 | 0.7972 | 0.8792 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | 1.0877 | 0.7893 | 0.9466 | | M2M100ForConditionalGeneration | 8 | 0.9687 | 0.9611 | 0.3674 | 0.9822 | 0.7658 | 0.9595 | | MT5ForConditionalGeneration | 8 | 1.0034 | 0.8867 | 0.415 | 0.9326 | 0.7622 | 0.8488 | | GoogleFnet | 1 | 0.9629 | 0.9629 | 0.3852 | nan | 0.7568 | 0.9682 | | CamemBert | 1 | 0.998 | 0.8248 | 0.3615 | 0.8613 | 0.7149 | 0.8698 | | BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.6979 | 0.8969 | | PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.9443 | 0.6852 | 0.806 | | PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | 0.9964 | 0.6791 | 
0.8947 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.6618 | 0.7576 | | PLBartForConditionalGeneration | 16 | 1.0 | 0.8954 | 0.3584 | 1.0143 | 0.6555 | 0.8258 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | 0.9908 | 0.6467 | 0.797 | | OPTForCausalLM | 32 | 0.9999 | 0.8655 | 0.3605 | 0.9158 | 0.6404 | 0.8245 | | BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.6359 | 0.8919 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | 0.3438 | nan | 0.63 | 0.7668 | | MegatronBertForCausalLM | 16 | 1.0 | 0.8826 | 0.352 | 0.9985 | 0.6276 | 0.7821 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.6247 | 0.9889 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 1.0067 | 0.6148 | 0.8546 | | TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | 0.3743 | 0.9997 | 0.6094 | 0.7677 | | MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3743 | nan | 0.6078 | 0.7715 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 1.1607 | 0.6054 | 0.9848 | | DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.888 | 0.6017 | 0.8152 | | DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3177 | 1.1317 | 0.595 | 0.7558 | | Speech2Text2ForCausalLM | 128 | 0.9552 | 0.842 | 0.3524 | 0.908 | 0.5787 | 0.8128 | | BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.9904 | 0.5613 | 0.7534 | | RobertaForCausalLM | 64 | 1.0 | 0.9206 | 0.3642 | 0.989 | 0.5604 | 0.7519 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | 0.3242 | nan | 0.4624 | 0.6 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | 0.9719 | 0.3862 | 0.9713 | | MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | 0.2587 | nan | 0.3725 | 0.4638 | | DebertaForQuestionAnswering | 8 | 0.9637 | 1.042 | 0.3072 | 1.1342 | 0.2902 | 1.1339 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | nan | 1.2564 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | nan | 1.2385 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | nan | 0.9207 | 
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.8429 | nan | 0.8074 | | BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | nan | 0.6814 | | RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | nan | 0.6814 | | BigBird | 1 | 0.9548 | nan | nan | nan | nan | nan | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | BartForConditionalGeneration | 2 | 149.7237 | 151.7676 | nan | 333.8761 | 142.4145 | 143.4966 | | BartForCausalLM | 4 | 123.3486 | 127.5014 | 163.4677 | 123.893 | 120.8657 | 119.5708 | | BlenderbotSmallForConditionalGeneration | 64 | 118.7832 | 128.8407 | nan | 125.5327 | 119.44 | 118.8827 | | PegasusForConditionalGeneration | 16 | 103.8102 | 105.6673 | 138.9564 | 115.9917 | 100.2623 | 100.5916 | | BertForMaskedLM | 64 | 100.9422 | 104.695 | 138.4714 | 103.9221 | 100.2344 | 99.9557 | | MBartForConditionalGeneration | 16 | 104.5121 | 107.1083 | 138.5967 | nan | 100.1899 | 100.659 | | MobileBertForQuestionAnswering | 64 | 127.5753 | 153.7274 | 155.1693 | nan | 99.7845 | 117.7768 | | RobertaForCausalLM | 64 | 109.2157 | 114.4832 | 146.5663 | 111.8208 | 99.2556 | 98.5883 | | ElectraForQuestionAnswering | 64 | 124.7356 | 126.6089 | nan | 100.205 | 91.9994 | 93.0002 | | PegasusForCausalLM | 32 | 85.4333 | 89.405 | 116.9449 | 90.5692 | 90.5555 | 89.2716 | | MBartForCausalLM | 32 | 85.9254 | 90.2866 | 117.3984 | nan | 90.357 | 89.1456 | | TrOCRForCausalLM | 32 | 
85.9871 | 90.6682 | 117.6216 | 90.6685 | 90.159 | 88.9173 | | LayoutLMForSequenceClassification | 16 | 113.1048 | 114.29 | 153.5057 | 98.5792 | 89.8475 | 90.9712 | | DebertaForQuestionAnswering | 8 | 82.1493 | 83.7238 | 120.0699 | 94.2542 | 79.0337 | 67.8111 | | T5ForConditionalGeneration | 4 | 104.1232 | 111.485 | 143.5514 | 90.2589 | 73.6926 | 73.6555 | | MegatronBertForCausalLM | 16 | 78.0855 | 80.3627 | 108.4846 | 83.4267 | 73.5004 | 74.1509 | | XGLMForCausalLM | 8 | 79.2275 | 85.3835 | 108.9926 | 247.6184 | 71.0239 | 70.6073 | | MobileBertForMaskedLM | 32 | 127.1411 | 143.7185 | 117.094 | nan | 69.7427 | 89.8827 | | BlenderbotSmallForCausalLM | 64 | 64.7355 | 70.9133 | 94.5627 | 70.6954 | 69.4657 | 67.2495 | | MegatronBertForQuestionAnswering | 16 | 72.1894 | 74.0267 | 98.5954 | 84.7 | 67.5286 | 68.8309 | | M2M100ForConditionalGeneration | 8 | 85.9235 | 90.3719 | 101.8318 | 108.6385 | 64.3928 | 71.1554 | | DistilBertForMaskedLM | 64 | 63.2847 | 66.4508 | 89.6764 | 100.2313 | 63.4438 | 62.4474 | | GPT2ForSequenceClassification | 4 | 102.3437 | 104.4943 | nan | 142.8318 | 57.1507 | 57.3366 | | DebertaForMaskedLM | 4 | 65.403 | 73.0307 | 83.2591 | 92.7686 | 56.372 | 58.3184 | | OPTForCausalLM | 32 | 61.8919 | 66.4622 | 87.0769 | 135.7678 | 54.0268 | 53.2159 | | T5Small | 1 | 55.6801 | 67.9493 | 57.0683 | 55.6533 | 46.1819 | 49.9212 | | PLBartForConditionalGeneration | 16 | 48.8772 | 51.3558 | 60.9075 | 59.5711 | 43.0422 | 43.2389 | | PLBartForCausalLM | 32 | 41.1408 | 44.1927 | 57.9336 | 44.8984 | 42.7601 | 41.7854 | | MT5ForConditionalGeneration | 8 | 75.4532 | 82.0365 | 63.432 | 73.9233 | 36.2563 | 41.6505 | | DistilBertForQuestionAnswering | 64 | 39.7725 | 40.2815 | 55.7904 | 77.993 | 35.3766 | 35.9562 | | Speech2Text2ForCausalLM | 128 | 35.1736 | 37.9462 | 53.3276 | 38.4826 | 34.0958 | 33.3368 | | YituTechConvBert | 1 | 48.9351 | 66.7087 | 28.2815 | 66.0346 | 16.269 | 37.462 | | CamemBert | 1 | 29.979 | 32.0871 | 23.0917 | 39.9523 | 13.8603 | 21.6441 | | 
GoogleFnet | 1 | 19.1486 | 22.9075 | 19.0226 | nan | 11.4206 | 17.1398 | | DistillGPT2 | 1 | 17.2193 | 18.696 | 16.5123 | nan | 8.8886 | 10.0909 | | AlbertForMaskedLM | 4 | 385.2371 | 384.0606 | nan | 314.6874 | nan | 311.7989 | | AlbertForQuestionAnswering | 4 | 382.9384 | 380.8459 | nan | 310.2507 | nan | 308.3997 | | BertForQuestionAnswering | 128 | 147.528 | 148.2786 | nan | 143.8111 | nan | 139.5803 | | RobertaForQuestionAnswering | 128 | 148.1055 | 149.352 | nan | 144.1785 | nan | 139.0858 | | LayoutLMForMaskedLM | 16 | 136.7715 | 140.6795 | nan | 126.1695 | nan | 117.9552 | | ElectraForCausalLM | 32 | 105.6598 | 113.19 | nan | 101.9126 | nan | 76.4966 | | BigBird | 1 | 183.5998 | nan | nan | nan | nan | nan | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
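For anyone who wants to summarize a speedup table like the ones in this comment, here is a minimal sketch of one way to do it. It is not the dashboard's own code. The helper names `parse_pretty_table` and `geomean_speedup` are hypothetical, the geometric mean is just one plausible summary statistic, and treating `0.0` and `nan` entries as failed runs to be skipped is an assumption based on how those values line up with `fail_to_run` in the accuracy tables. The sample rows are copied from the huggingface speedup table, with only two backend columns kept.

```python
import math


def parse_pretty_table(text):
    """Parse a '+---+' bordered ASCII table into a list of row dicts.

    Border lines start with '+' and are ignored; the first '|' line is
    taken as the header.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(rows, backend):
    """Geometric mean of the positive speedups in one backend column.

    Assumption: 0.0 and nan mark runs that failed, so they are skipped
    (nan > 0 is False, so the same check covers both).
    """
    values = [float(row[backend]) for row in rows]
    values = [v for v in values if v > 0]
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Three rows copied from the huggingface speedup table (columns trimmed).
sample = """
+--------------------+----+--------+----------+
| name               | bs | eager  | inductor |
+--------------------+----+--------+----------+
| YituTechConvBert   | 1  | 1.0276 | 3.2938   |
| BertForMaskedLM    | 64 | 1.0002 | 1.0109   |
| ElectraForCausalLM | 32 | 1.0005 | 0.0      |
+--------------------+----+--------+----------+
"""

rows = parse_pretty_table(sample)
print(geomean_speedup(rows, "inductor"))
```

Skipping failed runs makes a backend with many failures look better than it is, so it is worth reporting the number of skipped rows next to any such mean.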

timm_models suite with float32 precision

Performance speedup

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9996 | 0.9734 | 0.8293 | 1.2485 | 1.8685 | 1.8264 |
| lcnet_050 | 128 | 0.954 | 0.9508 | 0.7697 | 1.3599 | 1.6606 | 1.6287 |
| regnety_002 | 128 | 0.9773 | 1.0012 | 0.9022 | 0.9696 | 1.4911 | 1.3232 |
| dm_nfnet_f0 | 128 | 0.9993 | 0.9997 | 0.0 | 1.1272 | 1.472 | 1.4219 |
| hrnet_w18 | 128 | 0.9999 | 0.9988 | 0.0 | 1.2755 | 1.4171 | 1.3773 |
| dla102 | 128 | 0.9999 | 0.9989 | 0.0 | 1.2834 | 1.3828 | 1.3686 |
| xcit_large_24_p8_224 | 5 | 1.0022 | 0.9964 | 0.7857 | 0.0 | 1.3739 | 1.3705 |
| nfnet_l0 | 128 | 0.9994 | 0.7888 | 0.0 | 1.1014 | 1.369 | 1.3184 |
| res2net50_14w_8s | 128 | 0.9999 | 0.9991 | 0.0 | 1.2513 | 1.3564 | 1.3243 |
| volo_d1_224 | 64 | 0.9999 | 0.9955 | 0.8006 | 0.0 | 1.3517 | 1.328 |
| mobilenetv2_100 | 128 | 0.9667 | 0.9642 | 0.7052 | 1.2844 | 1.3377 | 1.3566 |
| mobilenetv3_large_100 | 128 | 0.9659 | 0.9628 | 0.7658 | 1.2857 | 1.3369 | 1.3483 |
| inception_v3 | 128 | 1.0 | 0.9981 | 0.0 | 1.1269 | 1.327 | 1.3075 |
| adv_inception_v3 | 128 | 1.0 | 0.9989 | 0.0 | 1.1291 | 1.327 | 1.3081 |
| gluon_inception_v3 | 128 | 0.9999 | 0.9989 | 0.0 | 1.1294 | 1.3265 | 1.3068 |
| resnest101e | 64 | 0.9998 | 1.0031 | 0.0 | 1.1692 | 1.3123 | 1.2681 |
| res2next50 | 128 | 0.9998 | 1.0006 | 0.0 | 1.1806 | 1.3099 | 1.274 |
| fbnetv3_b | 128 | 0.9647 | 0.9613 | 0.7629 | 1.2398 | 1.2858 | 1.2911 |
| crossvit_9_240 | 128 | 0.9996 | 0.9944 | 0.7584 | 1.0402 | 1.2722 | 1.2485 |
| coat_lite_mini | 128 | 1.0 | 1.0005 | 0.8454 | 1.093 | 1.2717 | 1.2585 |
| botnet26t_256 | 128 | 0.9858 | 0.9853 | 0.7904 | 0.0 | 1.2712 | 1.2666 |
| selecsls42b | 128 | 1.0 | 0.9987 | 0.8133 | 1.215 | 1.2672 | 1.2524 |
| mnasnet_100 | 128 | 0.9663 | 0.9635 | 0.7859 | 1.2531 | 1.2671 | 1.28 |
| tf_efficientnet_b0 | 128 | 0.9773 | 0.7833 | 0.0 | 1.1628 | 1.2583 | 1.2683 |
| gmixer_24_224 | 128 | 1.0 | 0.8349 | 0.0 | 1.0857 | 1.2561 | 1.2657 |
| fbnetc_100 | 128 | 0.9659 | 0.9627 | 0.7914 | 1.2437 | 1.2514 | 1.2655 |
| eca_botnext26ts_256 | 128 | 0.987 | 0.7719 | 0.0 | 0.0 | 1.25 | 1.2358 |
| sebotnet33ts_256 | 64 | 0.9763 | 0.8071 | 0.0 | 0.0 | 1.2495 | 1.25 |
| ese_vovnet19b_dw | 128 | 0.9792 | 0.9775 | 0.7424 | 1.149 | 1.2415 | 1.2463 |
| eca_halonext26ts | 128 | 0.9872 | 0.779 | 0.0 | 0.0 | 1.235 | 1.2285 |
| spnasnet_100 | 128 | 0.9602 | 0.9551 | 0.7747 | 1.2222 | 1.2326 | 1.2542 |
| res2net101_26w_4s | 64 | 0.9999 | 0.9963 | 0.7737 | 1.099 | 1.2269 | 1.1889 |
| jx_nest_base | 32 | 0.9999 | 0.9949 | 0.7373 | 0.0 | 1.2254 | 1.2006 |
| cspdarknet53 | 64 | 0.9575 | 0.9526 | 0.7354 | 1.173 | 1.2203 | 1.2285 |
| rexnet_100 | 128 | 0.9728 | 0.8161 | 0.0 | 1.1603 | 1.214 | 1.2195 |
| convit_base | 64 | 0.9997 | 0.9986 | 0.0 | 0.0 | 1.2092 | 1.189 |
| pnasnet5large | 16 | 0.9996 | 0.9985 | 0.0 | 1.0896 | 1.2089 | 1.1868 |
| dpn107 | 32 | 0.9575 | 0.9499 | 0.782 | 1.0262 | 1.1904 | 1.202 |
| tinynet_a | 128 | 0.9658 | 0.7745 | 0.6211 | 1.1485 | 1.19 | 1.1966 |
| gmlp_s16_224 | 128 | 0.9999 | 0.9967 | 0.0 | 1.0881 | 1.1847 | 1.1721 |
| tf_mixnet_l | 128 | 0.9853 | 0.8898 | 0.0 | 1.0942 | 1.1683 | 1.1655 |
| poolformer_m36 | 64 | 0.9999 | 0.9995 | 0.0 | 0.0 | 1.1659 | 1.1477 |
| pit_b_224 | 64 | 0.9998 | 0.9991 | 0.0 | 1.0282 | 1.1655 | 1.1554 |
| mobilevit_s | 64 | 0.9793 | 0.7603 | 0.0 | 0.0 | 1.1593 | 1.1576 |
| mixnet_l | 128 | 0.9849 | 0.8859 | 0.0 | 1.0987 | 1.1478 | 1.1454 |
| repvgg_a2 | 128 | 0.9644 | 0.9632 | 0.8286 | 1.1348 | 1.1438 | 1.1453 |
| convnext_base | 64 | 0.9999 | 0.9986 | 0.0 | 0.0 | 1.1254 | 1.108 |
| cait_m36_384 | 4 | 0.9999 | 0.0 | 0.0 | 0.0 | 1.1194 | 1.0977 |
| swsl_resnext101_32x16d | 32 | 0.9999 | 0.9981 | 0.0 | 1.1076 | 1.106 | 1.0719 |
| twins_pcpvt_base | 64 | 1.0001 | 0.9988 | 0.7498 | 0.0 | 1.0893 | 1.057 |
| gluon_xception65 | 32 | 0.9998 | 0.9971 | 0.0 | 1.0807 | 1.0867 | 1.0744 |
| swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9794 | 0.0 | 0.0 | 1.0861 | 1.0722 |
| convmixer_768_32 | 32 | 0.9998 | 0.9999 | 0.0 | 0.0 | 1.0777 | 1.0746 |
| gernet_l | 128 | 0.9735 | 0.9729 | 0.8251 | 1.0983 | 1.0764 | 1.0715 |
| beit_base_patch16_224 | 64 | 0.9997 | 0.981 | 0.0 | 0.0 | 1.0676 | 1.057 |
| mixer_b16_224 | 128 | 0.9991 | 0.9992 | 0.0 | 0.8931 | 1.0603 | 1.0544 |
| deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9988 | 0.7677 | 0.9806 | 1.0545 | 1.0429 |
| vit_base_patch16_224 | 64 | 0.9997 | 0.9986 | 0.7679 | 0.9508 | 1.0462 | 1.0348 |
| visformer_small | 128 | 0.9998 | 1.003 | 0.7967 | 0.0 | 1.0396 | 1.0092 |
| resmlp_12_224 | 128 | 0.9998 | 0.9987 | 0.6955 | 1.2133 | 0.9947 | 0.9434 |
| tnt_s_patch16_224 | 128 | 0.9998 | 0.9994 | 0.0 | 0.0 | 0.0 | 1.4899 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| twins_pcpvt_base | 64 | 2.2988 | 10.4143 | 18.8403 | nan | 119.8398 | 114.2276 |
| hrnet_w18 | 128 | 6.3549 | 24.7566 | nan | 670.9958 | 116.2317 | 109.6435 |
| xcit_large_24_p8_224 | 5 | 2.8885 | 13.5408 | 26.0025 | nan | 78.6031 | 74.6303 |
| swin_base_patch4_window7_224 | 64 | 2.7664 | 10.1283 | nan | nan | 77.3379 | 74.8327 |
| mobilevit_s | 64 | 1.7315 | 6.2573 | nan | nan | 77.3148 | 74.9936 |
| pnasnet5large | 16 | 4.6189 | 18.5833 | nan | 364.031 | 75.786 | 74.3684 |
| cait_m36_384 | 4 | 3.0729 | nan | nan | nan | 71.3175 | 68.2166 |
| convnext_base | 64 | 1.3178 | 5.2766 | nan | nan | 63.2827 | 61.1582 |
| resnest101e | 64 | 3.258 | 12.839 | nan | 285.077 | 61.8797 | 58.5066 |
| jx_nest_base | 32 | 1.9244 | 7.2531 | 12.2915 | nan | 58.2694 | 54.1362 |
| res2net101_26w_4s | 64 | 3.136 | 13.5483 | 23.2394 | 252.5273 | 52.9613 | 49.5448 |
| coat_lite_mini | 128 | 1.0963 | 4.1332 | 6.8523 | 97.518 | 50.5024 | 49.9392 |
| eca_halonext26ts | 128 | 1.425 | 4.4242 | nan | nan | 50.1666 | 48.5187 |
| res2net50_14w_8s | 128 | 3.0163 | 12.1715 | nan | 259.8104 | 47.9391 | 45.7885 |
| poolformer_m36 | 64 | 1.7239 | 6.308 | nan | nan | 47.8668 | 45.4714 |
| sebotnet33ts_256 | 64 | 1.7184 | 5.2454 | nan | nan | 43.6684 | 42.3163 |
| gmlp_s16_224 | 128 | 1.0589 | 5.4592 | nan | 155.7467 | 39.9072 | 38.4502 |
| dpn107 | 32 | 4.3927 | 11.9348 | 36.2164 | 177.344 | 39.1511 | 36.7341 |
| crossvit_9_240 | 128 | 1.4838 | 6.6458 | 10.748 | 175.0564 | 38.0383 | 36.4877 |
| gluon_xception65 | 32 | 1.9616 | 8.7968 | nan | 157.0878 | 37.3869 | 34.6288 |
| fbnetv3_b | 128 | 3.2204 | 9.4882 | 25.0053 | 247.4742 | 37.371 | 35.6436 |
| volo_d1_224 | 64 | 1.302 | 6.1135 | 10.2725 | nan | 37.2314 | 36.7579 |
| eca_botnext26ts_256 | 128 | 1.3673 | 4.249 | nan | nan | 36.6841 | 34.9922 |
| botnet26t_256 | 128 | 1.381 | 3.75 | 8.4388 | nan | 33.448 | 31.6722 |
| adv_inception_v3 | 128 | 1.6492 | 6.8899 | nan | 152.1529 | 33.2694 | 30.8976 |
| inception_v3 | 128 | 1.6819 | 6.8587 | nan | 145.2078 | 33.0758 | 30.7976 |
| gluon_inception_v3 | 128 | 1.6354 | 6.8407 | nan | 147.601 | 32.9697 | 31.3781 |
| ghostnet_100 | 128 | 2.8871 | 8.1161 | 12.1209 | 167.2503 | 32.5573 | 30.4794 |
| tf_mixnet_l | 128 | 6.0426 | 11.3113 | nan | 157.287 | 32.3951 | 30.6235 |
| dla102 | 128 | 1.8223 | 7.8307 | nan | 185.4614 | 32.0857 | 29.1377 |
| gmixer_24_224 | 128 | 1.1555 | 5.7996 | nan | 140.2644 | 31.8984 | 29.8985 |
| mixnet_l | 128 | 5.4274 | 10.9285 | nan | 153.5961 | 31.1626 | 29.3775 |
| swsl_resnext101_32x16d | 32 | 1.7608 | 7.4594 | nan | 128.3656 | 30.9386 | 28.786 |
| dm_nfnet_f0 | 128 | 2.0545 | 6.0507 | nan | 157.7834 | 30.2638 | 28.7845 |
| convit_base | 64 | 1.1265 | 4.5887 | nan | nan | 29.1617 | 27.1641 |
| res2next50 | 128 | 1.605 | 6.7843 | nan | 162.8949 | 27.9016 | 26.2596 |
| rexnet_100 | 128 | 1.9762 | 6.4168 | nan | 147.5297 | 26.4725 | 24.8187 |
| tinynet_a | 128 | 2.1013 | 6.8693 | 17.7171 | 142.9224 | 26.0694 | 25.0329 |
| tf_efficientnet_b0 | 128 | 1.8039 | 5.7734 | nan | 132.1516 | 23.5637 | 21.5316 |
| mixer_b16_224 | 128 | 0.7707 | 3.0629 | nan | 72.0798 | 22.7604 | 21.6974 |
| cspdarknet53 | 64 | 2.3131 | 6.2179 | 17.1888 | 121.7403 | 22.5478 | 21.1251 |
| visformer_small | 128 | 0.9744 | 3.4625 | 5.3958 | nan | 22.4183 | 20.6691 |
| resmlp_12_224 | 128 | 0.6603 | 2.4194 | 3.7983 | 32.7857 | 22.3738 | 21.0301 |
| convmixer_768_32 | 32 | 1.2304 | 4.9432 | nan | nan | 21.8854 | 20.5077 |
| spnasnet_100 | 128 | 2.0753 | 5.6067 | 15.2849 | 110.8186 | 21.8077 | 19.9741 |
| nfnet_l0 | 128 | 1.945 | 6.2714 | nan | 136.5999 | 21.6618 | 20.7701 |
| fbnetc_100 | 128 | 2.0875 | 5.728 | 15.5987 | 113.3847 | 21.4793 | 20.1376 |
| pit_b_224 | 64 | 0.9214 | 3.9621 | nan | 99.0031 | 20.4072 | 18.6982 |
| beit_base_patch16_224 | 64 | 1.1826 | 4.4899 | nan | nan | 20.2332 | 18.923 |
| repvgg_a2 | 128 | 2.0263 | 5.2025 | 14.2128 | 158.7531 | 20.026 | 18.9285 |
| mobilenetv3_large_100 | 128 | 1.5651 | 4.8245 | 11.9021 | 121.3766 | 19.9869 | 18.8334 |
| deit_base_distilled_patch16_224 | 64 | 0.9136 | 3.4854 | 5.7303 | 71.6446 | 19.4771 | 18.7342 |
| vit_base_patch16_224 | 64 | 0.9019 | 3.4764 | 5.5769 | 79.7797 | 19.2099 | 17.9912 |
| mobilenetv2_100 | 128 | 1.7542 | 4.7905 | 12.1019 | 101.16 | 18.6919 | 17.8254 |
| regnety_002 | 128 | 1.5858 | 4.5742 | 12.4221 | 96.1469 | 18.1176 | 17.0258 |
| mnasnet_100 | 128 | 1.7303 | 4.5678 | 11.9348 | 91.172 | 17.9741 | 17.3424 |
| gernet_l | 128 | 1.9907 | 5.2232 | 13.9056 | 87.815 | 17.8564 | 16.7339 |
| selecsls42b | 128 | 0.8911 | 3.0354 | 5.1056 | 76.3427 | 15.7698 | 14.8563 |
| lcnet_050 | 128 | 1.0443 | 2.8886 | 6.8253 | 69.7129 | 13.1906 | 12.2454 |
| ese_vovnet19b_dw | 128 | 1.0365 | 2.5922 | 6.8456 | 55.3995 | 12.4048 | 11.8518 |
| tnt_s_patch16_224 | 128 | 1.695 | 8.1192 | nan | nan | nan | 32.4489 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 1.6177 | 1.5612 | 1.6333 |
| tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.9898 | 1.351 | 1.5843 |
| nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.7757 | 1.2908 | 1.4944 |
| rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2619 | 1.4738 |
| tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.2059 | 1.3819 |
| pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1917 | 1.1876 | 1.3423 |
| mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.359 |
| mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1752 | 1.2828 |
| cait_m36_384 | 4 | 0.9994 | nan | nan | nan | 1.1137 | 1.1665 |
| eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1107 | 1.3329 |
| eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | nan | 1.1106 | 1.3589 |
| poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.0689 |
| dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.7593 | 1.0219 | 1.0956 |
| resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1036 |
| fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.9926 | 1.051 |
| ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 0.9853 | 1.1265 |
| convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 |
| tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9769 | 1.1451 |
| gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 1.0153 | 0.9766 | 0.9827 |
| beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 0.9672 | 1.0416 |
| mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 1.4726 | 0.9669 | 1.0504 |
| dla102 | 128 | 0.9831 | 0.917 | nan | 0.953 | 0.9632 | 1.0419 |
| xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | 0.3296 | nan | 0.9615 | 1.0491 |
| vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 1.2304 | 0.961 | 1.061 |
| volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9589 | 1.059 |
| deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 1.2337 | 0.9569 | 1.058 |
| ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 0.9519 | 1.0925 |
| gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.942 | 0.9938 |
| mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9408 | 1.0412 |
| convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.9403 | 0.9919 |
| spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9382 | 0.993 |
| hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8647 | 0.9379 | 1.0122 |
| twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3132 | nan | 0.9367 | 1.0739 |
| mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9325 | 0.9919 |
| res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9285 | 1.0154 |
| lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.9152 | 0.9655 |
| gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 |
| inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 |
| adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0636 |
| res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9078 | 1.0156 |
| mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9069 | 1.0618 |
| jx_nest_base | 32 | 1.0002 | 0.8966 | 0.2864 | nan | 0.9061 | 1.0578 |
| dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.894 | 0.9058 | 0.9905 |
| cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9052 | 1.0666 |
| fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9049 | 0.9968 |
| visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9034 | 0.9939 |
| selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.899 | 1.0046 |
| swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | 0.9946 |
| res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8821 | 1.0206 |
| regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8617 | 1.0396 |
| botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9609 |
| swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.8514 | 1.0359 |
| sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.8365 | 0.965 |
| pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8169 | 1.0652 |
| resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | 1.3763 | 0.8029 | 0.811 |
| gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9926 |
| coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.9856 | 0.737 | 1.0402 |
| convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.6848 | 0.8081 |
| crossvit_9_240 | 128 | 0.9884 | 0.8657 | 0.282 | 1.1222 | 0.5717 | 0.7352 |
| repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6571 | 0.5319 | 0.8171 |
| tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.7096 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)

~~~
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| convmixer_768_32 | 32 | 364.8698 | 364.918 | nan | nan | 338.9375 | 339.4406 |
| hrnet_w18 | 128 | 416.5177 | 415.894 | nan | 326.4175 | 293.258 | 301.6247 |
| pnasnet5large | 16 | 289.1291 | 289.1128 | nan | 264.9093 | 239.3291 | 243.7643 |
| convnext_base | 64 | 263.4238 | 264.0859 | nan | nan | 234.7067 | 238.5804 |
| swin_base_patch4_window7_224 | 64 | 237.0811 | 242.0348 | nan | nan | 218.1945 | 220.7139 |
| tf_mixnet_l | 128 | 256.4614 | 284.0631 | nan | 231.0512 | 216.4825 | 216.9337 |
| mixnet_l | 128 | 247.0726 | 274.8069 | nan | 221.567 | 212.044 | 212.4941 |
| swsl_resnext101_32x16d | 32 | 219.2918 | 219.4398 | nan | 197.9316 | 198.0351 | 205.0518 |
| dla102 | 128 | 269.1889 | 269.4027 | nan | 209.7056 | 194.6197 | 196.6122 |
| cait_m36_384 | 4 | 216.7592 | nan | nan | nan | 193.7713 | 197.3846 |
| resnest101e | 64 | 229.7688 | 228.8945 | nan | 196.7491 | 175.1846 | 181.0874 |
| adv_inception_v3 | 128 | 226.158 | 226.5703 | nan | 200.3605 | 170.7439 | 172.9212 |
| gluon_inception_v3 | 128 | 226.3195 | 226.4744 | nan | 200.2584 | 170.5963 | 173.4009 |
| inception_v3 | 128 | 226.1401 | 226.6428 | nan | 200.7947 | 170.5446 | 172.9838 |
| res2net50_14w_8s | 128 | 229.2329 | 229.0718 | nan | 183.2636 | 169.0043 | 172.8107 |
| gluon_xception65 | 32 | 182.5574 | 182.8602 | nan | 168.7724 | 167.8976 | 169.7422 |
| convit_base | 64 | 196.6623 | 196.7124 | nan | nan | 162.6476 | 165.2568 |
| res2next50 | 128 | 206.4752 | 206.2994 | nan | 175.1915 | 157.6175 | 162.0399 |
| dpn107 | 32 | 190.8717 | 192.3707 | 233.9357 | 178.0168 | 153.4274 | 152.167 |
| coat_lite_mini | 128 | 191.4252 | 191.4575 | 226.4606 | 175.0943 | 150.7318 | 152.1105 |
| mixer_b16_224 | 128 | 158.8331 | 158.958 | nan | 176.9978 | 150.5696 | 151.5773 |
| poolformer_m36 | 64 | 174.2899 | 174.5577 | nan | nan | 149.6409 | 151.8762 |
| gernet_l | 128 | 165.2302 | 165.1449 | 193.3376 | 146.3421 | 149.2756 | 149.8654 |
| dm_nfnet_f0 | 128 | 206.1922 | 205.8601 | nan | 182.8789 | 139.6823 | 144.8488 |
| pit_b_224 | 64 | 158.5216 | 158.5332 | nan | 153.9577 | 135.9971 | 136.9433 |
| eca_halonext26ts | 128 | 169.1169 | 214.4222 | nan | nan | 135.2831 | 135.9393 |
| eca_botnext26ts_256 | 128 | 163.2189 | 208.5178 | nan | nan | 128.7714 | 130.3128 |
| nfnet_l0 | 128 | 175.8798 | 223.2413 | nan | 160.0299 | 128.4513 | 133.5073 |
| gmlp_s16_224 | 128 | 151.8801 | 152.4785 | nan | 139.5092 | 128.2392 | 129.6096 |
| twins_pcpvt_base | 64 | 137.2927 | 137.2626 | 182.5868 | nan | 125.8347 | 129.4853 |
| res2net101_26w_4s | 64 | 151.7669 | 151.9041 | 195.6583 | 137.6509 | 123.483 | 127.6196 |
| visformer_small | 128 | 128.2611 | 127.9184 | 160.7667 | nan | 123.2122 | 127.0748 |
| fbnetv3_b | 128 | 162.2805 | 162.8745 | 205.3566 | 126.3788 | 121.878 | 121.3127 |
| beit_base_patch16_224 | 64 | 129.0057 | 131.0602 | nan | nan | 120.4964 | 121.6923 |
| botnet26t_256 | 128 | 151.9746 | 152.1519 | 189.8125 | nan | 117.9622 | 118.2873 |
| gmixer_24_224 | 128 | 146.387 | 175.363 | nan | 134.8156 | 116.5803 | 115.5722 |
| deit_base_distilled_patch16_224 | 64 | 120.6345 | 119.9848 | 156.2627 | 122.9875 | 114.5245 | 115.7226 |
| vit_base_patch16_224 | 64 | 119.2326 | 119.2333 | 155.2682 | 125.2242 | 114.0692 | 115.7388 |
| volo_d1_224 | 64 | 153.4824 | 154.0129 | 191.5485 | nan | 113.4761 | 115.4131 |
| repvgg_a2 | 128 | 126.9719 | 127.2875 | 146.1895 | 108.0863 | 107.0755 | 106.9849 |
| tf_efficientnet_b0 | 128 | 133.6702 | 166.9221 | nan | 112.4386 | 103.9456 | 102.9978 |
| cspdarknet53 | 64 | 130.2644 | 130.9866 | 169.5671 | 106.4483 | 102.1346 | 101.5151 |
| xcit_large_24_p8_224 | 5 | 135.2832 | 135.3284 | 172.7944 | nan | 99.1719 | 101.4937 |
| jx_nest_base | 32 | 121.2763 | 121.8316 | 164.5614 | nan | 99.0322 | 101.0075 |
| mobilevit_s | 64 | 116.9387 | 150.662 | nan | nan | 98.849 | 98.8883 |
| rexnet_100 | 128 | 118.9961 | 141.87 | nan | 99.7599 | 95.3277 | 94.9502 |
| fbnetc_100 | 128 | 123.3125 | 123.6719 | 150.706 | 95.7706 | 95.2166 | 94.0882 |
| sebotnet33ts_256 | 64 | 114.5015 | 138.3745 | nan | nan | 89.2962 | 89.2253 |
| tinynet_a | 128 | 109.8718 | 137.0436 | 170.8319 | 92.3002 | 89.0836 | 88.6794 |
| spnasnet_100 | 128 | 105.9156 | 106.4435 | 131.3893 | 83.1224 | 82.6195 | 81.1072 |
| ese_vovnet19b_dw | 128 | 99.4198 | 99.6376 | 131.2123 | 84.7803 | 78.405 | 78.0832 |
| crossvit_9_240 | 128 | 98.4292 | 99.0283 | 129.9609 | 94.5808 | 77.424 | 78.8338 |
| mnasnet_100 | 128 | 98.5076 | 98.8759 | 121.2435 | 75.9835 | 75.211 | 74.4059 |
| resmlp_12_224 | 128 | 71.0779 | 71.2133 | 102.4175 | 58.6397 | 71.7177 | 75.4166 |
| selecsls42b | 128 | 89.5358 | 89.6469 | 110.1123 | 73.6871 | 70.6458 | 71.5325 |
| mobilenetv2_100 | 128 | 97.495 | 97.7177 | 133.8093 | 73.3627 | 70.4262 | 69.4145 |
| mobilenetv3_large_100 | 128 | 85.3749 | 85.6763 | 107.9076 | 64.1723 | 61.773 | 61.1738 |
| ghostnet_100 | 128 | 114.3994 | 117.6129 | 138.2058 | 91.7125 | 61.3104 | 62.6696 |
| regnety_002 | 128 | 52.0852 | 51.2026 | 62.6241 | 56.2635 | 34.9215 | 39.8824 |
| lcnet_050 | 128 | 38.322 | 38.4833 | 47.5925 | 26.9335 | 22.0367 | 22.4606 |
| tnt_s_patch16_224 | 128 | 469.8618 | 469.7959 | nan | nan | nan | 315.3174 |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/2aeOlmH.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/PjbSOeW.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/nNaz4pt.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
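
As a rough illustration of the accuracy check and the filtering step described above, here is a minimal pure-Python sketch. It is not the benchmark harness itself: the function names, the tolerances, and the use of flat lists of floats in place of tensors are all assumptions made for the example.

```python
import math


def check_accuracy(ref_out, ref_grads, out, grads, rel_tol=1e-3, abs_tol=1e-5):
    """Compare forward outputs and gradients against a reference run.

    Hypothetical helper: values are flat lists of floats, and the
    tolerances are illustrative, not the ones used by the dashboard.
    """
    same_len = len(ref_out) == len(out) and len(ref_grads) == len(grads)
    pairs = list(zip(ref_out, out)) + list(zip(ref_grads, grads))
    ok = same_len and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol) for r, c in pairs
    )
    return "pass" if ok else "fail_accuracy"


def keep_passing(results):
    """Drop models whose accuracy status is not "pass" before computing metrics."""
    return {name: row for name, row in results.items() if row["accuracy"] == "pass"}


# Two made-up models: model_b has an output that drifts from the reference.
results = {
    "model_a": {
        "accuracy": check_accuracy([1.0, 2.0], [0.5], [1.0, 2.0001], [0.5]),
        "speedup": 1.4,
    },
    "model_b": {
        "accuracy": check_accuracy([1.0, 2.0], [0.5], [1.0, 2.5], [0.5]),
        "speedup": 1.9,
    },
}
print(sorted(keep_passing(results)))
```

Only `model_a` survives the filter here, so its speedup is the only one that would feed into the aggregate numbers.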

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 98%, 41/42  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 95%, 40/42  | 95%, 58/61  |
|     aot_cudagraphs     | 85%, 46/54 | 86%, 36/42  | 90%, 55/61  |
|    nvprims_nvfuser     | 59%, 32/54 |  10%, 4/42  | 52%, 32/61  |
|        inductor        | 83%, 45/54 | 90%, 38/42  | 92%, 56/61  |
| inductor_no_cudagraphs | 89%, 48/54 | 90%, 38/42  | 92%, 56/61  |
+------------------------+------------+-------------+-------------+
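
Each cell in the passrate table pairs a rounded percentage with the raw count. A small sketch of that formatting, assuming the percentage is simply rounded to the nearest integer (the helper name is hypothetical):

```python
def passrate(passed, total):
    """Format a pass count as "<percent>%, <passed>/<total>"."""
    return f"{100 * passed / total:.0f}%, {passed}/{total}"


# Counts taken from the table above.
print(passrate(52, 54))
print(passrate(4, 42))
```

With the counts from the table, this yields the same strings as the eager/torchbench and nvprims_nvfuser/huggingface cells.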

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.22x    |    1.13x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.04x    |    1.09x    |
|        inductor        |   1.87x    |    1.73x    |    1.40x    |
| inductor_no_cudagraphs |   1.37x    |    1.52x    |    1.35x    |
+------------------------+------------+-------------+-------------+
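
For reference, a geometric mean over per-model speedups can be sketched as below. Treating non-positive entries (such as the `0.0` values that appear for failed runs in the per-model tables) as excluded is an assumption of this example, not something the dashboard states.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of positive speedups; non-positive entries are skipped.

    Skipping 0.0 entries is an assumption for illustration: a single zero
    would otherwise force the geometric mean to zero.
    """
    valid = [s for s in speedups if s > 0]
    if not valid:
        return 0.0
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Two successful runs (2x and 8x) and one failed run recorded as 0.0.
print(f"{geomean_speedup([2.0, 8.0, 0.0]):.2f}x")  # prints "4.00x"
```

The geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 0.5x slowdown then cancel out to 1.0x.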

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.02    |    2.77     |    2.27     |
|       aot_eager        |    6.41    |    9.89     |    8.53     |
|     aot_cudagraphs     |    9.49    |    17.59    |    16.11    |
|    nvprims_nvfuser     |   66.76    |   133.58    |   148.52    |
|        inductor        |   32.71    |    37.94    |    43.00    |
| inductor_no_cudagraphs |   32.54    |    33.20    |    41.01    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.84x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.41x    |    0.38x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.01x    |    0.86x    |
|        inductor        |   0.82x    |    0.85x    |    0.94x    |
| inductor_no_cudagraphs |   0.94x    |    1.01x    |    1.05x    |
+------------------------+------------+-------------+-------------+
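
One way to read "higher is better" is to assume the compression ratio is the baseline peak memory divided by the compiled peak memory, so that values above 1.0 mean the backend uses less memory than native pytorch. The sketch below works under that assumption; the function name and the sample numbers are made up.

```python
def compression_ratio(baseline_peak_gb, compiled_peak_gb):
    """Peak memory of the baseline divided by peak memory of the compiled run.

    Assumed definition: > 1.0 means the compiled run has a smaller peak
    footprint, < 1.0 means it uses more memory than the baseline.
    """
    if compiled_peak_gb <= 0:
        raise ValueError("compiled peak memory must be positive")
    return baseline_peak_gb / compiled_peak_gb


# Hypothetical peaks: the compiled run needs 12.5 GB against a 10 GB baseline.
print(f"{compression_ratio(10.0, 12.5):.2f}x")  # prints "0.80x"
```

Under this reading, the low `aot_cudagraphs` ratios in the table would correspond to a peak footprint roughly two to three times the baseline.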

Warnings

We flag models where:

- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings

~~~
+-------------+-----------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench | dlrm | 0.9445 | 1.1847 |
| torchbench | hf_GPT2_large | 0.0 | 1.8623 |
| torchbench | tacotron2 | 0.0 | 0.8842 |
| torchbench | hf_BigBird | 0.0 | 0.0 |
| torchbench | hf_Longformer | 0.0 | 0.0 |
| torchbench | moco | 0.0 | 0.0 |
| huggingface | BigBird | 0.0 | 0.0 |
| huggingface | AllenaiLongformerBase | 0.0 | 0.0 |
| timm_models | convnext_base | 0.6628 | 0.6485 |
| timm_models | eca_halonext26ts | 0.0 | 0.0 |
+-------------+-----------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+-------------+-------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-------------------+----------+------------------------+
| torchbench | yolov3 | 396.8923 | 405.2775 |
| torchbench | timm_efficientdet | 142.0943 | 142.3507 |
| torchbench | hf_T5_large | 138.224 | 138.1156 |
| timm_models | hrnet_w18 | 142.0493 | 134.58 |
| timm_models | twins_pcpvt_base | 126.7577 | 125.0894 |
+-------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+----------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench | speech_transformer | 0.8804 | 0.8866 |
| torchbench | timm_vision_transformer_large | 0.879 | 1.0245 |
| torchbench | BERT_pytorch | 0.8778 | 1.0938 |
| torchbench | timm_resnest | 0.8759 | 0.9953 |
| torchbench | densenet121 | 0.8753 | 1.0051 |
| torchbench | squeezenet1_1 | 0.8735 | 1.0608 |
| torchbench | hf_Bert | 0.8728 | 0.942 |
| torchbench | shufflenet_v2_x1_0 | 0.8692 | 0.9802 |
| torchbench | resnet50 | 0.8658 | 0.885 |
| torchbench | hf_T5_large | 0.8541 | 0.8541 |
| torchbench | hf_DistilBert | 0.8348 | 0.9049 |
| torchbench | fastNLP_Bert | 0.8013 | 1.0681 |
| torchbench | alexnet | 0.7973 | 1.0079 |
| torchbench | hf_Bart | 0.7933 | 0.9724 |
| torchbench | mobilenet_v3_large | 0.791 | 0.8143 |
| torchbench | timm_vovnet | 0.7799 | 0.8875 |
| torchbench | pytorch_stargan | 0.7783 | 0.8847 |
| torchbench | resnext50_32x4d | 0.7644 | 0.7753 |
| torchbench | vgg16 | 0.7633 | 1.0588 |
| torchbench | mnasnet1_0 | 0.7541 | 0.7741 |
| torchbench | drq | 0.752 | 0.9256 |
| torchbench | LearningToPaint | 0.7295 | 0.925 |
| torchbench | soft_actor_critic | 0.7295 | 1.0368 |
| torchbench | timm_vision_transformer | 0.7133 | 0.7227 |
| torchbench | dlrm | 0.704 | 0.7306 |
| torchbench | resnet18 | 0.6102 | 0.6257 |
| torchbench | hf_Reformer | 0.5851 | 1.0014 |
| torchbench | lennard_jones | 0.564 | 0.9991 |
| torchbench | nvidia_deeprecommender | 0.5596 | 0.5596 |
| torchbench | functorch_dp_cifar10 | 0.4478 | 0.4688 |
| torchbench | pytorch_struct | 0.4235 | 0.4353 |
| torchbench | dcgan | 0.2123 | 0.2137 |
| torchbench | tacotron2 | nan | 0.4112 |
| huggingface | MegatronBertForQuestionAnswering | 0.893 | 1.0053 |
| huggingface | MegatronBertForCausalLM | 0.8919 | 1.0207 |
| huggingface | DistilBertForQuestionAnswering | 0.89 | 0.9848 |
| huggingface | BertForMaskedLM | 0.8834 | 0.9285 |
| huggingface | RobertaForCausalLM | 0.8829 | 0.9282 |
| huggingface | TrOCRForCausalLM | 0.8816 | 0.9425 |
| huggingface | MBartForConditionalGeneration | 0.8755 | 1.0595 |
| huggingface | MT5ForConditionalGeneration | 0.875 | 0.919 |
| huggingface | OPTForCausalLM | 0.8727 | 0.9449 |
| huggingface | PLBartForConditionalGeneration | 0.8523 | 0.9876 |
| huggingface | DistilBertForMaskedLM | 0.8215 | 0.8801 |
| huggingface | CamemBert | 0.8065 | 0.9306 |
| huggingface | XGLMForCausalLM | 0.8055 | 0.9513 |
| huggingface | DistillGPT2 | 0.8047 | 0.9949 |
| huggingface | Speech2Text2ForCausalLM | 0.8039 | 0.898 |
| huggingface | PLBartForCausalLM | 0.7975 | 0.8675 |
| huggingface | ElectraForCausalLM | 0.7949 | 0.8607 |
| huggingface | YituTechConvBert | 0.7909 | 0.9314 |
| huggingface | BlenderbotSmallForCausalLM | 0.778 | 0.859 |
| huggingface | M2M100ForConditionalGeneration | 0.7464 | 0.9888 |
| huggingface | MobileBertForMaskedLM | 0.5931 | 0.7994 |
| huggingface | MobileBertForQuestionAnswering | 0.4995 | 0.635 |
| huggingface | DebertaForMaskedLM | 0.409 | 1.026 |
| huggingface | DebertaForQuestionAnswering | 0.3071 | 1.1616 |
| timm_models | res2net101_26w_4s | 0.8977 | 0.973 |
| timm_models | inception_v3 | 0.8975 | 1.0248 |
| timm_models | gluon_xception65 | 0.8975 | 0.9763 |
| timm_models | adv_inception_v3 | 0.8975 | 1.0248 |
| timm_models | gluon_inception_v3 | 0.8975 | 1.0248 |
| timm_models | fbnetc_100 | 0.8973 | 0.9876 |
| timm_models | hrnet_w18 | 0.8969 | 1.0032 |
| timm_models | mixer_b16_224 | 0.8927 | 0.963 |
| timm_models | selecsls42b | 0.8926 | 0.9897 |
| timm_models | vit_base_patch16_224 | 0.8877 | 0.8929 |
| timm_models | deit_base_distilled_patch16_224 | 0.8872 | 0.8923 |
| timm_models | spnasnet_100 | 0.8795 | 0.9819 |
| timm_models | res2net50_14w_8s | 0.877 | 0.9738 |
| timm_models | convnext_base | 0.8729 | 0.9865 |
| timm_models | res2next50 | 0.8719 | 0.9671 |
| timm_models | mnasnet_100 | 0.871 | 0.9804 |
| timm_models | mixnet_l | 0.8701 | 1.0089 |
| timm_models | gernet_l | 0.8619 | 0.9858 |
| timm_models | cspdarknet53 | 0.8607 | 1.0102 |
| timm_models | botnet26t_256 | 0.8503 | 0.9434 |
| timm_models | lcnet_050 | 0.8449 | 0.9432 |
| timm_models | regnety_002 | 0.8371 | 1.0078 |
| timm_models | resmlp_12_224 | 0.7981 | 0.8121 |
| timm_models | sebotnet33ts_256 | 0.745 | 0.8294 |
| timm_models | coat_lite_mini | 0.7194 | 1.0197 |
| timm_models | crossvit_9_240 | 0.7141 | 0.9624 |
| timm_models 
| jx_nest_base | 0.6644 | 0.8514 | | timm_models | swin_base_patch4_window7_224 | 0.6295 | 0.7419 | | timm_models | repvgg_a2 | 0.5534 | 0.8298 | +-------------+----------------------------------+----------+------------------------+ ~~~

Metrics over time

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/j1DDXUz.png)

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/hTIrnn6.png)
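For readers who want to reproduce the two plotted metrics from the per-model tables, here is a minimal sketch. It is not the dashboard's own script: the helper names `geomean` and `pass_rate` are hypothetical, and it assumes that a speedup of `0.0` or `nan` marks a failed run to be left out of the geometric mean, and that `pass_due_to_skip` counts as passing. The sample values are a handful of inductor entries copied from the torchbench amp tables in this comment.

```python
import math


def geomean(speedups):
    """Geometric mean over successful runs only.

    Assumption: 0.0 and nan mark a failed run and are skipped.
    """
    ok = [s for s in speedups if s > 0 and not math.isnan(s)]
    if not ok:
        return float("nan")
    return math.exp(sum(math.log(s) for s in ok) / len(ok))


def pass_rate(statuses):
    """Fraction of models whose accuracy status starts with 'pass'.

    Assumption: 'pass_due_to_skip' counts as passing; anything else
    (fail_to_run, fail_accuracy, 0.0000) counts as failing.
    """
    if not statuses:
        return float("nan")
    passing = sum(1 for s in statuses if s.startswith("pass"))
    return passing / len(statuses)


# A few inductor entries copied from the torchbench amp tables.
inductor_speedup = {
    "densenet121": 5.8115,
    "demucs": 1.001,
    "dlrm": 0.9445,
    "hf_GPT2_large": 0.0,  # failed run, skipped by geomean
}
inductor_accuracy = {
    "densenet121": "pass",
    "demucs": "pass",
    "dlrm": "fail_to_run",
    "hf_GPT2_large": "pass_due_to_skip",
}

print(f"geomean speedup: {geomean(inductor_speedup.values()):.4f}")
print(f"pass rate: {pass_rate(list(inductor_accuracy.values())):.0%}")
```

Skipping failed runs makes the geomean describe only the models that compiled and ran, so it should be read together with the pass rate rather than on its own.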

Accuracy Regressions

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0006 | 0.9276 | 2.5158 | 0.7313 | 5.8115 | 1.321 | | functorch_dp_cifar10 | 64 | 1.0032 | 0.9559 | 2.3995 | 0.0 | 5.4544 | 1.3374 | | timm_efficientdet | 1 | 0.9861 | 0.8115 | 2.1062 | 0.0 | 4.7843 | 1.5321 | | resnext50_32x4d | 8 | 1.0018 | 0.9675 | 1.8927 | 0.7456 | 3.6954 | 1.2671 | | BERT_pytorch | 16 | 1.0159 | 0.8436 | 1.5817 | 0.8419 | 3.3505 | 2.3211 | | timm_vision_transformer | 8 | 1.0107 | 0.8492 | 1.7671 | 0.6017 | 3.3156 | 1.5577 | | mobilenet_v3_large | 32 | 1.0039 | 1.0041 | 1.4892 | 0.7707 | 3.0104 | 1.3906 | | drq | 1 | 1.0106 | 0.822 | 1.8974 | 0.6272 | 2.9883 | 1.1384 | | resnet18 | 16 | 1.0063 | 0.999 | 1.5639 | 0.8134 | 2.9278 | 1.2058 | | dcgan | 32 | 0.9795 | 0.9267 | 1.6922 | 0.7567 | 2.846 | 1.0424 | | hf_T5_large | 2 | 1.0181 | 0.8628 | 0.0 | 0.0 | 2.774 | 2.1233 | | mnasnet1_0 | 32 | 1.0016 | 1.0168 | 1.2754 | 0.779 | 2.7329 | 1.3507 | | squeezenet1_1 | 32 | 0.9974 | 0.9668 | 1.4502 | 0.744 | 2.4967 | 1.2861 | | hf_Albert | 8 | 1.0016 | 0.956 | 0.7745 | 0.0 | 2.3295 | 2.2738 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9987 | 0.9868 | 1.774 | 0.0 | 2.1474 | 1.5674 | | lennard_jones | 1000 | 0.9683 | 0.779 | 1.2842 | 0.4669 | 2.1088 | 1.0462 | | timm_efficientnet | 32 | 0.9583 | 0.8095 | 1.0947 | 0.6858 | 2.1034 | 1.2792 | | pytorch_struct | 200 | 0.9851 | 0.7524 | 1.036 | 0.6017 | 2.0929 | 1.2669 | | hf_Bert | 4 | 1.0381 | 0.8673 | 0.9468 | 0.0 | 2.0229 | 1.8207 | | resnet152 | 32 | 1.0008 | 1.0039 | 1.2577 | 0.0 | 2.012 | 1.3053 | | hf_GPT2 | 4 | 1.0187 | 0.9827 | 0.8819 | 0.2931 | 1.9157 | 1.8965 | | timm_resnest | 32 | 
1.0035 | 1.0253 | 0.8437 | 0.968 | 1.9085 | 1.6902 | | hf_T5 | 8 | 1.0004 | 0.92 | 0.0 | 1.3557 | 1.8664 | 1.8725 | | LearningToPaint | 96 | 1.0051 | 1.0108 | 1.1555 | 0.842 | 1.8299 | 1.3213 | | soft_actor_critic | 256 | 0.9845 | 0.7638 | 1.332 | 0.5414 | 1.74 | 1.0459 | | resnet50 | 32 | 1.0002 | 1.0203 | 1.0523 | 0.816 | 1.7359 | 1.3537 | | hf_Bart | 4 | 1.0137 | 0.8466 | 0.8923 | 0.0 | 1.7225 | 1.6745 | | shufflenet_v2_x1_0 | 128 | 0.999 | 1.0159 | 0.9674 | 0.8463 | 1.6891 | 1.4309 | | mobilenet_v2 | 96 | 0.9996 | 0.9893 | 0.7601 | 1.0475 | 1.559 | 1.5155 | | attention_is_all_you_need_pytorch | 256 | 1.0092 | 0.9243 | 0.8385 | 0.0 | 1.5555 | 1.4726 | | speech_transformer | 32 | 1.0003 | 0.8314 | 1.747 | 0.0 | 1.531 | 1.56 | | timm_nfnet | 128 | 0.9989 | 0.9995 | 0.8775 | 0.9197 | 1.5075 | 1.4314 | | fastNLP_Bert | 6 | 0.9984 | 0.9124 | 0.766 | 0.0 | 1.4996 | 1.4519 | | hf_DistilBert | 8 | 1.0003 | 0.9716 | 0.7434 | 0.3693 | 1.4833 | 1.499 | | pytorch_stargan | 16 | 0.9952 | 1.1161 | 1.0427 | 0.0 | 1.4451 | 1.3833 | | pytorch_unet | 1 | 0.9994 | 0.9923 | 0.8644 | 1.0815 | 1.3615 | 1.3319 | | timm_regnet | 32 | 0.9778 | 0.9452 | 0.8993 | 0.7858 | 1.3252 | 1.2303 | | timm_vovnet | 32 | 0.9203 | 0.8865 | 0.9229 | 0.8008 | 1.2808 | 1.1527 | | vgg16 | 64 | 1.0 | 0.9978 | 0.8581 | 0.973 | 1.2732 | 1.2665 | | Background_Matting | 4 | 0.9995 | 1.0174 | 0.8969 | 1.0567 | 1.238 | 1.2199 | | Super_SloMo | 6 | 0.9993 | 0.9948 | 0.8862 | 0.0 | 1.2253 | 1.1964 | | alexnet | 128 | 0.9989 | 0.9981 | 0.8155 | 0.9338 | 1.211 | 1.2095 | | hf_Reformer | 4 | 0.9975 | 1.0009 | 0.993 | 0.6594 | 1.1751 | 1.1765 | | timm_vision_transformer_large | 8 | 1.0 | 0.9911 | 0.0 | 0.0 | 1.0919 | 1.0672 | | yolov3 | 16 | 0.9998 | 0.9908 | 0.8067 | 0.0 | 1.0875 | 1.0653 | | tts_angular | 64 | 0.9825 | 0.9506 | 1.0002 | 0.9719 | 1.0028 | 1.0067 | | demucs | 4 | 1.0003 | 0.9983 | 0.9995 | 0.9995 | 1.001 | 1.0005 | | nvidia_deeprecommender | 256 | 0.9987 | 0.996 | 0.6968 | 1.0069 | 0.9891 | 1.0305 | 
| dlrm | 2048 | 1.0533 | 1.1526 | 0.0 | 1.0785 | 0.9445 | 1.1847 | | hf_GPT2_large | 4 | 1.0005 | 0.9912 | 0.0 | 0.0 | 0.0 | 1.8623 | | tacotron2 | 64 | 0.9845 | 0.7696 | 1.0046 | 0.6037 | 0.0 | 0.8842 | | hf_BigBird | 2 | 0.9819 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | 
pass | pass | pass | fail_to_run | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass 
| | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | fail_to_run | pass | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 3.065 | 8.4462 | 11.9444 | nan | 396.8923 | 405.2775 | | timm_efficientdet | 1 | 20.3726 | 37.2553 | 77.8407 | nan | 142.0943 | 142.3507 | | hf_T5_large | 2 | 14.3798 | 39.3004 | nan | nan | 138.224 | 138.1156 | | timm_vision_transformer_large | 8 | 3.122 | 15.0511 | nan | nan | 67.6933 | 67.6834 | | resnet152 | 32 | 2.7367 | 13.25 | 22.0731 | nan | 52.02 | 50.4775 | | 
densenet121 | 4 | 2.4305 | 11.9831 | 19.3474 | 247.4363 | 50.4813 | 49.4294 | | attention_is_all_you_need_pytorch | 256 | 1.467 | 7.0527 | 11.3779 | nan | 39.456 | 38.4595 | | timm_resnest | 32 | 0.65 | 2.5176 | 3.8589 | 67.0897 | 38.9665 | 38.6802 | | BERT_pytorch | 16 | 1.8486 | 7.5855 | 11.469 | 136.0485 | 35.6548 | 34.4088 | | speech_transformer | 32 | 2.0228 | 9.3011 | 33.5641 | nan | 35.6019 | 34.4415 | | timm_vision_transformer | 8 | 1.0287 | 4.6383 | 6.6173 | 86.0508 | 35.525 | 35.3181 | | hf_Bart | 4 | 2.0782 | 8.7583 | 13.9066 | nan | 34.8192 | 34.7762 | | timm_nfnet | 128 | 2.1492 | 7.0821 | 10.5145 | 167.5076 | 32.0691 | 31.4523 | | fastNLP_Bert | 6 | 1.8671 | 7.1598 | 11.82 | nan | 31.9791 | 29.934 | | hf_T5 | 8 | 2.688 | 8.8507 | nan | 107.487 | 31.1163 | 30.5634 | | timm_regnet | 32 | 2.5059 | 8.0707 | 20.0185 | 142.6338 | 28.9258 | 27.562 | | pytorch_stargan | 16 | 0.4719 | 1.9582 | 2.9601 | nan | 27.4067 | 26.971 | | timm_efficientnet | 32 | 1.9365 | 6.7238 | 16.08 | 149.5086 | 26.7674 | 26.6727 | | mobilenet_v3_large | 32 | 1.0405 | 4.7831 | 7.441 | 121.4985 | 25.8321 | 25.3097 | | hf_Bert | 4 | 1.8565 | 7.0608 | 10.4271 | nan | 24.0507 | 22.6883 | | hf_Albert | 8 | 1.6134 | 6.442 | 10.0873 | nan | 22.0157 | 21.3301 | | functorch_dp_cifar10 | 64 | 0.3494 | 1.411 | 2.1193 | nan | 21.8939 | 22.3051 | | mnasnet1_0 | 32 | 0.9404 | 4.2732 | 6.6759 | 86.6758 | 21.0284 | 20.5944 | | resnet50 | 32 | 1.0009 | 4.5703 | 6.9909 | 100.3474 | 20.3795 | 20.05 | | resnext50_32x4d | 8 | 1.0273 | 4.6233 | 6.8276 | 83.812 | 20.2143 | 19.7119 | | shufflenet_v2_x1_0 | 128 | 1.0902 | 5.0077 | 7.6493 | 103.3309 | 20.1397 | 20.02 | | hf_GPT2 | 4 | 1.7153 | 6.2466 | 10.0249 | 113.2022 | 19.9479 | 19.384 | | timm_vovnet | 32 | 1.5722 | 4.4098 | 9.8226 | 71.6804 | 19.8297 | 19.5186 | | mobilenet_v2 | 96 | 0.9408 | 4.5121 | 7.0761 | 117.6057 | 19.1549 | 19.0243 | | pytorch_struct | 200 | 0.2894 | 0.8574 | 1.4916 | 7.6973 | 18.856 | 19.5693 | | Background_Matting | 4 | 0.9667 
| 4.4018 | 6.5582 | 95.7802 | 18.3653 | 17.1587 | | hf_Reformer | 4 | 1.6656 | 2.9886 | 5.4557 | 17.4189 | 18.296 | 15.9486 | | Super_SloMo | 6 | 0.9845 | 3.9874 | 5.8241 | nan | 16.5626 | 16.0884 | | hf_DistilBert | 8 | 0.8201 | 3.4066 | 5.7209 | 64.2446 | 14.9955 | 14.6735 | | resnet18 | 16 | 0.4667 | 1.7891 | 2.6313 | 38.5966 | 11.7833 | 11.2282 | | dcgan | 32 | 0.1804 | 0.4244 | 0.6725 | 5.1633 | 10.2073 | 9.983 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4816 | 1.9758 | 2.8615 | nan | 8.985 | 8.818 | | pytorch_unet | 1 | 0.4256 | 1.839 | 2.7812 | 39.5775 | 8.2731 | 8.1783 | | LearningToPaint | 96 | 0.4958 | 1.9122 | 2.9368 | 46.723 | 7.9379 | 7.6186 | | squeezenet1_1 | 32 | 0.2658 | 0.9319 | 1.4001 | 7.065 | 4.8672 | 4.5149 | | vgg16 | 64 | 0.2032 | 0.6458 | 1.0928 | 5.0263 | 4.1909 | 3.8648 | | drq | 1 | 0.3198 | 0.6245 | 1.0412 | 6.4706 | 3.8897 | 3.5505 | | nvidia_deeprecommender | 256 | 0.2344 | 0.5149 | 0.8673 | 5.6884 | 3.5607 | 3.2507 | | dlrm | 2048 | 0.4833 | 0.8524 | nan | 4.8883 | 3.5384 | 3.3201 | | soft_actor_critic | 256 | 0.2076 | 0.361 | 0.5773 | 3.2529 | 3.3606 | 2.802 | | alexnet | 128 | 0.1792 | 0.4381 | 0.7299 | 5.0527 | 3.2041 | 3.0573 | | lennard_jones | 1000 | 0.1596 | 0.3448 | 0.5292 | 2.9458 | 2.2452 | 1.8935 | | tts_angular | 64 | 0.1941 | 0.2475 | 0.3681 | 1.471 | 1.8264 | 1.6383 | | demucs | 4 | 0.3538 | 0.357 | 0.3521 | 0.3498 | 0.2604 | 0.2606 | | hf_GPT2_large | 4 | 5.9201 | 20.1886 | nan | nan | nan | 55.7009 | | tacotron2 | 64 | 5.973 | 19.746 | 34.2981 | 91.3752 | nan | 45.8188 | | hf_BigBird | 2 | 4.0106 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2721 | 0.4638 | 1.2042 | 1.2318 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 1.0606 | 1.1512 | | Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3844 | nan | 1.0541 | 1.3039 | | timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3557 | 0.4815 | 1.0334 | 1.1302 | | hf_Albert | 8 | 1.0001 | 0.936 | 0.3268 | nan | 1.0313 | 1.4693 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3513 | nan | 1.005 | 1.1086 | | timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3081 | nan | 0.9991 | 1.0312 | | Background_Matting | 4 | 1.0146 | 0.9624 | 0.3724 | 0.9771 | 0.9916 | 1.0426 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.3799 | 1.118 | 0.9649 | 1.1241 | | timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.8027 | 0.9347 | 1.0307 | | hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.014 | 0.9304 | 1.2458 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8754 | 0.4232 | nan | 0.9165 | 1.0224 | | resnet152 | 32 | 0.9937 | 0.8956 | 0.3632 | nan | 0.912 | 0.9398 | | pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3572 | 0.8496 | 0.9111 | 1.0853 | | yolov3 | 16 | 0.9908 | 0.8381 | 0.3537 | nan | 0.9063 | 1.0466 | | speech_transformer | 32 | 0.9991 | 0.9812 | 0.334 | nan | 0.8804 | 0.8866 | | timm_vision_transformer_large | 8 | 0.9973 | 0.8357 | nan | nan | 0.879 | 1.0245 | | BERT_pytorch | 16 | 1.0 | 0.8822 | 0.4003 | 1.1061 | 0.8778 | 1.0938 | | timm_resnest | 32 | 0.9868 | 0.8711 | 0.3482 | 0.8451 | 0.8759 | 0.9953 | | densenet121 | 4 | 0.9857 | 0.8678 | 0.3673 
| 0.8452 | 0.8753 | 1.0051 | | squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.8735 | 1.0608 | | hf_Bert | 4 | 1.0 | 0.8759 | 0.3903 | nan | 0.8728 | 0.942 | | shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8692 | 0.9802 | | resnet50 | 32 | 0.9907 | 0.8629 | 0.3561 | 0.7806 | 0.8658 | 0.885 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 | | hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3415 | 1.0617 | 0.8348 | 0.9049 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8013 | 1.0681 | | alexnet | 128 | 0.951 | 0.7753 | 0.4794 | 0.775 | 0.7973 | 1.0079 | | hf_Bart | 4 | 1.0002 | 0.8307 | 0.3635 | nan | 0.7933 | 0.9724 | | mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3448 | 0.7921 | 0.791 | 0.8143 | | timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3409 | 0.7755 | 0.7799 | 0.8875 | | pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4252 | nan | 0.7783 | 0.8847 | | resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3886 | 0.81 | 0.7644 | 0.7753 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.3776 | 0.7341 | 0.7633 | 1.0588 | | mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3407 | 0.8226 | 0.7541 | 0.7741 | | drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 | | LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3826 | 0.6701 | 0.7295 | 0.925 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3917 | 1.0881 | 0.7133 | 0.7227 | | dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | 0.704 | 0.7306 | | resnet18 | 16 | 0.9779 | 0.7727 | 0.3941 | 0.7314 | 0.6102 | 0.6257 | | hf_Reformer | 4 | 0.9992 | 0.9996 | 0.6037 | 0.9999 | 0.5851 | 1.0014 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | 0.4478 | 0.4688 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 | | dcgan | 32 | 0.9698 | 
0.7838 | 0.4994 | 0.7838 | 0.2123 | 0.2137 | | hf_GPT2_large | 4 | 0.9956 | 0.8732 | nan | nan | nan | 1.1499 | | tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3906 | nan | 0.4112 | | hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | dlrm | 2048 | 493.6629 | 458.1229 | nan | 497.9979 | 536.7773 | 464.6625 | | timm_vision_transformer_large | 8 | 184.168 | 185.8378 | nan | nan | 168.7548 | 172.7203 | | Background_Matting | 4 | 134.5843 | 131.0407 | 148.7827 | 126.2129 | 107.8275 | 109.3642 | | hf_T5 | 8 | 174.5947 | 189.6001 | nan | 128.514 | 93.3621 | 93.2823 | | hf_T5_large | 2 | 215.6451 | 257.5141 | nan | nan | 89.2079 | 109.1021 | | timm_nfnet | 128 | 131.7884 | 131.7555 | 148.9718 | 142.5169 | 87.3515 | 91.7872 | | hf_Reformer | 4 | 82.4728 | 82.1491 | 82.9547 | 124.7627 | 69.9089 | 70.1454 | | Super_SloMo | 6 | 79.2878 | 79.4753 | 89.5611 | nan | 64.718 | 66.1992 | | yolov3 | 16 | 68.4877 | 69.2864 | 85.1904 | nan | 63.1721 | 64.5266 | | demucs | 4 | 58.0356 | 57.1007 | 57.2944 | 57.2712 | 57.1127 | 57.3467 | | timm_regnet | 32 | 72.7716 | 75.9439 | 80.6201 | 90.6479 | 55.0734 | 59.2031 | | vgg16 | 64 | 66.0452 | 66.2469 | 77.1826 | 68.0405 | 51.9995 | 52.4063 | | resnet152 | 32 | 89.0849 | 89.5417 | 72.9115 | nan | 45.9083 | 71.8123 | | speech_transformer | 32 | 59.8092 | 79.8574 | 34.1692 | nan | 40.6623 
| 40.5596 | | fastNLP_Bert | 6 | 55.7815 | 61.0245 | 73.5812 | nan | 37.1937 | 38.5411 | | timm_efficientdet | 1 | 162.7863 | 196.0811 | 76.5661 | nan | 36.971 | 108.6853 | | attention_is_all_you_need_pytorch | 256 | 52.7455 | 57.2412 | 63.2913 | nan | 34.8341 | 36.1031 | | hf_Bart | 4 | 55.7948 | 66.2784 | 65.0982 | nan | 34.0203 | 35.3757 | | mobilenet_v2 | 96 | 48.9134 | 49.348 | 64.2962 | 46.6972 | 31.3848 | 32.3353 | | hf_Albert | 8 | 68.3299 | 71.476 | 88.2181 | nan | 29.3785 | 30.1505 | | pytorch_unet | 1 | 39.9423 | 40.2158 | 46.1709 | 36.8751 | 29.2988 | 30.0144 | | hf_GPT2 | 4 | 48.2038 | 49.3817 | 60.531 | 168.1676 | 25.3878 | 25.9212 | | timm_vovnet | 32 | 34.4106 | 36.0622 | 37.1783 | 41.9907 | 24.7882 | 28.3553 | | shufflenet_v2_x1_0 | 128 | 39.9577 | 39.9381 | 41.6368 | 50.9577 | 24.2952 | 29.2918 | | timm_efficientnet | 32 | 47.5453 | 56.8605 | 43.1739 | 68.7083 | 22.4658 | 37.581 | | hf_Bert | 4 | 39.329 | 47.8708 | 43.9615 | nan | 21.3851 | 23.4955 | | hf_DistilBert | 8 | 31.0084 | 32.0366 | 41.9416 | 84.3 | 20.9185 | 21.4268 | | resnet50 | 32 | 32.9756 | 32.7151 | 32.2899 | 40.6263 | 19.377 | 25.274 | | BERT_pytorch | 16 | 53.4137 | 64.3188 | 35.1079 | 64.0114 | 16.8148 | 24.5399 | | densenet121 | 4 | 72.3286 | 80.3884 | 29.3381 | 102.9918 | 13.2233 | 58.3138 | | timm_resnest | 32 | 24.4506 | 23.9158 | 29.3264 | 25.2122 | 12.8637 | 14.5053 | | mobilenet_v3_large | 32 | 34.9006 | 34.9728 | 23.7628 | 46.9548 | 12.0078 | 26.4533 | | mnasnet1_0 | 32 | 28.5992 | 28.8363 | 23.1413 | 37.0817 | 11.5527 | 21.9922 | | pytorch_stargan | 16 | 16.0501 | 14.2672 | 15.4931 | nan | 11.1269 | 11.6186 | | nvidia_deeprecommender | 256 | 10.3855 | 10.4226 | 14.8858 | 10.3017 | 10.4781 | 10.0711 | | timm_vision_transformer | 8 | 28.829 | 40.3897 | 16.5749 | 48.9474 | 10.0316 | 19.8075 | | resnext50_32x4d | 8 | 28.2974 | 29.7425 | 15.3557 | 38.4593 | 8.5444 | 23.2626 | | pytorch_CycleGAN_and_pix2pix | 1 | 19.9731 | 17.9113 | 10.1903 | nan | 8.3045 | 11.5685 | | 
LearningToPaint | 96 | 14.5067 | 14.6039 | 12.9333 | 17.6201 | 8.1297 | 11.2536 | | alexnet | 128 | 9.8067 | 9.8519 | 12.0279 | 10.5081 | 8.1083 | 8.1144 | | tts_angular | 64 | 6.8174 | 7.0105 | 6.7163 | 6.7946 | 6.4043 | 6.3476 | | squeezenet1_1 | 32 | 14.6811 | 15.1956 | 10.1869 | 21.7087 | 6.285 | 11.756 | | resnet18 | 16 | 12.6152 | 12.7414 | 8.1071 | 17.6931 | 4.8105 | 13.0486 | | functorch_dp_cifar10 | 64 | 13.8591 | 14.5911 | 5.833 | nan | 2.9682 | 10.7917 | | pytorch_struct | 200 | 4.5509 | 6.0636 | 4.4969 | 7.4708 | 2.265 | 3.7417 | | drq | 1 | 3.8832 | 4.7248 | 2.097 | 7.4037 | 1.3109 | 4.321 | | dcgan | 32 | 3.1799 | 3.2841 | 1.8632 | 5.4444 | 1.0908 | 3.1048 | | soft_actor_critic | 256 | 1.4211 | 1.8926 | 1.0609 | 2.6878 | 0.858 | 1.4763 | | lennard_jones | 1000 | 1.4572 | 1.8834 | 1.1444 | 3.1414 | 0.7353 | 1.4494 | | tacotron2 | 64 | 3066.9501 | 4159.6532 | 3006.1003 | 4999.036 | nan | 3505.0783 | | hf_GPT2_large | 4 | 209.9992 | 211.6118 | nan | nan | nan | 112.6231 | | hf_BigBird | 2 | 192.0221 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.023 | 0.8512 | 2.3733 | 0.0 | 4.7559 | 1.6071 |
| MobileBertForMaskedLM | 32 | 1.0187 | 0.8284 | 1.9808 | 0.0 | 4.1381 | 1.7958 |
| CamemBert | 1 | 1.0396 | 0.8628 | 1.7712 | 0.0 | 3.6115 | 1.8082 |
| MobileBertForQuestionAnswering | 64 | 1.0162 | 0.8373 | 1.5818 | 0.0 | 3.5117 | 1.8042 |
| MT5ForConditionalGeneration | 8 | 1.0188 | 0.8798 | 1.7866 | 0.8816 | 3.4557 | 2.51 |
| DistillGPT2 | 1 | 1.0341 | 0.8894 | 1.4397 | 0.0 | 2.6231 | 1.9912 |
| GPT2ForSequenceClassification | 4 | 1.0006 | 0.9762 | 0.0 | 0.5045 | 2.316 | 2.2766 |
| PLBartForConditionalGeneration | 16 | 1.0135 | 0.8504 | 1.0555 | 0.0 | 2.2894 | 1.6698 |
| M2M100ForConditionalGeneration | 8 | 1.1085 | 0.8721 | 1.3494 | 0.7037 | 2.2196 | 1.7488 |
| ElectraForQuestionAnswering | 64 | 1.0007 | 0.978 | 0.7618 | 0.0 | 2.0322 | 1.9835 |
| MegatronBertForQuestionAnswering | 16 | 1.0304 | 0.869 | 1.0398 | 0.0 | 1.9482 | 1.7948 |
| MegatronBertForCausalLM | 16 | 1.0275 | 0.8613 | 0.9628 | 0.0 | 1.8078 | 1.7238 |
| LayoutLMForSequenceClassification | 16 | 0.9999 | 0.9802 | 0.7763 | 0.0 | 1.7914 | 1.7577 |
| XGLMForCausalLM | 8 | 1.0133 | 0.8329 | 0.9973 | 0.0 | 1.7706 | 1.55 |
| ElectraForCausalLM | 32 | 0.9999 | 0.9418 | 0.7071 | 0.0 | 1.7521 | 1.7553 |
| T5Small | 1 | 1.0261 | 0.897 | 1.3585 | 0.8646 | 1.6521 | 1.4094 |
| AlbertForQuestionAnswering | 4 | 0.9998 | 0.886 | 0.0 | 0.0 | 1.6476 | 1.639 |
| AlbertForMaskedLM | 4 | 1.0003 | 0.8852 | 0.0 | 0.0 | 1.6386 | 1.6263 |
| MBartForConditionalGeneration | 16 | 1.0103 | 0.85 | 0.9121 | 0.0 | 1.6285 | 1.6836 |
| PegasusForConditionalGeneration | 16 | 1.0113 | 0.8446 | 1.0649 | 0.642 | 1.6093 | 1.5148 |
| T5ForConditionalGeneration | 4 | 1.0025 | 0.9174 | 0.7559 | 1.1602 | 1.5959 | 1.5798 |
| LayoutLMForMaskedLM | 16 | 1.001 | 0.9717 | 0.7569 | 0.0 | 1.5945 | 1.5719 |
| OPTForCausalLM | 32 | 1.0107 | 0.9248 | 0.7788 | 0.3387 | 1.5161 | 1.5032 |
| Speech2Text2ForCausalLM | 128 | 1.0023 | 0.9378 | 0.7213 | 0.8059 | 1.494 | 1.4885 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.9841 | 0.7799 | 0.0 | 1.4504 | 1.4249 |
| DistilBertForQuestionAnswering | 64 | 1.0001 | 0.969 | 0.7319 | 0.3649 | 1.4477 | 1.3945 |
| BertForQuestionAnswering | 128 | 1.0 | 0.9844 | 0.7726 | 0.0 | 1.4356 | 1.4122 |
| BartForConditionalGeneration | 2 | 1.0045 | 0.9696 | 0.0 | 0.0 | 1.4202 | 1.3885 |
| BartForCausalLM | 4 | 1.0016 | 0.9697 | 0.7577 | 0.0 | 1.4172 | 1.4131 |
| RobertaForCausalLM | 64 | 1.0001 | 0.9592 | 0.7536 | 0.0 | 1.3982 | 1.369 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0057 | 0.9254 | 0.7443 | 0.0 | 1.3745 | 1.38 |
| BertForMaskedLM | 64 | 1.0007 | 0.9573 | 0.7403 | 0.0 | 1.292 | 1.2818 |
| DebertaForMaskedLM | 4 | 0.916 | 0.7317 | 0.796 | 0.0 | 1.2696 | 1.1296 |
| PLBartForCausalLM | 32 | 1.0061 | 0.943 | 0.7999 | 0.8389 | 1.2328 | 1.2338 |
| DistilBertForMaskedLM | 64 | 1.0002 | 0.9522 | 0.7093 | 0.4653 | 1.213 | 1.2134 |
| BlenderbotSmallForCausalLM | 64 | 1.0028 | 0.9287 | 0.7155 | 0.0 | 1.2107 | 1.218 |
| MBartForCausalLM | 32 | 1.0029 | 0.95 | 0.7553 | 0.0 | 1.1653 | 1.1619 |
| TrOCRForCausalLM | 32 | 1.0012 | 0.9482 | 0.7564 | 0.0 | 1.1593 | 1.1593 |
| PegasusForCausalLM | 32 | 0.9993 | 0.9509 | 0.7496 | 0.8518 | 1.1434 | 1.1415 |
| DebertaForQuestionAnswering | 8 | 0.9946 | 0.8816 | 0.7234 | 0.0 | 1.1387 | 1.2253 |
| BigBird | 1 | 0.9705 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass | | YituTechConvBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | 
fail_to_run | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass | | OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | T5Small | 1 | pass | pass | pass | pass | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 
| fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaForQuestionAnswering | 8 | 5.0053 | 10.8633 | 38.0649 | nan | 106.2141 | 39.377 | | DebertaForMaskedLM | 4 | 4.912 | 10.8712 | 36.5461 | nan | 101.9292 | 38.891 | | M2M100ForConditionalGeneration | 8 | 4.0884 | 16.4294 | 27.7581 | 432.0911 | 82.5774 | 69.9489 | | MobileBertForMaskedLM | 32 | 9.8254 | 32.2283 | 58.2531 | nan | 80.9408 | 78.8002 | | XGLMForCausalLM | 8 | 3.2095 | 13.4777 | 28.5817 | nan | 80.4429 | 79.928 | | MobileBertForQuestionAnswering | 64 | 9.8372 | 32.5362 | 58.7744 | nan | 79.3363 | 77.1771 | | PegasusForConditionalGeneration | 16 | 3.7816 | 16.6547 | 27.4771 | 448.9291 | 59.4737 | 55.1433 | | BartForConditionalGeneration | 2 | 3.9995 | 17.2362 | nan | nan | 58.9034 | 57.2259 | | MBartForConditionalGeneration | 16 | 3.9426 | 17.1008 | 29.5611 | nan | 58.7128 | 58.5004 | | YituTechConvBert | 1 | 2.7849 | 10.5629 | 17.0975 | nan | 52.9799 | 49.3657 | | MegatronBertForCausalLM | 16 | 4.0292 | 14.5487 | 22.7542 | nan | 46.5658 | 45.2322 | | MegatronBertForQuestionAnswering | 16 | 4.1712 | 14.2815 | 22.2121 | nan | 45.9923 | 44.6161 | | MT5ForConditionalGeneration | 8 | 4.0206 | 12.7667 | 21.4795 | 180.1495 | 43.4249 | 41.9945 | | BlenderbotSmallForConditionalGeneration | 64 | 2.4924 | 11.1852 | 18.6043 | nan | 39.4127 | 38.4144 | | LayoutLMForSequenceClassification | 16 | 2.2211 | 7.7008 | 
11.6012 | nan | 33.3582 | 29.3079 | | PLBartForConditionalGeneration | 16 | 2.0231 | 8.6998 | 13.5434 | nan | 33.2959 | 32.5229 | | T5ForConditionalGeneration | 4 | 2.618 | 8.6264 | 13.6659 | 109.2875 | 33.0283 | 32.0727 | | T5Small | 1 | 2.5882 | 8.5786 | 13.1046 | 107.854 | 32.4705 | 35.3793 | | ElectraForCausalLM | 32 | 1.9992 | 7.1284 | 11.2587 | nan | 30.0607 | 27.5549 | | PegasusForCausalLM | 32 | 1.548 | 6.4598 | 10.4069 | 128.458 | 25.9484 | 24.3998 | | LayoutLMForMaskedLM | 16 | 2.2769 | 7.4763 | 11.6107 | nan | 25.073 | 24.2153 | | MBartForCausalLM | 32 | 1.4565 | 6.5918 | 9.865 | nan | 24.3527 | 23.2976 | | BertForMaskedLM | 64 | 1.871 | 6.97 | 10.8205 | nan | 24.2761 | 23.5597 | | RobertaForCausalLM | 64 | 1.8439 | 7.514 | 10.7459 | nan | 24.2146 | 22.9694 | | ElectraForQuestionAnswering | 64 | 2.0007 | 7.25 | 10.845 | nan | 24.0319 | 23.2337 | | TrOCRForCausalLM | 32 | 1.4478 | 6.4379 | 10.174 | nan | 23.5592 | 26.1795 | | BartForCausalLM | 4 | 1.6018 | 6.6328 | 9.7988 | nan | 23.4204 | 22.5184 | | OPTForCausalLM | 32 | 1.527 | 6.5589 | 11.9421 | 137.0388 | 23.1623 | 22.2917 | | BertForQuestionAnswering | 128 | 1.8599 | 7.0178 | 11.1009 | nan | 22.96 | 21.9982 | | RobertaForQuestionAnswering | 128 | 1.863 | 6.9387 | 10.7909 | nan | 22.138 | 21.1653 | | CamemBert | 1 | 1.8946 | 7.0989 | 10.4703 | nan | 21.1935 | 20.4914 | | AlbertForMaskedLM | 4 | 1.5729 | 6.5929 | nan | nan | 20.8673 | 19.6691 | | AlbertForQuestionAnswering | 4 | 1.853 | 6.8415 | nan | nan | 20.438 | 19.2145 | | GPT2ForSequenceClassification | 4 | 1.8048 | 6.259 | nan | 114.5363 | 19.7876 | 19.3891 | | BlenderbotSmallForCausalLM | 64 | 1.035 | 4.3533 | 6.6203 | nan | 16.9393 | 16.5913 | | Speech2Text2ForCausalLM | 128 | 0.8683 | 3.3732 | 5.6817 | 60.4049 | 15.3388 | 14.5396 | | PLBartForCausalLM | 32 | 0.8404 | 3.4242 | 5.0753 | 71.418 | 14.9087 | 14.6225 | | DistilBertForMaskedLM | 64 | 0.7917 | 3.4209 | 6.095 | 65.135 | 14.2896 | 13.9377 | | DistilBertForQuestionAnswering | 64 | 
0.8229 | 3.4655 | 5.8615 | 67.4199 | 13.934 | 13.3807 | | DistillGPT2 | 1 | 0.9639 | 3.4237 | 4.9032 | nan | 13.8903 | 13.4416 | | BigBird | 1 | 4.0997 | nan | nan | nan | nan | nan | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | nan | 1.1872 | 1.0783 | 1.1717 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.0323 | 1.5286 | | BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0218 | 1.0756 | | AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0074 | 1.5007 | | LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 0.9844 | 1.025 | | BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 | | RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 | | ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 0.9829 | 1.0613 | | BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9691 | 1.1807 | | T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9658 | 1.1446 | | T5Small | 1 | 1.0 | 0.8935 | 0.3618 | 0.9973 | 0.9652 | 1.1096 | | PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | 1.1321 | 0.9327 | 0.9866 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.4377 | 1.1462 | 0.9159 | 1.0769 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9124 | 0.9464 | | 
BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9037 | 1.0411 | | MBartForCausalLM | 32 | 1.0 | 0.8924 | 0.3996 | nan | 0.9006 | 0.9641 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0053 | | MegatronBertForCausalLM | 16 | 1.0001 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0207 | | DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3468 | 1.0551 | 0.89 | 0.9848 | | BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3786 | nan | 0.8834 | 0.9285 | | RobertaForCausalLM | 64 | 1.0 | 0.8993 | 0.3787 | nan | 0.8829 | 0.9282 | | TrOCRForCausalLM | 32 | 1.0 | 0.8921 | 0.3997 | nan | 0.8816 | 0.9425 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | 0.4002 | nan | 0.8755 | 1.0595 | | MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | 0.4067 | 0.919 | 0.875 | 0.919 | | OPTForCausalLM | 32 | 1.0003 | 0.8678 | 0.3726 | 1.0333 | 0.8727 | 0.9449 | | PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | 0.4145 | nan | 0.8523 | 0.9876 | | DistilBertForMaskedLM | 64 | 0.9999 | 0.8599 | 0.3635 | 1.0791 | 0.8215 | 0.8801 | | CamemBert | 1 | 0.999 | 0.8143 | 0.4159 | nan | 0.8065 | 0.9306 | | XGLMForCausalLM | 8 | 0.9918 | 0.9164 | 0.4336 | nan | 0.8055 | 0.9513 | | DistillGPT2 | 1 | 0.9975 | 0.8033 | 0.402 | nan | 0.8047 | 0.9949 | | Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.3532 | 1.0437 | 0.8039 | 0.898 | | PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3978 | 0.9947 | 0.7975 | 0.8675 | | ElectraForCausalLM | 32 | 0.9977 | 0.848 | 0.3928 | nan | 0.7949 | 0.8607 | | YituTechConvBert | 1 | 0.9718 | 0.868 | 0.4316 | nan | 0.7909 | 0.9314 | | BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.778 | 0.859 | | M2M100ForConditionalGeneration | 8 | 1.0119 | 0.9402 | 0.4407 | 1.0461 | 0.7464 | 0.9888 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | 0.3466 | nan | 0.5931 | 0.7994 | | MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | 0.3107 | nan | 0.4995 | 0.635 | | DebertaForMaskedLM | 4 | 0.9982 
| 0.9825 | 0.3623 | nan | 0.409 | 1.026 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3252 | nan | 0.3071 | 1.1616 |
| BigBird | 1 | 0.974 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForMaskedLM | 4 | 266.9521 | 301.5962 | nan | nan | 163.2106 | 164.2222 |
| AlbertForQuestionAnswering | 4 | 264.904 | 299.0028 | nan | nan | 160.9414 | 161.7396 |
| BartForConditionalGeneration | 2 | 135.335 | 140.2064 | nan | nan | 95.8803 | 97.8339 |
| BlenderbotSmallForConditionalGeneration | 64 | 109.1332 | 118.8642 | 152.2104 | nan | 80.0629 | 79.8787 |
| BartForCausalLM | 4 | 110.4633 | 115.8195 | 148.2475 | nan | 79.3637 | 79.3852 |
| BertForQuestionAnswering | 128 | 110.3952 | 112.2704 | 143.1348 | nan | 77.0036 | 78.3011 |
| RobertaForQuestionAnswering | 128 | 111.0822 | 112.8641 | 142.5201 | nan | 76.6426 | 77.9808 |
| LayoutLMForMaskedLM | 16 | 112.2152 | 115.5223 | 148.1527 | nan | 70.3486 | 71.3149 |
| MBartForConditionalGeneration | 16 | 103.0078 | 122.1017 | 114.4022 | nan | 67.051 | 71.4102 |
| PegasusForConditionalGeneration | 16 | 102.3325 | 123.2183 | 113.1074 | 160.1931 | 66.9221 | 72.7297 |
| DebertaForQuestionAnswering | 8 | 75.5938 | 85.2083 | 103.9052 | nan | 66.2717 | 61.1782 |
| T5ForConditionalGeneration | 4 | 101.2693 | 110.2676 | 135.4048 | 86.9306 | 63.6596 | 64.8298 |
| PegasusForCausalLM | 32 | 69.0074 | 72.8353 | 91.6064 | 81.1257 | 60.5009 | 60.2206 |
| MBartForCausalLM | 32 | 69.6597 | 73.2865 | 92.3097 | nan | 60.0437 | 60.0024 |
| TrOCRForCausalLM | 32 | 69.6308 | 73.3982 | 92.2398 | nan | 59.9682 | 59.9341 |
| BertForMaskedLM | 64 | 75.5884 | 78.8327 | 101.8918 | nan | 58.3831 | 58.912 |
| RobertaForCausalLM | 64 | 80.4294 | 83.9479 | 106.7778 | nan | 57.5238 | 58.7517 |
| ElectraForQuestionAnswering | 64 | 115.7706 | 117.174 | 150.5546 | nan | 56.4117 | 57.8307 |
| LayoutLMForSequenceClassification | 16 | 97.3801 | 99.2988 | 125.2977 | nan | 54.3353 | 55.3349 |
| MobileBertForQuestionAnswering | 64 | 174.0939 | 213.398 | 122.9075 | nan | 53.3111 | 102.9513 |
| XGLMForCausalLM | 8 | 90.7809 | 122.2079 | 94.0987 | nan | 52.7662 | 58.6776 |
| M2M100ForConditionalGeneration | 8 | 106.3134 | 123.281 | 86.1459 | 154.6161 | 50.7324 | 63.7447 |
| ElectraForCausalLM | 32 | 87.3366 | 92.4668 | 123.4249 | nan | 49.7829 | 49.7728 |
| DebertaForMaskedLM | 4 | 66.6024 | 84.5511 | 78.2148 | nan | 49.509 | 55.9254 |
| BlenderbotSmallForCausalLM | 64 | 59.1552 | 63.2606 | 82.4293 | nan | 48.4321 | 48.027 |
| MegatronBertForCausalLM | 16 | 94.8631 | 95.208 | 83.7003 | nan | 47.2993 | 49.2888 |
| MobileBertForMaskedLM | 32 | 170.0763 | 211.7235 | 108.0622 | nan | 43.577 | 102.1606 |
| MegatronBertForQuestionAnswering | 16 | 86.7728 | 93.6126 | 76.6768 | nan | 43.5463 | 46.9343 |
| GPT2ForSequenceClassification | 4 | 90.5299 | 92.6374 | nan | 179.5787 | 39.1271 | 39.7707 |
| T5Small | 1 | 60.6713 | 68.7571 | 53.3051 | 71.1073 | 38.7357 | 45.3372 |
| DistilBertForMaskedLM | 64 | 45.3072 | 47.5715 | 63.719 | 97.1505 | 37.3171 | 37.2726 |
| OPTForCausalLM | 32 | 53.1116 | 58.917 | 70.0522 | 159.4732 | 35.5281 | 35.8728 |
| PLBartForCausalLM | 32 | 39.019 | 41.6661 | 49.3986 | 46.2349 | 31.7964 | 31.6459 |
| PLBartForConditionalGeneration | 16 | 54.1736 | 65.4586 | 53.1125 | nan | 30.8024 | 34.3804 |
| MT5ForConditionalGeneration | 8 | 88.3865 | 101.1361 | 58.0526 | 99.2192 | 26.6203 | 36.7408 |
| DistilBertForQuestionAnswering | 64 | 30.4725 | 31.553 | 41.8537 | 83.5842 | 21.0892 | 21.8199 |
| Speech2Text2ForCausalLM | 128 | 30.1301 | 32.5705 | 42.0848 | 37.5983 | 20.4233 | 20.671 |
| YituTechConvBert | 1 | 62.1334 | 73.7962 | 27.1077 | nan | 13.8335 | 47.543 |
| CamemBert | 1 | 36.7979 | 45.1737 | 21.829 | nan | 11.1809 | 22.2068 |
| DistillGPT2 | 1 | 19.9483 | 23.2744 | 16.1926 | nan | 8.0012 | 10.6644 |
| BigBird | 1 | 191.7525 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
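
For anyone who wants to sanity-check the absolute latency numbers above, here is a minimal sketch of deriving a per-model latency ratio and a geometric-mean summary from them. It assumes the ratio is eager latency divided by backend latency and that `nan` entries (backend failed to run) are simply skipped; neither is confirmed as the dashboard's own method, so the results need not match the "Performance speedup" tables. The helper names `latency_ratio` and `geomean` are made up for illustration.

```python
import math

# A few (eager_ms, inductor_ms) pairs copied from the
# "Absolute latency (ms)" table above.
latency_ms = {
    "AlbertForMaskedLM": (266.9521, 163.2106),
    "DistillGPT2": (19.9483, 8.0012),
    "BigBird": (191.7525, float("nan")),  # inductor column is nan
}


def latency_ratio(eager_ms, backend_ms):
    """Return eager/backend latency, or None if either value is unusable.

    Assumption: this ratio is only an approximation of the dashboard's
    speedup metric, whose exact definition is not given in this thread.
    """
    if math.isnan(eager_ms) or math.isnan(backend_ms) or backend_ms <= 0:
        return None
    return eager_ms / backend_ms


def geomean(values):
    """Geometric mean over the positive, non-None entries (nan if none)."""
    usable = [v for v in values if v is not None and v > 0]
    if not usable:
        return float("nan")
    return math.exp(sum(math.log(v) for v in usable) / len(usable))


ratios = {name: latency_ratio(*pair) for name, pair in latency_ms.items()}
summary = geomean(ratios.values())

for name, ratio in ratios.items():
    shown = "n/a" if ratio is None else f"{ratio:.4f}"
    print(f"{name}: {shown}")
print(f"geomean over models that ran: {summary:.4f}")
```

Skipping failed models makes the summary look better than one that counts failures as 0.0 (as the speedup tables appear to do for some entries), so the two should not be compared directly.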

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | regnety_002 | 128 | 0.9771 | 0.9431 | 1.1198 | 0.8636 | 2.2129 | 1.4268 | | ghostnet_100 | 128 | 1.0036 | 0.9797 | 0.8892 | 1.0226 | 2.1783 | 1.7544 | | xcit_large_24_p8_224 | 5 | 0.9995 | 0.0 | 0.0 | 0.0 | 2.134 | 1.8451 | | lcnet_050 | 128 | 0.9724 | 0.9427 | 0.8444 | 1.0481 | 2.04 | 1.6299 | | tnt_s_patch16_224 | 128 | 0.9998 | 0.9981 | 0.0 | 0.0 | 1.9241 | 1.8952 | | twins_pcpvt_base | 64 | 1.0051 | 0.9219 | 1.0266 | 0.0 | 1.6647 | 1.6556 | | coat_lite_mini | 128 | 1.0 | 0.9931 | 0.847 | 1.1147 | 1.6088 | 1.5789 | | res2net101_26w_4s | 64 | 1.0033 | 1.0042 | 0.9591 | 0.0 | 1.5896 | 1.326 | | dla102 | 128 | 1.0003 | 0.9963 | 0.8389 | 1.3175 | 1.581 | 1.5491 | | hrnet_w18 | 128 | 1.0037 | 1.0249 | 0.8611 | 0.0 | 1.5771 | 1.4459 | | volo_d1_224 | 64 | 0.9995 | 0.991 | 0.846 | 0.0 | 1.5548 | 1.5172 | | nfnet_l0 | 128 | 0.9993 | 0.8102 | 0.7124 | 0.8493 | 1.5482 | 1.471 | | gmixer_24_224 | 128 | 1.0 | 0.8794 | 0.7225 | 0.9283 | 1.5422 | 1.4864 | | gmlp_s16_224 | 128 | 0.9998 | 0.9958 | 0.7866 | 1.015 | 1.5258 | 1.4592 | | resnest101e | 64 | 0.9999 | 0.9913 | 0.8118 | 0.0 | 1.5203 | 1.433 | | adv_inception_v3 | 128 | 1.0001 | 0.9967 | 0.8532 | 1.143 | 1.5056 | 1.4712 | | inception_v3 | 128 | 0.9999 | 0.9968 | 0.8539 | 1.1446 | 1.5032 | 1.4672 | | gluon_inception_v3 | 128 | 0.9999 | 0.9961 | 0.8536 | 1.137 | 1.4997 | 1.4655 | | dm_nfnet_f0 | 128 | 0.999 | 0.9993 | 0.8802 | 0.9252 | 1.4969 | 1.43 | | swin_base_patch4_window7_224 | 64 | 1.0 | 0.9591 | 0.0 | 0.0 | 1.4832 | 1.4518 | | res2net50_14w_8s | 128 | 0.9999 | 0.9933 | 0.8117 | 1.0088 | 1.4628 | 1.4043 | | 
mobilenetv3_large_100 | 128 | 0.9552 | 0.9449 | 0.7817 | 0.9951 | 1.4531 | 1.4484 | | cait_m36_384 | 4 | 1.0002 | 0.0 | 0.0 | 0.0 | 1.4427 | 1.3666 | | selecsls42b | 128 | 0.9999 | 0.9952 | 0.8421 | 1.281 | 1.4422 | 1.412 | | mnasnet_100 | 128 | 0.954 | 0.9435 | 0.7882 | 1.2161 | 1.4307 | 1.4505 | | res2next50 | 128 | 0.9993 | 0.9962 | 0.8323 | 1.1462 | 1.4173 | 1.346 | | mobilenetv2_100 | 128 | 0.9511 | 0.9417 | 0.7051 | 1.1385 | 1.4036 | 1.4323 | | crossvit_9_240 | 128 | 1.0001 | 0.9954 | 0.8379 | 0.9227 | 1.4004 | 1.3719 | | fbnetv3_b | 128 | 0.9527 | 0.9504 | 0.7732 | 0.0 | 1.3922 | 1.3973 | | mobilevit_s | 64 | 0.9736 | 0.8145 | 0.6568 | 0.0 | 1.3763 | 1.3623 | | ese_vovnet19b_dw | 128 | 0.9704 | 0.9647 | 0.7676 | 1.1277 | 1.376 | 1.3749 | | jx_nest_base | 32 | 0.9995 | 0.9928 | 0.7963 | 0.0 | 1.3552 | 1.3271 | | fbnetc_100 | 128 | 0.9534 | 0.9435 | 0.7927 | 1.1694 | 1.3535 | 1.3782 | | tf_efficientnet_b0 | 128 | 0.9658 | 0.8079 | 0.6626 | 0.9435 | 1.3499 | 1.3567 | | resmlp_12_224 | 128 | 1.0003 | 0.9966 | 0.7816 | 1.491 | 1.3426 | 1.3092 | | spnasnet_100 | 128 | 0.9464 | 0.9343 | 0.7771 | 1.1069 | 1.3386 | 1.3965 | | convit_base | 64 | 1.0002 | 0.9966 | 0.8329 | 1.2337 | 1.3358 | 1.341 | | poolformer_m36 | 64 | 0.9999 | 0.998 | 0.8027 | 0.0 | 1.3277 | 1.2971 | | botnet26t_256 | 128 | 0.9803 | 0.9751 | 0.8095 | 1.2793 | 1.3253 | 1.3299 | | pit_b_224 | 64 | 0.9997 | 0.9951 | 0.822 | 0.9653 | 1.3146 | 1.3092 | | pnasnet5large | 16 | 1.0063 | 1.0364 | 0.8522 | 0.0 | 1.3137 | 1.2556 | | cspdarknet53 | 64 | 0.9434 | 0.9346 | 0.7553 | 1.1472 | 1.2881 | 1.3166 | | rexnet_100 | 128 | 0.9651 | 0.8495 | 0.6901 | 0.0 | 1.2799 | 1.277 | | eca_botnext26ts_256 | 128 | 0.9797 | 0.8116 | 0.6718 | 1.0732 | 1.2744 | 1.2656 | | tinynet_a | 128 | 0.96 | 0.8027 | 0.6531 | 0.7932 | 1.2682 | 1.2559 | | mixer_b16_224 | 128 | 1.0001 | 0.9979 | 0.8024 | 0.9021 | 1.2572 | 1.2467 | | beit_base_patch16_224 | 64 | 1.0001 | 0.9791 | 0.0 | 0.0 | 1.2422 | 1.2336 | | 
deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9918 | 0.7975 | 0.9776 | 1.2371 | 1.2226 | | visformer_small | 128 | 1.0 | 0.9981 | 0.8417 | 0.0 | 1.2289 | 1.1765 | | sebotnet33ts_256 | 64 | 0.9665 | 0.8367 | 0.6796 | 0.9719 | 1.1926 | 1.2078 | | tf_mixnet_l | 128 | 0.9807 | 0.9099 | 0.7953 | 0.0 | 1.1767 | 1.1743 | | mixnet_l | 128 | 0.9797 | 0.9043 | 0.7941 | 0.0 | 1.1625 | 1.1561 | | vit_base_patch16_224 | 64 | 1.0 | 0.9936 | 0.8315 | 0.9125 | 1.1583 | 1.147 | | gluon_xception65 | 32 | 0.9995 | 0.9894 | 0.7544 | 0.0 | 1.156 | 1.1248 | | dpn107 | 32 | 0.9405 | 0.9273 | 0.7504 | 0.0 | 1.1507 | 1.1659 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.9809 | 0.7915 | 0.0 | 1.1381 | 1.0542 | | repvgg_a2 | 128 | 0.943 | 0.9316 | 0.7998 | 1.0737 | 1.0826 | 1.1149 | | gernet_l | 128 | 0.946 | 0.9376 | 0.7694 | 1.0633 | 1.0692 | 1.078 | | convmixer_768_32 | 32 | 0.9999 | 0.998 | 0.9232 | 0.0 | 1.0561 | 1.0512 | | convnext_base | 64 | 0.9995 | 0.9952 | 0.8007 | 0.0 | 0.6628 | 0.6485 | | eca_halonext26ts | 128 | 0.9811 | 0.8169 | 0.6787 | 0.0 | 0.0 | 0.0 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass 
| | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass | | resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | 
deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | 
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 6.9007 | 30.5955 | 56.8885 | nan | 142.0493 | 134.58 | | twins_pcpvt_base | 64 | 2.9985 | 14.5712 | 26.2034 | nan | 126.7577 | 125.0894 | | pnasnet5large | 16 | 5.3033 | 22.6972 | 41.7597 | nan | 90.9169 | 85.7682 | | xcit_large_24_p8_224 | 5 | 3.4767 | nan | nan | nan | 90.472 | 86.0284 | | cait_m36_384 | 4 | 3.7466 | nan | nan | nan | 83.0326 | 78.6723 | | swin_base_patch4_window7_224 | 64 | 3.1785 | 13.1678 | nan | nan | 81.111 | 78.185 | | resnest101e | 64 | 3.5908 | 16.5844 | 27.3075 | nan | 74.5208 | 71.6586 | | convnext_base | 64 | 1.5831 | 6.9635 | 11.2546 | nan | 74.1408 | 71.0828 | | mobilevit_s | 64 | 2.0146 | 7.4407 | 15.4149 | nan | 67.9259 | 65.9194 | | jx_nest_base | 32 | 1.9244 | 9.088 | 15.64 | nan | 64.674 | 61.4967 | | res2net101_26w_4s | 64 | 3.516 | 16.8031 | 28.3034 | nan | 62.0216 | 58.9645 | | coat_lite_mini | 128 | 1.3027 | 5.2528 | 8.3466 | 115.795 | 60.1315 | 58.6418 | | res2net50_14w_8s | 128 | 3.1203 | 14.5771 | 24.4511 | 341.1423 | 56.0854 | 53.3611 | | poolformer_m36 | 64 | 1.795 | 7.3019 | 12.2538 | nan | 54.2112 | 48.8877 | | sebotnet33ts_256 | 64 | 1.7916 | 6.1901 | 13.2208 | 150.2645 | 46.7957 | 45.5211 | | gmlp_s16_224 | 128 | 1.3615 | 7.3869 | 12.0575 | 195.8462 | 45.8264 | 43.3694 | | dpn107 | 32 | 4.3023 | 13.7204 | 39.7398 | nan | 45.7396 | 42.9415 | | gluon_xception65 | 32 | 2.2871 | 10.9133 | 18.0303 | nan | 45.4533 | 42.3356 | | fbnetv3_b | 128 | 3.531 | 12.2577 
| 28.3976 | nan | 44.5772 | 42.2501 | | crossvit_9_240 | 128 | 1.8601 | 8.5726 | 13.4709 | 200.305 | 43.8539 | 41.9383 | | volo_d1_224 | 64 | 1.4365 | 7.708 | 12.0857 | nan | 43.4372 | 42.9169 | | tnt_s_patch16_224 | 128 | 1.9915 | 10.6722 | nan | nan | 41.7693 | 39.9302 | | gluon_inception_v3 | 128 | 1.8037 | 8.6705 | 13.2891 | 189.1412 | 40.0796 | 36.573 | | eca_botnext26ts_256 | 128 | 1.4679 | 4.8502 | 10.3876 | 121.5836 | 39.4373 | 39.4923 | | inception_v3 | 128 | 1.8013 | 8.5102 | 13.5949 | 184.4868 | 38.4091 | 35.658 | | tf_mixnet_l | 128 | 5.9511 | 13.093 | 26.7344 | nan | 38.2248 | 34.9963 | | adv_inception_v3 | 128 | 1.7649 | 8.5658 | 13.3652 | 185.5465 | 37.8293 | 35.8741 | | dla102 | 128 | 2.035 | 9.384 | 15.1033 | 246.3103 | 37.5115 | 35.1769 | | ghostnet_100 | 128 | 3.3395 | 9.8045 | 14.6424 | 195.6781 | 37.1634 | 36.4913 | | mixnet_l | 128 | 5.7265 | 12.3749 | 26.3087 | nan | 36.6209 | 34.3964 | | swsl_resnext101_32x16d | 32 | 2.0429 | 9.392 | 15.5568 | nan | 36.1883 | 34.2833 | | gmixer_24_224 | 128 | 1.6067 | 8.2394 | 13.7733 | 187.0454 | 36.1528 | 34.2341 | | botnet26t_256 | 128 | 1.4602 | 4.3133 | 9.869 | 92.7558 | 34.2706 | 33.4399 | | dm_nfnet_f0 | 128 | 2.1816 | 7.2933 | 10.6929 | 167.0033 | 33.6265 | 31.2056 | | res2next50 | 128 | 1.8635 | 8.1476 | 12.6765 | 205.124 | 31.7987 | 30.949 | | convit_base | 64 | 1.435 | 6.1526 | 9.6702 | 151.9744 | 30.8169 | 30.0641 | | rexnet_100 | 128 | 2.0846 | 7.5705 | 17.0963 | nan | 30.4383 | 29.2987 | | tinynet_a | 128 | 2.2667 | 8.2083 | 19.834 | 198.7082 | 30.3582 | 28.3142 | | tf_efficientnet_b0 | 128 | 2.0066 | 6.9827 | 17.6255 | 184.2145 | 27.1838 | 24.6776 | | cspdarknet53 | 64 | 2.4655 | 7.5616 | 18.7844 | 149.8005 | 25.8301 | 26.1663 | | mixer_b16_224 | 128 | 0.9794 | 3.7969 | 5.9233 | 83.1889 | 25.7501 | 24.7588 | | visformer_small | 128 | 1.145 | 4.2021 | 6.2373 | nan | 25.3172 | 23.2639 | | fbnetc_100 | 128 | 2.199 | 6.9652 | 17.6241 | 137.701 | 25.2219 | 23.6579 | | 
deit_base_distilled_patch16_224 | 64 | 1.0595 | 4.738 | 7.3794 | 86.0029 | 25.0746 | 23.7605 | | spnasnet_100 | 128 | 2.1997 | 6.8544 | 16.8716 | 133.5756 | 25.019 | 23.5974 | | convmixer_768_32 | 32 | 1.3415 | 6.2675 | 10.3396 | nan | 24.8368 | 23.8252 | | nfnet_l0 | 128 | 2.0103 | 7.2146 | 10.6505 | 149.195 | 24.2334 | 22.1743 | | vit_base_patch16_224 | 64 | 1.0584 | 4.7814 | 7.9026 | 84.3127 | 24.1034 | 22.9243 | | pit_b_224 | 64 | 1.1205 | 5.2241 | 8.5352 | 109.7875 | 24.0128 | 23.0531 | | mobilenetv3_large_100 | 128 | 1.7145 | 5.6514 | 13.2705 | 144.1973 | 23.989 | 21.8396 | | resmlp_12_224 | 128 | 0.8243 | 3.2276 | 5.0201 | 51.0692 | 23.9701 | 22.8431 | | beit_base_patch16_224 | 64 | 1.3577 | 5.5999 | nan | nan | 22.651 | 21.8991 | | repvgg_a2 | 128 | 2.181 | 6.1759 | 15.4753 | 188.8896 | 22.2313 | 21.275 | | mobilenetv2_100 | 128 | 1.8728 | 5.6377 | 13.5701 | 121.4784 | 21.4423 | 20.4269 | | regnety_002 | 128 | 1.7608 | 5.8055 | 13.0595 | 116.9918 | 20.717 | 19.8477 | | mnasnet_100 | 128 | 1.875 | 5.4884 | 13.201 | 107.6343 | 20.6289 | 20.1514 | | gernet_l | 128 | 2.099 | 6.1261 | 15.4171 | 112.2221 | 20.3565 | 19.347 | | selecsls42b | 128 | 1.0095 | 3.8902 | 5.8459 | 92.1233 | 18.0785 | 17.2239 | | lcnet_050 | 128 | 1.1776 | 3.4617 | 7.6119 | 78.3924 | 14.9724 | 14.0531 | | ese_vovnet19b_dw | 128 | 1.12 | 3.1512 | 6.6557 | 67.1681 | 13.9969 | 13.826 | | eca_halonext26ts | 128 | 1.4742 | 5.0378 | 11.5687 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ 
| tinynet_a | 128 | 0.9889 | 0.7884 | 0.2764 | 0.4726 | 1.3706 | 1.5063 | | gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 | | gmlp_s16_224 | 128 | 0.9938 | 0.9715 | 0.3561 | 1.3557 | 1.2842 | 1.2997 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.548 | 1.1886 | 1.3558 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2733 | nan | 1.1741 | 1.3111 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3632 | nan | 1.1603 | 1.2933 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.1474 | 1.3179 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2669 | 0.476 | 1.1068 | 1.2643 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1021 | 1.1167 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 1.0592 | 1.1461 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 1.0587 | 1.152 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0576 | 1.1456 | | convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0441 | 1.1492 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 1.0332 | 1.1293 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.268 | 0.3766 | 1.0331 | 1.1821 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.0227 | 1.1355 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 0.9889 | 1.0322 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.9862 | 1.0421 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9746 | 0.9788 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3469 | nan | 0.9621 | 1.0521 | | dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9309 | 0.9555 | 1.031 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.9489 | 1.0707 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9397 | 1.076 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2875 | nan | 0.9363 | 1.0878 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9931 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.9307 | 1.0268 | | cait_m36_384 | 4 | 0.9998 | nan | nan | 
nan | 0.9288 | 0.9735 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.9181 | 1.0684 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9165 | 1.1168 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | nan | 0.9112 | 0.981 | | dpn107 | 32 | 0.997 | 0.9097 | 0.3531 | nan | 0.9072 | 0.9966 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8977 | 0.973 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8975 | 0.9763 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8975 | 1.0248 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.8973 | 0.9876 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.8969 | 1.0032 | | mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.8927 | 0.963 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8765 | 0.8926 | 0.9897 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3594 | 1.222 | 0.8877 | 0.8929 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8872 | 0.8923 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8795 | 0.9819 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.877 | 0.9738 | | convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.8729 | 0.9865 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.8719 | 0.9671 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.871 | 0.9804 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2716 | nan | 0.8701 | 1.0089 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3443 | 0.8161 | 0.8619 | 0.9858 | | cspdarknet53 | 64 | 0.9913 | 0.8405 | 0.3241 | 0.8382 | 0.8607 | 1.0102 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3361 | 0.8188 | 0.8449 | 0.9432 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 
0.8371 | 1.0078 | | resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.7981 | 0.8121 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3213 | 0.5513 | 0.745 | 0.8294 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.7194 | 1.0197 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.7141 | 0.9624 | | jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6644 | 0.8514 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.6295 | 0.7419 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5534 | 0.8298 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.2673 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 296.5191 | 297.0819 | 321.1167 | nan | 280.7711 | 282.2687 | | tnt_s_patch16_224 | 128 | 363.9733 | 364.6689 | nan | nan | 189.4311 | 192.2899 | | hrnet_w18 | 128 | 294.5869 | 291.3478 | 346.0568 | nan | 188.0633 | 206.256 | | convnext_base | 64 | 121.6835 | 121.841 | 151.7247 | nan | 183.1544 | 187.4591 | | pnasnet5large | 16 | 217.5093 | 212.1645 | 258.8648 | nan | 169.2901 | 174.0513 | | tf_mixnet_l | 128 | 195.3044 | 210.581 | 240.9617 | nan | 162.71 | 162.9212 | | mixnet_l | 128 | 187.023 | 202.3143 | 230.3907 | nan | 157.6619 | 158.2455 | | convit_base | 64 | 181.6249 | 182.0798 | 217.7128 | 146.7516 | 135.8596 | 135.358 | | pit_b_224 | 64 | 155.3306 | 155.6887 | 188.5042 | 160.4875 | 117.8381 | 118.434 | | cait_m36_384 | 4 | 166.2313 | nan | nan | nan | 117.3851 | 121.9282 | | 
dla102 | 128 | 178.8336 | 179.2373 | 213.3219 | 135.721 | 112.9863 | 115.1428 | | poolformer_m36 | 64 | 148.7928 | 149.0967 | 185.8263 | nan | 112.3963 | 114.654 | | beit_base_patch16_224 | 64 | 134.9744 | 137.8036 | nan | nan | 108.7038 | 109.5907 | | gluon_inception_v3 | 128 | 161.2351 | 161.7455 | 189.1576 | 141.9377 | 107.6108 | 109.91 | | resnest101e | 64 | 164.2291 | 164.4998 | 199.8098 | nan | 107.4682 | 114.6958 | | adv_inception_v3 | 128 | 161.2442 | 161.8867 | 188.8364 | 141.0409 | 107.2772 | 109.4424 | | inception_v3 | 128 | 161.1007 | 161.4151 | 188.2962 | 140.4066 | 107.2129 | 109.7345 | | vit_base_patch16_224 | 64 | 120.5057 | 121.2135 | 145.0847 | 131.9778 | 104.0895 | 104.9973 | | swsl_resnext101_32x16d | 32 | 117.9586 | 120.2449 | 147.9384 | nan | 104.0479 | 111.7188 | | res2net50_14w_8s | 128 | 145.9646 | 146.9639 | 180.3522 | 144.6156 | 99.747 | 103.9054 | | swin_base_patch4_window7_224 | 64 | 147.4218 | 153.8095 | nan | nan | 99.4938 | 101.4255 | | res2next50 | 128 | 138.5739 | 139.2147 | 166.2532 | 120.9292 | 97.9819 | 103.0634 | | mixer_b16_224 | 128 | 118.3927 | 118.6945 | 147.717 | 131.2577 | 94.2924 | 94.9651 | | dpn107 | 32 | 114.1448 | 116.2102 | 142.9534 | nan | 93.1465 | 91.6948 | | gmlp_s16_224 | 128 | 136.4577 | 136.7626 | 173.3511 | 134.2361 | 89.319 | 93.3348 | | jx_nest_base | 32 | 119.4181 | 119.8571 | 149.45 | nan | 88.0343 | 89.5958 | | dm_nfnet_f0 | 128 | 131.2243 | 131.0839 | 149.1748 | 142.2685 | 87.3255 | 91.7933 | | volo_d1_224 | 64 | 134.9064 | 135.5646 | 159.2379 | nan | 86.7394 | 88.8379 | | eca_botnext26ts_256 | 128 | 112.1139 | 135.414 | 163.8255 | 102.4868 | 86.2598 | 86.7936 | | gluon_xception65 | 32 | 98.1135 | 98.766 | 129.6576 | nan | 84.9126 | 86.9446 | | fbnetv3_b | 128 | 120.9444 | 123.387 | 149.0516 | nan | 83.0541 | 84.0009 | | visformer_small | 128 | 98.4237 | 98.1723 | 116.7189 | nan | 80.0355 | 83.6042 | | botnet26t_256 | 128 | 105.916 | 106.4749 | 128.4394 | 81.0942 | 78.3769 | 78.0294 | | crossvit_9_240 
| 128 | 109.643 | 109.8731 | 130.6228 | 118.4855 | 78.1837 | 79.6932 | | gmixer_24_224 | 128 | 119.951 | 136.5657 | 166.3535 | 129.4574 | 77.895 | 80.6904 | | res2net101_26w_4s | 64 | 121.1174 | 121.5526 | 126.7358 | nan | 77.7583 | 92.5493 | | twins_pcpvt_base | 64 | 125.2336 | 135.4873 | 138.8866 | nan | 76.9896 | 87.514 | | deit_base_distilled_patch16_224 | 64 | 94.2279 | 94.8172 | 118.1029 | 96.2245 | 76.0584 | 76.9173 | | coat_lite_mini | 128 | 116.0254 | 116.9803 | 137.1014 | 103.9831 | 72.128 | 73.4177 | | gernet_l | 128 | 79.8504 | 80.4814 | 98.5478 | 71.1172 | 70.916 | 70.2278 | | cspdarknet53 | 64 | 95.8688 | 96.7729 | 119.9937 | 78.8741 | 70.3855 | 68.7833 | | repvgg_a2 | 128 | 79.7626 | 80.6819 | 94.1535 | 70.0399 | 69.6266 | 67.4415 | | rexnet_100 | 128 | 90.9775 | 103.463 | 127.3235 | nan | 68.7849 | 68.8437 | | nfnet_l0 | 128 | 105.868 | 130.8876 | 148.7231 | 124.9244 | 68.1846 | 71.9624 | | sebotnet33ts_256 | 64 | 83.1834 | 96.2645 | 118.3769 | 82.7658 | 67.4266 | 66.6233 | | tf_efficientnet_b0 | 128 | 90.6027 | 108.4796 | 132.0958 | 92.7564 | 64.9052 | 64.5274 | | mobilevit_s | 64 | 90.1892 | 107.4363 | 133.5464 | nan | 63.6581 | 64.24 | | xcit_large_24_p8_224 | 5 | 124.7265 | nan | nan | nan | 62.2476 | 71.1225 | | fbnetc_100 | 128 | 87.9178 | 88.8007 | 105.7782 | 71.5889 | 61.9752 | 60.8697 | | tinynet_a | 128 | 75.3718 | 91.386 | 110.9803 | 91.7461 | 58.166 | 57.9372 | | spnasnet_100 | 128 | 76.6794 | 77.6764 | 93.4227 | 65.5487 | 54.2065 | 51.9902 | | resmlp_12_224 | 128 | 68.356 | 68.5321 | 87.5144 | 45.8294 | 50.9395 | 52.1561 | | ese_vovnet19b_dw | 128 | 67.8819 | 68.4303 | 86.1141 | 58.482 | 47.9615 | 47.9956 | | mnasnet_100 | 128 | 70.1647 | 70.9784 | 84.8852 | 54.9698 | 46.7827 | 46.1594 | | ghostnet_100 | 128 | 98.4748 | 97.5924 | 107.8389 | 94.1171 | 46.0625 | 56.7199 | | mobilenetv2_100 | 128 | 67.7192 | 68.4178 | 91.3491 | 56.5408 | 45.9171 | 44.8798 | | selecsls42b | 128 | 62.8593 | 63.1908 | 74.7196 | 49.0633 | 43.6609 | 44.4898 | | 
mobilenetv3_large_100 | 128 | 66.0152 | 66.7635 | 80.5302 | 63.3278 | 43.5638 | 43.4736 | | regnety_002 | 128 | 53.6246 | 56.3222 | 47.0018 | 61.0936 | 25.4645 | 37.6397 | | lcnet_050 | 128 | 33.9509 | 34.9005 | 38.7435 | 31.2538 | 17.0303 | 20.3715 | | eca_halonext26ts | 128 | 116.004 | 139.1878 | 167.84 | nan | nan | nan | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/oaBDqn6.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/KNUYPOG.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/KUulBbs.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of a forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When computing speedup, compilation latency, and memory footprint reduction, we exclude the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 95%, 53/56 | 98%, 42/43  | 98%, 60/61  |
|       aot_eager        | 93%, 52/56 | 95%, 41/43  | 98%, 60/61  |
|     aot_cudagraphs     | 73%, 41/56 | 72%, 31/43  | 46%, 28/61  |
|    nvprims_nvfuser     | 77%, 43/56 | 60%, 26/43  | 67%, 41/61  |
|        inductor        | 84%, 47/56 | 91%, 39/43  | 93%, 57/61  |
| inductor_no_cudagraphs | 91%, 51/56 | 91%, 39/43  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
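
Each cell in the pass-rate table pairs a whole-number percentage with the raw counts. A minimal sketch of that formatting follows; `format_passrate` is a hypothetical helper, not the dashboard's own code, and rounding to the nearest percent is an assumption that happens to agree with the cells above.

```python
def format_passrate(passed: int, total: int) -> str:
    """Render a pass-rate cell such as '95%, 53/56' (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # The ".0%" format multiplies by 100 and rounds to a whole percent.
    return f"{passed / total:.0%}, {passed}/{total}"


# Counts taken from the torchbench column of the pass-rate table.
for compiler, passed in [("eager", 53), ("aot_eager", 52), ("inductor", 47)]:
    print(f"{compiler:>10}: {format_passrate(passed, 56)}")
```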

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.02x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.12x    |    1.05x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.02x    |    1.14x    |
|        inductor        |   1.53x    |    1.30x    |    1.25x    |
| inductor_no_cudagraphs |   1.23x    |    1.23x    |    1.24x    |
+------------------------+------------+-------------+-------------+
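
For readers who want to reproduce the aggregation, here is a minimal sketch of a geometric mean over per-model speedups. It assumes speedup is eager latency divided by backend latency, and that entries reported as 0.0 or nan (runs that failed) are dropped before aggregating. The helper names are hypothetical and the dashboard's actual filtering may differ.

```python
import math


def speedup(eager_ms: float, backend_ms: float) -> float:
    """Speedup normalized against native PyTorch (eager) latency."""
    return eager_ms / backend_ms


def geomean(values):
    """Geometric mean of the positive entries; nan if none remain."""
    # `v > 0` is False for both 0.0 and nan, so failed runs are skipped.
    kept = [v for v in values if v > 0]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(v) for v in kept) / len(kept))


# A few inductor speedups from the torchbench table; dlrm is reported as 0.0.
sample = {"densenet121": 5.5899, "BERT_pytorch": 2.305, "dlrm": 0.0}
print(f"geomean over passing models: {geomean(sample.values()):.2f}x")
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown average out to 1.0x, which an arithmetic mean would not give.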

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.86    |    2.34     |    2.06     |
|       aot_eager        |    5.45    |    7.34     |    7.23     |
|     aot_cudagraphs     |    7.54    |    14.38    |    13.21    |
|    nvprims_nvfuser     |   63.74    |    98.63    |   148.48    |
|        inductor        |   30.44    |    30.55    |    37.13    |
| inductor_no_cudagraphs |   29.67    |    26.28    |    35.39    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.90x    |    1.00x    |    0.95x    |
|        inductor        |   0.83x    |    0.71x    |    0.98x    |
| inductor_no_cudagraphs |   0.97x    |    0.97x    |    1.09x    |
+------------------------+------------+-------------+-------------+
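
To make "higher is better" concrete, the sketch below assumes the compression ratio is eager peak memory divided by the backend's peak memory. That definition is an assumption on my part, and the byte counts are made up for illustration.

```python
def compression_ratio(eager_peak_bytes: float, backend_peak_bytes: float) -> float:
    """Assumed definition: eager peak memory / backend peak memory."""
    return eager_peak_bytes / backend_peak_bytes


def memory_multiple(ratio: float) -> float:
    """How many times the eager peak memory a backend uses, given its ratio."""
    return 1.0 / ratio


GIB = 2**30
# Hypothetical numbers: eager peaks at 10 GiB, the backend at 8 GiB.
print(compression_ratio(10 * GIB, 8 * GIB))
# Under this definition, a 0.39x ratio means about 2.6x the eager peak memory.
print(round(memory_multiple(0.39), 2))
```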

Warnings

We flag models where:

- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings

~~~
+-------------+------------------------+----------+------------------------+
| suite       | name                   | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench  | lennard_jones          | 1.7781   | 0.9455                 |
| torchbench  | soft_actor_critic      | 1.4336   | 0.9446                 |
| torchbench  | nvidia_deeprecommender | 0.9049   | 0.9642                 |
| torchbench  | dlrm                   | 0.0      | 1.0841                 |
| torchbench  | hf_GPT2_large          | 0.0      | 1.4765                 |
| torchbench  | hf_T5                  | 0.0      | 1.573                  |
| torchbench  | tacotron2              | 0.0      | 0.927                  |
| torchbench  | hf_BigBird             | 0.0      | 0.0                    |
| torchbench  | hf_Longformer          | 0.0      | 0.0                    |
| torchbench  | moco                   | 0.0      | 0.0                    |
| huggingface | BigBird                | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase  | 0.0      | 0.0                    |
| timm_models | tnt_s_patch16_224      | 0.0      | 1.5454                 |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+------------+-------------------+----------+------------------------+
| suite      | name              | inductor | inductor_no_cudagraphs |
+------------+-------------------+----------+------------------------+
| torchbench | yolov3            | 367.726  | 363.2658               |
| torchbench | timm_efficientdet | 130.3881 | 128.0873               |
+------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0022                 |
| torchbench  | hf_Albert                               | 0.8836   | 1.2215                 |
| torchbench  | mobilenet_v3_large                      | 0.8829   | 0.896                  |
| torchbench  | hf_T5_large                             | 0.8737   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8621   | 1.031                  |
| torchbench  | resnet50                                | 0.8566   | 0.9343                 |
| torchbench  | densenet121                             | 0.8562   | 1.0006                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.1229                 |
| torchbench  | hf_Bart                                 | 0.8326   | 1.1284                 |
| torchbench  | resnext50_32x4d                         | 0.8303   | 0.8352                 |
| torchbench  | BERT_pytorch                            | 0.826    | 1.0815                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7609   | 0.9526                 |
| torchbench  | timm_vision_transformer                 | 0.7507   | 0.8214                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8335                 |
| torchbench  | hf_Bert                                 | 0.7061   | 1.0275                 |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | LearningToPaint                         | 0.6881   | 0.913                  |
| torchbench  | vgg16                                   | 0.6637   | 0.9553                 |
| torchbench  | hf_DistilBert                           | 0.6595   | 0.9466                 |
| torchbench  | hf_Reformer                             | 0.577    | 1.0027                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4867   | 0.6781                 |
| torchbench  | pytorch_struct                          | 0.4222   | 0.4335                 |
| torchbench  | functorch_dp_cifar10                    | 0.4061   | 0.4214                 |
| torchbench  | dcgan                                   | 0.2564   | 0.2576                 |
| torchbench  | dlrm                                    | nan      | 0.7307                 |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4039                 |
| huggingface | T5Small                                 | 0.8564   | 1.0758                 |
| huggingface | PegasusForConditionalGeneration         | 0.8436   | 1.0204                 |
| huggingface | AlbertForMaskedLM                       | 0.842    | 1.3737                 |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.1049                 |
| huggingface | DistillGPT2                             | 0.8167   | 0.9378                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | YituTechConvBert                        | 0.7968   | 0.8799                 |
| huggingface | ElectraForCausalLM                      | 0.7929   | 0.9036                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.9692                 |
| huggingface | BartForConditionalGeneration            | 0.7734   | 0.9515                 |
| huggingface | M2M100ForConditionalGeneration          | 0.7712   | 1.0075                 |
| huggingface | MT5ForConditionalGeneration             | 0.7627   | 0.9397                 |
| huggingface | GoogleFnet                              | 0.7589   | 0.969                  |
| huggingface | MegatronBertForQuestionAnswering        | 0.7528   | 0.9646                 |
| huggingface | CamemBert                               | 0.7485   | 0.9186                 |
| huggingface | PLBartForCausalLM                       | 0.7381   | 0.9055                 |
| huggingface | PLBartForConditionalGeneration          | 0.724    | 0.9375                 |
| huggingface | MBartForConditionalGeneration           | 0.7209   | 0.9059                 |
| huggingface | LayoutLMForSequenceClassification       | 0.7189   | 1.0294                 |
| huggingface | MegatronBertForCausalLM                 | 0.7161   | 0.9247                 |
| huggingface | BartForCausalLM                         | 0.7149   | 0.9466                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.7147   | 0.8647                 |
| huggingface | ElectraForQuestionAnswering             | 0.7054   | 1.0298                 |
| huggingface | DistilBertForQuestionAnswering          | 0.6981   | 0.9303                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977   | 0.946                  |
| huggingface | LayoutLMForMaskedLM                     | 0.695    | 0.9772                 |
| huggingface | MBartForCausalLM                        | 0.6836   | 0.8978                 |
| huggingface | TrOCRForCausalLM                        | 0.6827   | 0.8876                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.6775   | 0.9179                 |
| huggingface | OPTForCausalLM                          | 0.6761   | 0.8845                 |
| huggingface | DistilBertForMaskedLM                   | 0.6531   | 0.9124                 |
| huggingface | BertForMaskedLM                         | 0.6385   | 0.8992                 |
| huggingface | RobertaForCausalLM                      | 0.6375   | 0.8974                 |
| huggingface | BertForQuestionAnswering                | 0.6329   | 0.8939                 |
| huggingface | RobertaForQuestionAnswering             | 0.6329   | 0.8939                 |
| huggingface | MobileBertForMaskedLM                   | 0.5256   | 0.7111                 |
| huggingface | MobileBertForQuestionAnswering          | 0.4536   | 0.5968                 |
| huggingface | DebertaForMaskedLM                      | 0.3862   | 1.0347                 |
| huggingface | DebertaForQuestionAnswering             | 0.2902   | 1.1339                 |
| timm_models | selecsls42b                             | 0.899    | 1.0046                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8931   | 0.9946                 |
| timm_models | res2net50_14w_8s                        | 0.8822   | 1.0206                 |
| timm_models | regnety_002                             | 0.8617   | 1.0396                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.9622                 |
| timm_models | pit_b_224                               | 0.8525   | 1.0753                 |
| timm_models | coat_lite_mini                          | 0.8441   | 1.0596                 |
| timm_models | sebotnet33ts_256                        | 0.841    | 0.9709                 |
| timm_models | resmlp_12_224                           | 0.8169   | 0.8253                 |
| timm_models | gernet_l                                | 0.7928   | 0.9926                 |
| timm_models | convit_base                             | 0.7463   | 0.9008                 |
| timm_models | crossvit_9_240                          | 0.6493   | 0.869                  |
| timm_models | repvgg_a2                               | 0.5319   | 0.8172                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.8623                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
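
The flagging rules can be sketched as a small filter. The thresholds come from the list of warning criteria; everything else is an assumption for illustration: the helper name, the row layout, flagging a model when either inductor variant crosses the threshold, and treating `nan` as a violation (which would explain why rows such as dlrm appear in the memory table).

```python
import math

# Thresholds stated in the warning criteria.
SPEEDUP_MIN = 0.95
COMPILE_LATENCY_MAX_SEC = 120.0
COMPRESSION_RATIO_MIN = 0.9

BACKENDS = ("inductor", "inductor_no_cudagraphs")


def flag_rows(rows, violates):
    """Return names of rows where any backend value violates the rule.

    Hypothetical helper: `rows` is a list of dicts keyed by "name" and backend.
    A nan value is treated as a violation (an assumption, see lead-in).
    """
    flagged = []
    for row in rows:
        values = [row[b] for b in BACKENDS]
        if any(math.isnan(v) or violates(v) for v in values):
            flagged.append(row["name"])
    return flagged


# Three rows copied from the torchbench speedup table.
speedups = [
    {"name": "densenet121", "inductor": 5.5899, "inductor_no_cudagraphs": 1.279},
    {"name": "lennard_jones", "inductor": 1.7781, "inductor_no_cudagraphs": 0.9455},
    {"name": "nvidia_deeprecommender", "inductor": 0.9049, "inductor_no_cudagraphs": 0.9642},
]
print(flag_rows(speedups, lambda v: v < SPEEDUP_MIN))
```

The same helper covers the other two criteria by swapping the predicate, for example `lambda v: v > COMPILE_LATENCY_MAX_SEC` or `lambda v: v < COMPRESSION_RATIO_MIN`.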

Metrics over time

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/jXD7TyE.png)

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/InGFvn9.png)

Accuracy Regressions

For each relevant compiler, we compare the two most recent reports (that actually run the compiler) to find models whose accuracy tests previously passed and now fail.

No accuracy regressions found.
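
The comparison described above amounts to a diff between two status maps. The sketch below is illustrative only. The status strings are the ones that appear in the accuracy tables, but the function name, the report layout, and counting `pass_due_to_skip` as a success are assumptions.

```python
def find_regressions(previous: dict, current: dict) -> list:
    """Models that passed accuracy in `previous` but do not pass in `current`.

    Hypothetical helper: each report maps model name -> accuracy status.
    Statuses starting with "pass" (including "pass_due_to_skip") count as
    success, which is an assumption.
    """
    def ok(status):
        return status is not None and status.startswith("pass")

    return sorted(
        name
        for name, status in current.items()
        if ok(previous.get(name)) and not ok(status)
    )


# Made-up reports for one compiler; not real dashboard data.
older = {"resnet18": "pass", "tacotron2": "pass", "moco": "fail_to_run"}
newer = {"resnet18": "pass", "tacotron2": "fail_to_run", "moco": "fail_to_run"}
print(find_regressions(older, newer) or "No accuracy regressions found.")
```

Models that were already failing in the older report (such as `moco` in this made-up data) are not counted, since only pass-to-fail transitions are regressions.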

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.002 | 1.0283 | 2.367 | 0.793 | 5.5899 | 1.279 | | timm_efficientdet | 1 | 0.9844 | 0.8936 | 1.8866 | 0.758 | 4.3839 | 1.513 | | functorch_dp_cifar10 | 64 | 1.0053 | 1.0305 | 2.2149 | 0.0 | 4.2606 | 1.2029 | | timm_vision_transformer | 8 | 1.0055 | 0.9463 | 1.5341 | 0.6757 | 2.5888 | 1.353 | | drq | 1 | 1.0107 | 0.8702 | 1.6441 | 0.707 | 2.4756 | 0.9829 | | BERT_pytorch | 16 | 1.0112 | 0.898 | 1.1084 | 0.9461 | 2.305 | 2.0612 | | resnext50_32x4d | 8 | 1.0036 | 1.1062 | 1.2076 | 0.8076 | 2.0816 | 1.208 | | mobilenet_v3_large | 32 | 1.0052 | 1.1137 | 1.0461 | 0.8726 | 2.0042 | 1.3252 | | resnet18 | 16 | 1.0008 | 1.1172 | 1.2621 | 0.8815 | 1.9664 | 1.2353 | | dcgan | 32 | 0.9793 | 1.0265 | 1.2948 | 0.7961 | 1.9603 | 1.0131 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9951 | 1.0206 | 1.2103 | 0.8604 | 1.9009 | 1.4854 | | pytorch_struct | 200 | 0.9954 | 0.7496 | 0.8874 | 0.7851 | 1.8119 | 1.1615 | | squeezenet1_1 | 32 | 0.9969 | 1.0254 | 1.0613 | 0.8752 | 1.7921 | 1.2624 | | lennard_jones | 1000 | 0.9576 | 0.8577 | 1.0371 | 0.686 | 1.7781 | 0.9455 | | hf_T5_large | 2 | 1.0254 | 0.9208 | 0.0 | 0.0 | 1.6918 | 1.6926 | | hf_Albert | 8 | 1.0008 | 0.9985 | 0.7527 | 1.5517 | 1.6477 | 1.6401 | | shufflenet_v2_x1_0 | 128 | 0.9993 | 1.0642 | 0.8223 | 0.9064 | 1.5451 | 1.4111 | | timm_resnest | 32 | 0.9994 | 1.0025 | 0.8051 | 1.1649 | 1.5178 | 1.4524 | | hf_GPT2 | 4 | 1.0091 | 0.9779 | 0.7405 | 0.4015 | 1.5022 | 1.5101 | | mnasnet1_0 | 32 | 0.9987 | 1.0989 | 0.8577 | 0.9184 | 1.4747 | 1.2619 | | timm_nfnet | 128 | 0.9995 | 1.0005 | 0.0 | 1.135 | 1.4717 | 1.4213 | | 
soft_actor_critic | 256 | 1.0044 | 0.8043 | 1.1032 | 0.7012 | 1.4336 | 0.9446 | | mobilenet_v2 | 96 | 0.9997 | 0.9989 | 0.7301 | 1.3368 | 1.4298 | 1.4093 | | fastNLP_Bert | 6 | 0.9989 | 0.977 | 0.7539 | 1.1545 | 1.4213 | 1.3937 | | mobilenet_v2_quantized_qat | 96 | 1.0014 | 0.9777 | 0.0 | 1.4567 | 1.4081 | 1.4045 | | speech_transformer | 32 | 1.0019 | 0.9079 | 1.3856 | 0.7432 | 1.4056 | 1.4112 | | timm_efficientnet | 32 | 0.9573 | 0.8192 | 0.7059 | 0.8256 | 1.3923 | 1.1946 | | resnet50_quantized_qat | 32 | 1.0023 | 0.9691 | 0.0 | 1.1526 | 1.3667 | 1.3744 | | resnet152 | 32 | 1.0011 | 1.0582 | 0.8131 | 0.9041 | 1.2916 | 1.2171 | | hf_Bert | 4 | 1.0328 | 1.0064 | 0.7349 | 0.8807 | 1.2794 | 1.1802 | | pytorch_stargan | 16 | 0.9993 | 1.0763 | 0.9332 | 0.0 | 1.2658 | 1.2296 | | LearningToPaint | 96 | 1.0059 | 1.0397 | 0.8946 | 0.9732 | 1.2499 | 1.2092 | | pytorch_unet | 1 | 0.9998 | 0.281 | 0.0 | 0.0 | 1.2094 | 1.1945 | | resnet50 | 32 | 0.9992 | 0.9942 | 0.7658 | 0.9863 | 1.2038 | 1.1699 | | hf_Bart | 4 | 1.0112 | 0.9765 | 0.7396 | 0.8481 | 1.2022 | 1.2011 | | hf_DistilBert | 8 | 1.0004 | 0.956 | 0.6866 | 0.5218 | 1.1755 | 1.1812 | | vgg16 | 64 | 1.0 | 0.999 | 0.8591 | 0.9978 | 1.1707 | 1.1686 | | alexnet | 128 | 0.9985 | 0.9982 | 0.8028 | 1.0052 | 1.1623 | 1.1641 | | Super_SloMo | 6 | 0.9995 | 0.2414 | 0.0 | 0.2471 | 1.1474 | 1.1323 | | hf_Reformer | 4 | 0.9978 | 1.0021 | 0.9898 | 0.7393 | 1.1321 | 1.1411 | | timm_regnet | 32 | 0.9655 | 0.9626 | 0.7745 | 1.0936 | 1.1311 | 1.0915 | | yolov3 | 16 | 1.0 | 0.9948 | 0.7823 | 1.1526 | 1.0931 | 1.0793 | | Background_Matting | 4 | 1.0003 | 0.1915 | 0.0 | 0.0 | 1.0816 | 1.0719 | | attention_is_all_you_need_pytorch | 256 | 0.9999 | 0.9693 | 0.7575 | 0.9554 | 1.0633 | 1.0467 | | timm_vision_transformer_large | 8 | 0.9998 | 0.9956 | 0.0 | 0.0 | 1.0494 | 1.0348 | | timm_vovnet | 32 | 0.91 | 0.9044 | 0.7118 | 0.8775 | 1.0078 | 1.0197 | | tts_angular | 64 | 0.9855 | 0.9682 | 0.9926 | 0.9799 | 1.006 | 1.0129 | | demucs | 4 | 1.0007 | 
0.9996 | 0.9998 | 1.0 | 0.9999 | 0.9997 | | nvidia_deeprecommender | 256 | 0.9988 | 0.9637 | 0.5851 | 0.9763 | 0.9049 | 0.9642 | | dlrm | 2048 | 0.0 | 1.0814 | 0.0 | 1.0896 | 0.0 | 1.0841 | | hf_GPT2_large | 4 | 1.0001 | 0.9806 | 0.0 | 0.0 | 0.0 | 1.4765 | | hf_T5 | 8 | 1.0009 | 0.9537 | 0.0 | 1.1701 | 0.0 | 1.573 | | tacotron2 | 64 | 0.9834 | 0.8733 | 0.0 | 0.7716 | 0.0 | 0.927 | | hf_BigBird | 2 | 0.9801 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | 
pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | resnet152 | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | pass | pass | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | fail_to_run | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | hf_Albert | 2 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | 
pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | pass | pass | 0.0000 | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | 0.0000 | fail_accuracy | fail_accuracy | fail_accuracy | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.8807 | 7.259 | 10.4638 | 113.2902 | 367.726 | 363.2658 | | timm_efficientdet | 1 | 
19.7967 | 33.7988 | 66.6771 | 484.3336 | 130.3881 | 128.0873 | | hf_T5_large | 2 | 14.0082 | 34.3823 | nan | nan | 113.1845 | 110.508 | | timm_vision_transformer_large | 8 | 2.5123 | 11.1104 | nan | nan | 53.7542 | 52.2482 | | attention_is_all_you_need_pytorch | 256 | 1.2767 | 5.6115 | 8.9442 | 115.692 | 46.3064 | 45.373 | | densenet121 | 4 | 2.1945 | 9.9935 | 15.9759 | 168.4349 | 43.9179 | 42.1458 | | resnet152 | 32 | 2.4725 | 10.8513 | 17.8488 | 190.3924 | 43.3608 | 41.1323 | | timm_resnest | 32 | 0.5887 | 2.0654 | 3.0871 | 59.9525 | 39.5609 | 39.4402 | | timm_vision_transformer | 8 | 0.8648 | 3.3865 | 4.8971 | 74.9425 | 31.7172 | 30.4505 | | timm_nfnet | 128 | 2.043 | 6.1861 | nan | 155.2958 | 29.0902 | 28.2483 | | hf_Bart | 4 | 1.664 | 6.526 | 10.471 | 120.1202 | 28.8464 | 28.1093 | | mobilenet_v2_quantized_qat | 96 | 1.3121 | 7.3383 | nan | 183.8735 | 28.3677 | 28.4654 | | BERT_pytorch | 16 | 1.5367 | 5.8841 | 8.9191 | 83.0114 | 28.1248 | 26.8919 | | resnet50_quantized_qat | 32 | 1.2471 | 7.0562 | nan | 170.0977 | 27.9367 | 28.3744 | | fastNLP_Bert | 6 | 1.5358 | 5.3282 | 8.6411 | 80.8419 | 27.0223 | 24.8237 | | pytorch_stargan | 16 | 0.4376 | 1.7244 | 2.4621 | nan | 26.0491 | 26.361 | | speech_transformer | 32 | 1.6208 | 6.6302 | 26.4753 | 124.9677 | 23.9809 | 22.2548 | | timm_regnet | 32 | 2.2891 | 6.6663 | 19.4683 | 112.1156 | 23.9598 | 23.1945 | | timm_efficientnet | 32 | 1.7823 | 5.6781 | 13.8471 | 109.7589 | 23.6685 | 22.8594 | | mobilenet_v3_large | 32 | 0.9087 | 3.8819 | 5.7929 | 98.1432 | 22.9827 | 23.0171 | | pytorch_struct | 200 | 0.2554 | 0.629 | 1.1776 | 4.4896 | 19.2314 | 22.9522 | | hf_Bert | 4 | 1.5692 | 5.2428 | 7.7186 | 83.749 | 19.1858 | 18.2935 | | Super_SloMo | 6 | 1.0376 | 6.5134 | nan | 56.1213 | 18.6095 | 18.0297 | | mnasnet1_0 | 32 | 0.8255 | 3.5274 | 5.1863 | 72.7155 | 18.3043 | 18.4501 | | hf_Reformer | 4 | 1.6103 | 2.6305 | 4.8438 | 14.631 | 18.3019 | 15.6137 | | shufflenet_v2_x1_0 | 128 | 0.9478 | 4.1694 | 6.2145 | 89.1772 | 
18.1477 | 17.3928 | | timm_vovnet | 32 | 1.5396 | 3.817 | 8.971 | 58.6272 | 17.9981 | 17.3908 | | resnet50 | 32 | 0.8681 | 3.727 | 5.8044 | 75.7857 | 17.8111 | 17.5767 | | hf_Albert | 8 | 1.2745 | 4.6759 | 7.2609 | 115.6277 | 17.5903 | 16.7315 | | resnext50_32x4d | 8 | 0.9248 | 3.7125 | 5.4861 | 64.4126 | 17.5455 | 16.995 | | hf_GPT2 | 4 | 1.4861 | 4.8677 | 7.4294 | 62.5911 | 17.4001 | 16.5234 | | Background_Matting | 4 | 0.7907 | 7.7124 | nan | nan | 17.0716 | 16.3261 | | mobilenet_v2 | 96 | 0.8345 | 3.767 | 5.9694 | 97.9922 | 16.8705 | 16.2033 | | functorch_dp_cifar10 | 64 | 0.3078 | 1.1128 | 1.7123 | nan | 14.8992 | 14.8213 | | hf_DistilBert | 8 | 0.6586 | 2.612 | 4.5239 | 40.5058 | 11.9275 | 11.4804 | | resnet18 | 16 | 0.4205 | 1.4755 | 2.2297 | 30.5835 | 11.1727 | 10.832 | | pytorch_unet | 1 | 0.4502 | 2.6712 | nan | nan | 8.6179 | 8.2316 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4239 | 1.5807 | 2.3191 | 32.4835 | 8.1934 | 7.8573 | | LearningToPaint | 96 | 0.4574 | 1.5789 | 2.3874 | 38.6551 | 7.0438 | 6.5122 | | dcgan | 32 | 0.187 | 0.3602 | 0.5751 | 4.4017 | 5.9997 | 5.7101 | | squeezenet1_1 | 32 | 0.231 | 0.6829 | 1.0663 | 4.3528 | 3.9483 | 3.8257 | | drq | 1 | 0.2969 | 0.4957 | 0.7472 | 4.4239 | 3.8254 | 3.3307 | | vgg16 | 64 | 0.1973 | 0.4541 | 0.8304 | 3.0908 | 3.4692 | 3.2018 | | soft_actor_critic | 256 | 0.1963 | 0.2887 | 0.4888 | 1.5993 | 3.3549 | 2.852 | | nvidia_deeprecommender | 256 | 0.2032 | 0.3829 | 0.6397 | 4.7741 | 3.3129 | 3.0572 | | alexnet | 128 | 0.1533 | 0.3191 | 0.5644 | 2.9203 | 2.9895 | 2.6365 | | lennard_jones | 1000 | 0.1429 | 0.2486 | 0.3949 | 1.3747 | 1.9909 | 1.8012 | | tts_angular | 64 | 0.1777 | 0.2143 | 0.3433 | 1.0728 | 1.9157 | 1.6868 | | demucs | 4 | 0.3013 | 0.3005 | 0.3024 | 0.2988 | 0.2079 | 0.2105 | | hf_GPT2_large | 4 | 5.2606 | 15.8764 | nan | nan | nan | 44.3538 | | tacotron2 | 64 | 5.5508 | 15.3638 | nan | 42.2644 | nan | 44.0207 | | hf_T5 | 8 | 2.5453 | 7.6698 | nan | 76.9534 | nan | 26.9261 | | dlrm | 2048 | nan | 
0.7119 | nan | 2.8942 | nan | 2.8375 | | hf_BigBird | 2 | 3.3534 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.582 | 1.582 | | resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.2258 | 1.4867 | 1.4871 | | timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2634 | 0.988 | 1.3107 | 1.3923 | | mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1743 | 1.2832 | | timm_efficientdet | 1 | 1.0111 | 0.823 | 0.2891 | 1.1375 | 1.1162 | 1.1442 | | Super_SloMo | 6 | 1.0024 | 0.902 | nan | 0.9454 | 1.1136 | 1.3409 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 1.0823 | 1.1864 | | speech_transformer | 32 | 0.9982 | 0.9772 | 0.2738 | 1.1209 | 1.0397 | 1.0443 | | timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.7594 | 1.0219 | 1.0958 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 0.9821 | 1.0223 | | hf_GPT2 | 4 | 1.0 | 0.906 | 0.3702 | 1.1242 | 0.9703 | 1.1698 | | timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9405 | 1.0831 | | Background_Matting | 4 | 0.9998 | 0.8154 | nan | nan | 0.9342 | 1.0395 | | yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8549 | 0.9237 | 1.1052 | | 
pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9173 | 0.392 | 0.8945 | 0.9183 | 0.9986 | | pytorch_unet | 1 | 0.9985 | 0.8222 | nan | nan | 0.9117 | 1.105 | | resnet152 | 32 | 0.9975 | 0.9153 | 0.3424 | 0.8736 | 0.9068 | 0.9672 | | pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 | | timm_resnest | 32 | 0.9935 | 0.88 | 0.3236 | 0.7926 | 0.8982 | 1.0022 | | hf_Albert | 8 | 1.0 | 0.949 | 0.2846 | 1.062 | 0.8836 | 1.2215 | | mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8098 | 0.8829 | 0.896 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8737 | 0.922 | | timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 | | resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8566 | 0.9343 | | densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.8562 | 1.0006 | | mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8531 | 0.8659 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 1.2124 | 0.8354 | 1.1229 | | hf_Bart | 4 | 1.0 | 0.8777 | 0.3387 | 1.0865 | 0.8326 | 1.1284 | | resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8196 | 0.8303 | 0.8352 | | BERT_pytorch | 16 | 1.0 | 0.8995 | 0.3503 | 1.1277 | 0.826 | 1.0815 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 | | timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7609 | 0.9526 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3304 | 1.0652 | 0.7507 | 0.8214 | | soft_actor_critic | 256 | 0.9998 | 0.9638 | 0.4356 | 0.9637 | 0.75 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8335 | | hf_Bert | 4 | 1.0 | 0.9011 | 0.3525 | 1.0004 | 0.7061 | 1.0275 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6902 | 0.7049 | | LearningToPaint | 96 | 0.9442 | 0.7177 | 0.3385 | 0.6271 | 0.6881 | 0.913 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9553 | | hf_DistilBert | 8 | 1.0 | 0.9042 | 0.3212 | 1.0228 | 0.6595 | 0.9466 | | hf_Reformer | 4 | 0.9999 | 0.9996 | 0.5934 | 0.9995 | 0.577 | 1.0027 | | 
lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9676 | 0.4867 | 0.6781 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4222 | 0.4335 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4061 | 0.4214 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.2564 | 0.2576 | | hf_GPT2_large | 4 | 1.0 | 0.8833 | nan | nan | nan | 1.1831 | | tacotron2 | 64 | 0.9903 | 1.0926 | nan | 1.114 | nan | 1.1613 | | hf_T5 | 8 | 1.0 | 0.9415 | nan | 0.9432 | nan | 1.1507 | | dlrm | 2048 | nan | 0.7305 | nan | 0.7306 | nan | 0.7307 | | hf_BigBird | 2 | 0.907 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | timm_vision_transformer_large | 8 | 196.0999 | 197.2905 | nan | nan | 187.5964 | 189.9322 | | Background_Matting | 4 | 186.4617 | 973.7403 | nan | nan | 172.3996 | 174.0253 | | timm_nfnet | 128 | 206.4675 | 205.9807 | nan | 181.7487 | 139.6754 | 144.6123 | | hf_T5_large | 2 | 187.5866 | 211.6902 | nan | nan | 118.5217 | 121.1715 | | mobilenet_v2_quantized_qat | 96 | 147.8734 | 151.0848 | nan | 101.4009 | 105.4346 | 105.4488 | | Super_SloMo | 6 | 117.6211 | 486.1255 | nan | 475.616 | 102.3664 | 103.5865 | | yolov3 | 16 | 102.2263 | 
102.7553 | 130.552 | 88.8157 | 93.6876 | 94.683 | | vgg16 | 64 | 106.3674 | 106.4006 | 123.8994 | 106.3247 | 90.6617 | 91.0156 | | timm_regnet | 32 | 101.337 | 101.9568 | 126.1745 | 89.8057 | 86.9327 | 89.4721 | | demucs | 4 | 77.5777 | 77.7209 | 77.6134 | 77.6115 | 77.5726 | 77.7158 | | hf_Reformer | 4 | 83.7628 | 83.3327 | 84.0933 | 112.29 | 73.578 | 72.8805 | | resnet152 | 32 | 90.4322 | 85.4116 | 113.0807 | 100.3316 | 73.5201 | 77.1113 | | resnet50_quantized_qat | 32 | 97.1729 | 96.2735 | nan | 81.2327 | 69.4964 | 69.9786 | | attention_is_all_you_need_pytorch | 256 | 71.9907 | 74.2143 | 95.5247 | 75.2072 | 68.0375 | 68.6804 | | mobilenet_v2 | 96 | 71.3948 | 71.4376 | 97.8132 | 53.367 | 49.9416 | 50.6888 | | pytorch_unet | 1 | 58.492 | 208.0358 | nan | nan | 48.4952 | 48.9884 | | hf_Bart | 4 | 55.1187 | 57.0267 | 74.7245 | 64.6953 | 45.7837 | 45.9858 | | hf_Albert | 8 | 74.9673 | 75.2696 | 99.9412 | 48.3571 | 45.6859 | 45.7535 | | fastNLP_Bert | 6 | 59.654 | 60.8426 | 79.1198 | 51.5366 | 42.2747 | 42.7473 | | timm_vovnet | 32 | 42.3401 | 42.6199 | 54.0984 | 44.6349 | 38.2777 | 37.7189 | | speech_transformer | 32 | 49.0999 | 58.5326 | 35.6806 | 64.2175 | 36.6238 | 35.0967 | | hf_GPT2 | 4 | 49.7294 | 51.3113 | 68.2873 | 125.5221 | 33.5537 | 33.3742 | | timm_efficientnet | 32 | 44.9537 | 52.7947 | 61.243 | 52.6465 | 33.188 | 36.2375 | | hf_DistilBert | 8 | 38.8806 | 40.6618 | 56.6495 | 74.5405 | 33.1347 | 32.8853 | | hf_Bert | 4 | 38.0075 | 39.218 | 53.469 | 43.9542 | 32.4662 | 32.9047 | | timm_efficientdet | 1 | 140.5324 | 163.3135 | 83.3095 | 182.7956 | 32.3031 | 92.4056 | | resnet50 | 32 | 38.6674 | 38.938 | 50.8671 | 39.1294 | 32.1219 | 33.0793 | | shufflenet_v2_x1_0 | 128 | 37.0957 | 35.1363 | 45.9853 | 41.107 | 24.4636 | 26.6038 | | BERT_pytorch | 16 | 46.1258 | 51.8861 | 42.1829 | 48.6828 | 23.0489 | 22.9923 | | timm_resnest | 32 | 31.713 | 31.5959 | 39.3836 | 27.2175 | 20.795 | 21.7074 | | mnasnet1_0 | 32 | 29.5307 | 26.0809 | 33.2445 | 31.4066 | 19.4285 | 
23.7344 | | pytorch_stargan | 16 | 24.2501 | 22.4614 | 25.9678 | nan | 19.102 | 19.6522 | | mobilenet_v3_large | 32 | 31.8087 | 27.9204 | 30.4781 | 36.6667 | 16.101 | 25.7131 | | resnext50_32x4d | 8 | 29.4554 | 24.6249 | 22.218 | 33.0301 | 13.4286 | 22.4374 | | LearningToPaint | 96 | 16.5712 | 15.2114 | 18.4252 | 16.1187 | 12.434 | 12.8808 | | densenet121 | 4 | 65.1004 | 63.8981 | 28.7102 | 84.4615 | 12.218 | 52.051 | | alexnet | 128 | 12.4151 | 12.4227 | 15.4659 | 12.4043 | 10.6811 | 10.6877 | | nvidia_deeprecommender | 256 | 8.5309 | 8.8495 | 14.5783 | 8.7548 | 9.4199 | 8.8465 | | timm_vision_transformer | 8 | 23.5187 | 25.6088 | 15.7257 | 38.9902 | 9.3532 | 21.5024 | | tts_angular | 64 | 9.4104 | 9.7887 | 9.4884 | 9.4289 | 9.2019 | 9.1786 | | pytorch_CycleGAN_and_pix2pix | 1 | 18.2957 | 16.0423 | 13.5492 | 19.6778 | 9.1668 | 11.2925 | | squeezenet1_1 | 32 | 12.7083 | 13.3374 | 12.0945 | 14.6251 | 7.3389 | 10.4739 | | resnet18 | 16 | 12.0525 | 10.8967 | 10.3973 | 14.0236 | 6.6458 | 10.6774 | | functorch_dp_cifar10 | 64 | 11.5138 | 11.441 | 5.3439 | nan | 3.2304 | 9.9175 | | pytorch_struct | 200 | 3.777 | 5.0037 | 4.289 | 4.8009 | 2.1011 | 3.2671 | | dcgan | 32 | 3.1845 | 2.5704 | 2.0988 | 4.137 | 1.3537 | 2.5904 | | drq | 1 | 2.8428 | 3.5627 | 1.8727 | 4.19 | 1.2347 | 3.8885 | | soft_actor_critic | 256 | 0.9785 | 1.222 | 0.9184 | 1.5123 | 0.7452 | 1.1308 | | lennard_jones | 1000 | 1.1238 | 1.26 | 1.0333 | 1.6229 | 0.6405 | 1.2071 | | tacotron2 | 64 | 2763.2667 | 3609.7647 | nan | 3574.4011 | nan | 3022.3754 | | dlrm | 2048 | nan | 476.1851 | nan | 475.6062 | nan | 507.4878 | | hf_GPT2_large | 4 | 241.0348 | 245.581 | nan | nan | nan | 162.9478 | | hf_T5 | 8 | 182.9213 | 191.7345 | nan | 156.6333 | nan | 116.3155 | | hf_BigBird | 2 | 190.0247 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
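A minimal sketch of how rows from the absolute latency table above could be read programmatically. The helper names (`parse_table`, `to_float`, `latency_ratio`) are hypothetical and not part of the benchmark tooling. The three data rows are copied from the table, and `nan` is read as "no measurement". The eager/inductor ratio is only an illustration and is not necessarily how the dashboard computes its own speedup column.

```python
import math

# Three rows copied from the absolute latency (ms) table; borders shortened.
SAMPLE = """
+------------+-----+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+------------+-----+
| timm_nfnet | 128 | 206.4675 | 205.9807 | nan | 181.7487 | 139.6754 | 144.6123 |
| demucs | 4 | 77.5777 | 77.7209 | 77.6134 | 77.6115 | 77.5726 | 77.7158 |
| hf_BigBird | 2 | 190.0247 | nan | nan | nan | nan | nan |
+------------+-----+
"""


def parse_table(text):
    """Parse a '+---+' bordered, pipe-delimited table into a list of dicts."""
    header = None
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border line such as '+----+----+'
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe row is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def to_float(cell):
    """Convert a cell to float; 'nan' (no measurement) becomes None."""
    value = float(cell)
    return None if math.isnan(value) else value


def latency_ratio(rows, baseline="eager", backend="inductor"):
    """Return {model: baseline_ms / backend_ms}, or None where either is missing."""
    ratios = {}
    for row in rows:
        base = to_float(row[baseline])
        other = to_float(row[backend])
        ratios[row["name"]] = base / other if base and other else None
    return ratios


for name, ratio in latency_ratio(parse_table(SAMPLE)).items():
    print(name, "n/a" if ratio is None else f"{ratio:.3f}x")
```

Returning `None` for missing cells keeps failed runs visible in the output instead of silently dropping the model.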

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0278 | 0.948 | 1.79 | 0.0 | 3.3175 | 1.45 |
| CamemBert | 1 | 1.0453 | 0.9641 | 1.3525 | 0.0 | 2.4625 | 1.5279 |
| MT5ForConditionalGeneration | 8 | 1.0286 | 0.9114 | 1.1982 | 0.9981 | 2.2904 | 1.9702 |
| MobileBertForMaskedLM | 32 | 1.0219 | 0.9444 | 1.3058 | 0.0 | 2.0404 | 1.5528 |
| DistillGPT2 | 1 | 1.0308 | 0.9621 | 1.1743 | 0.0 | 2.0299 | 1.8536 |
| GoogleFnet | 1 | 0.9797 | 0.8049 | 0.9724 | 0.0 | 1.8759 | 1.1406 |
| GPT2ForSequenceClassification | 4 | 1.0012 | 0.9714 | 0.0 | 0.7005 | 1.7987 | 1.7834 |
| T5ForConditionalGeneration | 4 | 0.9997 | 0.9324 | 0.7228 | 1.092 | 1.4532 | 1.4409 |
| MobileBertForQuestionAnswering | 64 | 1.0228 | 0.9333 | 0.8629 | 0.0 | 1.4403 | 1.2635 |
| ElectraForQuestionAnswering | 64 | 1.0005 | 0.9855 | 0.0 | 1.1809 | 1.427 | 1.4084 |
| ElectraForCausalLM | 32 | 1.0006 | 0.9323 | 0.0 | 1.0177 | 1.4131 | 1.4502 |
| M2M100ForConditionalGeneration | 8 | 1.0964 | 0.9401 | 0.914 | 0.7999 | 1.4015 | 1.3403 |
| LayoutLMForSequenceClassification | 16 | 0.9999 | 0.9892 | 0.7369 | 1.1164 | 1.3035 | 1.2913 |
| T5Small | 1 | 1.0199 | 0.9585 | 1.0124 | 0.9537 | 1.2896 | 1.1461 |
| AlbertForQuestionAnswering | 4 | 1.001 | 1.0017 | 0.0 | 1.2365 | 1.262 | 1.2584 |
| AlbertForMaskedLM | 4 | 1.0007 | 0.9992 | 0.0 | 1.2313 | 1.2589 | 1.2546 |
| PLBartForConditionalGeneration | 16 | 1.0156 | 0.963 | 0.8171 | 0.7986 | 1.2189 | 1.1994 |
| LayoutLMForMaskedLM | 16 | 1.0002 | 0.9707 | 0.0 | 1.0621 | 1.214 | 1.2112 |
| OPTForCausalLM | 32 | 1.0061 | 0.9298 | 0.708 | 0.4636 | 1.1884 | 1.2066 |
| DistilBertForQuestionAnswering | 64 | 1.0003 | 0.9844 | 0.7113 | 0.5146 | 1.1704 | 1.1522 |
| XGLMForCausalLM | 8 | 1.0117 | 0.942 | 0.7319 | 0.3086 | 1.1704 | 1.1806 |
| RobertaForCausalLM | 64 | 1.0006 | 0.9634 | 0.7451 | 0.9757 | 1.153 | 1.1569 |
| MegatronBertForQuestionAnswering | 16 | 1.0386 | 1.0141 | 0.7683 | 0.8434 | 1.1415 | 1.1809 |
| MegatronBertForCausalLM | 16 | 1.0337 | 1.0081 | 0.751 | 0.9007 | 1.1398 | 1.1168 |
| Speech2Text2ForCausalLM | 128 | 0.9975 | 0.9377 | 0.6569 | 0.9342 | 1.1238 | 1.1519 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9933 | 0.0 | 1.0277 | 1.123 | 1.1145 |
| BertForQuestionAnswering | 128 | 0.9999 | 0.9942 | 0.0 | 1.027 | 1.111 | 1.1095 |
| BartForCausalLM | 4 | 1.0003 | 0.9668 | 0.7547 | 0.9932 | 1.1014 | 1.1106 |
| BartForConditionalGeneration | 2 | 1.001 | 0.988 | 0.0 | 0.4438 | 1.099 | 1.0893 |
| MBartForConditionalGeneration | 16 | 1.013 | 0.9856 | 0.7619 | 0.9076 | 1.0984 | 1.0811 |
| DebertaForMaskedLM | 4 | 0.9151 | 0.8118 | 0.7283 | 0.642 | 1.0794 | 1.0504 |
| PegasusForConditionalGeneration | 16 | 1.0073 | 0.9804 | 0.7559 | 0.8682 | 1.077 | 1.0818 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0008 | 0.9407 | 0.0 | 0.9544 | 1.0731 | 1.0726 |
| BertForMaskedLM | 64 | 1.0004 | 0.9616 | 0.7309 | 0.9714 | 1.062 | 1.0609 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.9521 | 0.713 | 0.6294 | 1.0517 | 1.0671 |
| DebertaForQuestionAnswering | 8 | 0.9971 | 0.9912 | 0.6884 | 0.8624 | 1.0495 | 1.2216 |
| PLBartForCausalLM | 32 | 1.0056 | 0.936 | 0.7177 | 0.9105 | 1.013 | 1.0588 |
| TrOCRForCausalLM | 32 | 1.0003 | 0.9525 | 0.7347 | 0.9541 | 1.0044 | 1.0144 |
| BlenderbotSmallForCausalLM | 64 | 1.0015 | 0.9102 | 0.6842 | 0.9192 | 1.0017 | 1.0423 |
| MBartForCausalLM | 32 | 1.0011 | 0.9517 | 0.7307 | 0.9538 | 0.9983 | 1.0104 |
| PegasusForCausalLM | 32 | 0.9996 | 0.9541 | 0.7329 | 0.9525 | 0.9933 | 1.0047 |
| BigBird | 1 | 0.9756 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 4.6799 | 9.7451 | 37.7603 | 74.9791 | 93.7311 | 37.1832 |
| DebertaForMaskedLM | 4 | 4.625 | 9.6849 | 34.4142 | 77.2589 | 93.6008 | 35.7049 |
| XGLMForCausalLM | 8 | 2.5873 | 9.9496 | 21.8737 | 211.6812 | 69.683 | 66.9669 |
| M2M100ForConditionalGeneration | 8 | 3.1585 | 12.3055 | 18.9273 | 266.8168 | 55.6702 | 54.2702 |
| MobileBertForMaskedLM | 32 | 8.5245 | 23.5221 | 41.4135 | nan | 52.9445 | 50.9581 |
| MobileBertForQuestionAnswering | 64 | 8.7709 | 23.5284 | 41.1035 | nan | 52.0634 | 50.0271 |
| BartForConditionalGeneration | 2 | 3.2899 | 12.4093 | nan | 281.9647 | 45.0986 | 44.9473 |
| MBartForConditionalGeneration | 16 | 3.2535 | 12.5113 | 21.4955 | 316.6465 | 44.5973 | 42.933 |
| PegasusForConditionalGeneration | 16 | 3.074 | 12.1197 | 20.1468 | 320.5566 | 44.1078 | 40.4883 |
| YituTechConvBert | 1 | 2.4101 | 7.9178 | 12.725 | nan | 40.7864 | 37.9998 |
| MegatronBertForCausalLM | 16 | 3.3172 | 10.5067 | 17.1596 | 210.5015 | 35.0185 | 33.2222 |
| MegatronBertForQuestionAnswering | 16 | 3.3817 | 10.6746 | 17.5357 | 201.925 | 34.6781 | 34.129 |
| MT5ForConditionalGeneration | 8 | 3.8065 | 11.2468 | 17.8549 | 127.459 | 33.2467 | 31.8467 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.1059 | 8.2157 | nan | 157.2939 | 31.3383 | 29.6592 |
| T5ForConditionalGeneration | 4 | 2.5001 | 7.569 | 12.1564 | 79.5698 | 30.5342 | 28.9539 |
| T5Small | 1 | 2.5169 | 7.3789 | 11.1163 | 77.738 | 28.412 | 27.7859 |
| LayoutLMForSequenceClassification | 16 | 1.9615 | 5.7457 | 9.1403 | 84.3767 | 27.1967 | 26.6941 |
| PLBartForConditionalGeneration | 16 | 1.6852 | 6.3591 | 10.4439 | 114.76 | 27.1244 | 25.5205 |
| ElectraForCausalLM | 32 | 1.5973 | 5.413 | nan | 86.2367 | 26.3008 | 24.1763 |
| GoogleFnet | 1 | 0.9732 | 2.8177 | 10.6991 | nan | 25.6812 | 17.7055 |
| PegasusForCausalLM | 32 | 1.243 | 4.8461 | 7.7024 | 86.0554 | 21.8712 | 19.6855 |
| MBartForCausalLM | 32 | 1.198 | 4.5996 | 7.5836 | 89.0155 | 21.4244 | 21.0342 |
| LayoutLMForMaskedLM | 16 | 1.9633 | 5.6911 | nan | 90.3046 | 21.2562 | 20.5129 |
| BertForMaskedLM | 64 | 1.617 | 5.2712 | 7.9962 | 86.6911 | 21.1919 | 20.1105 |
| TrOCRForCausalLM | 32 | 1.2218 | 4.6215 | 7.2409 | 85.3976 | 20.6503 | 19.0722 |
| BertForQuestionAnswering | 128 | 1.5937 | 5.2899 | nan | 83.9809 | 20.391 | 19.1219 |
| ElectraForQuestionAnswering | 64 | 1.5842 | 5.2631 | nan | 88.1157 | 20.3013 | 19.6335 |
| BartForCausalLM | 4 | 1.3111 | 4.6259 | 7.2641 | 85.2918 | 19.8259 | 19.7265 |
| RobertaForCausalLM | 64 | 1.6095 | 5.3798 | 8.2239 | 91.7222 | 19.7341 | 19.0074 |
| RobertaForQuestionAnswering | 128 | 1.6176 | 5.3596 | nan | 85.3554 | 19.3167 | 18.0855 |
| OPTForCausalLM | 32 | 1.2853 | 4.7295 | 9.3914 | 76.3554 | 18.9257 | 18.0902 |
| CamemBert | 1 | 1.6859 | 5.2929 | 7.9892 | nan | 18.7475 | 18.3393 |
| GPT2ForSequenceClassification | 4 | 1.5484 | 5.0725 | nan | 64.722 | 16.972 | 15.7389 |
| AlbertForQuestionAnswering | 4 | 1.3008 | 4.6933 | nan | 113.414 | 16.3598 | 15.2659 |
| AlbertForMaskedLM | 4 | 1.3095 | 4.5517 | nan | 113.1522 | 16.0982 | 15.3411 |
| BlenderbotSmallForCausalLM | 64 | 0.838 | 3.2272 | 4.8729 | 55.9989 | 14.7732 | 13.9733 |
| Speech2Text2ForCausalLM | 128 | 0.7496 | 2.5652 | 4.6055 | 40.46 | 14.4733 | 12.998 |
| PLBartForCausalLM | 32 | 0.6958 | 2.5523 | 3.9889 | 41.8362 | 14.0777 | 12.807 |
| DistillGPT2 | 1 | 0.8862 | 2.6329 | 3.9098 | nan | 12.242 | 12.1196 |
| DistilBertForMaskedLM | 64 | 0.6757 | 2.5734 | 4.5747 | 44.3736 | 11.6489 | 11.1026 |
| DistilBertForQuestionAnswering | 64 | 0.6812 | 2.6069 | 4.6045 | 41.2106 | 10.9448 | 10.5888 |
| BigBird | 1 | 3.3956 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 1.0 | 0.9092 | nan | 1.1724 | 1.0595 | 1.1588 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9155 | 0.3432 | 0.934 | 0.8564 | 1.0758 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | 1.0877 | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | 0.842 | 1.3737 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8215 | 1.1049 |
| DistillGPT2 | 1 | 0.9986 | 0.8218 | 0.3792 | nan | 0.8167 | 0.9378 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | 0.9742 | 0.8157 | 0.9642 |
| YituTechConvBert | 1 | 0.9858 | 0.8616 | 0.3686 | nan | 0.7968 | 0.8799 |
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.844 | 0.7929 | 0.9036 |
| PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | 1.0402 | 0.7774 | 0.9692 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.7734 | 0.9515 |
| M2M100ForConditionalGeneration | 8 | 0.9995 | 0.9448 | 0.3904 | 0.9926 | 0.7712 | 1.0075 |
| MT5ForConditionalGeneration | 8 | 1.0034 | 0.8867 | 0.415 | 0.9323 | 0.7627 | 0.9397 |
| GoogleFnet | 1 | 0.9629 | 0.9629 | 0.3852 | nan | 0.7589 | 0.969 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | 0.9933 | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8248 | 0.3613 | nan | 0.7485 | 0.9186 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.9443 | 0.7381 | 0.9055 |
| PLBartForConditionalGeneration | 16 | 1.0 | 0.8955 | 0.3584 | 1.0147 | 0.724 | 0.9375 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | 0.3438 | 0.9566 | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.7189 | 1.0294 |
| MegatronBertForCausalLM | 16 | 1.0 | 0.8826 | 0.352 | 1.0007 | 0.7161 | 0.9247 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 1.1607 | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3177 | 1.1317 | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 1.0067 | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3743 | 1.0014 | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | 0.3743 | 0.9997 | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.842 | 0.3524 | 0.871 | 0.6775 | 0.9179 |
| OPTForCausalLM | 32 | 0.9999 | 0.8655 | 0.3605 | 0.9159 | 0.6761 | 0.8845 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.888 | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.9904 | 0.6385 | 0.8992 |
| RobertaForCausalLM | 64 | 1.0 | 0.9206 | 0.3642 | 0.989 | 0.6375 | 0.8974 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | 0.6329 | 0.8939 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | 0.3242 | nan | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | 0.2587 | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | 0.9719 | 0.3862 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9637 | 1.042 | 0.3072 | 1.1342 | 0.2902 | 1.1339 |
| BigBird | 1 | 0.9548 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForMaskedLM | 4 | 382.8326 | 383.0125 | nan | 310.6197 | 305.0 | 305.3386 |
| AlbertForQuestionAnswering | 4 | 380.5344 | 379.5962 | nan | 307.0978 | 301.9886 | 302.7807 |
| BartForConditionalGeneration | 2 | 150.0118 | 151.6062 | nan | 337.8961 | 136.8385 | 137.4939 |
| BertForQuestionAnswering | 128 | 147.3579 | 148.1404 | nan | 143.4601 | 133.2165 | 133.3454 |
| RobertaForQuestionAnswering | 128 | 147.9479 | 148.81 | nan | 143.8389 | 131.8788 | 132.9877 |
| LayoutLMForMaskedLM | 16 | 136.7203 | 140.8239 | nan | 128.8331 | 113.0997 | 113.0326 |
| BlenderbotSmallForConditionalGeneration | 64 | 119.003 | 126.7452 | nan | 124.8529 | 112.3724 | 111.2709 |
| BartForCausalLM | 4 | 123.4086 | 127.3506 | 163.4741 | 124.2628 | 112.1893 | 111.1877 |
| PegasusForConditionalGeneration | 16 | 104.6211 | 106.754 | 137.5146 | 119.1405 | 95.9982 | 96.9807 |
| MBartForConditionalGeneration | 16 | 104.7459 | 106.7563 | 138.9083 | 114.8513 | 95.9609 | 96.898 |
| MobileBertForQuestionAnswering | 64 | 130.4311 | 139.4289 | 155.5727 | nan | 95.7697 | 106.7314 |
| BertForMaskedLM | 64 | 100.8983 | 104.9443 | 138.2649 | 103.7669 | 95.3203 | 95.0469 |
| RobertaForCausalLM | 64 | 109.1513 | 113.1428 | 146.6005 | 111.8055 | 94.8255 | 94.27 |
| ElectraForQuestionAnswering | 64 | 124.8743 | 126.6547 | nan | 105.7493 | 87.5176 | 88.7372 |
| LayoutLMForSequenceClassification | 16 | 113.4665 | 114.348 | 153.5552 | 101.3223 | 86.919 | 87.5996 |
| PegasusForCausalLM | 32 | 85.5172 | 89.5286 | 116.9102 | 89.7445 | 86.4125 | 84.9563 |
| MBartForCausalLM | 32 | 86.0133 | 90.1298 | 118.0831 | 90.1715 | 86.1638 | 84.8863 |
| TrOCRForCausalLM | 32 | 85.9562 | 90.1004 | 117.3379 | 90.0867 | 85.8262 | 84.6613 |
| DebertaForQuestionAnswering | 8 | 82.1068 | 82.5123 | 120.8167 | 94.9626 | 77.9305 | 66.9199 |
| ElectraForCausalLM | 32 | 105.7779 | 113.4079 | nan | 104.0291 | 74.9314 | 72.8632 |
| T5ForConditionalGeneration | 4 | 104.5653 | 111.501 | 144.5241 | 95.7283 | 71.9807 | 72.0107 |
| MegatronBertForCausalLM | 16 | 78.0876 | 79.6204 | 108.262 | 95.5379 | 71.1304 | 71.9673 |
| XGLMForCausalLM | 8 | 81.5474 | 84.8001 | 108.4665 | 257.7003 | 68.3111 | 67.7789 |
| MobileBertForMaskedLM | 32 | 129.2324 | 138.3697 | 123.4244 | nan | 67.423 | 88.9181 |
| MegatronBertForQuestionAnswering | 16 | 72.8592 | 73.0126 | 98.3637 | 88.0595 | 65.4795 | 66.7287 |
| BlenderbotSmallForCausalLM | 64 | 64.7213 | 71.0561 | 94.601 | 70.5141 | 64.3761 | 62.0355 |
| M2M100ForConditionalGeneration | 8 | 78.1047 | 90.9634 | 93.7314 | 125.2241 | 62.091 | 66.651 |
| DistilBertForMaskedLM | 64 | 63.2749 | 66.4553 | 88.8527 | 100.601 | 60.3527 | 59.3443 |
| GPT2ForSequenceClassification | 4 | 103.074 | 105.3216 | nan | 145.7116 | 56.837 | 57.2725 |
| DebertaForMaskedLM | 4 | 65.3283 | 73.1593 | 83.2718 | 92.5435 | 54.942 | 56.7658 |
| OPTForCausalLM | 32 | 61.8501 | 66.3346 | 87.9685 | 133.6294 | 52.2419 | 51.3511 |
| T5Small | 1 | 63.1059 | 58.4815 | 56.6413 | 57.9337 | 44.9876 | 49.646 |
| PLBartForConditionalGeneration | 16 | 49.0248 | 50.6323 | 61.1051 | 61.629 | 40.8444 | 40.8981 |
| PLBartForCausalLM | 32 | 41.3204 | 44.1368 | 57.6935 | 45.3571 | 40.2655 | 39.2084 |
| MT5ForConditionalGeneration | 8 | 74.9978 | 98.4122 | 63.4774 | 89.3138 | 34.4387 | 39.5985 |
| DistilBertForQuestionAnswering | 64 | 39.674 | 40.2857 | 55.9383 | 77.2402 | 33.9698 | 34.4452 |
| Speech2Text2ForCausalLM | 128 | 35.2457 | 38.2611 | 53.1997 | 38.0112 | 31.4997 | 30.4946 |
| YituTechConvBert | 1 | 48.1812 | 52.4119 | 28.437 | nan | 15.8887 | 37.3817 |
| CamemBert | 1 | 30.3234 | 31.3898 | 23.194 | nan | 13.2425 | 21.584 |
| GoogleFnet | 1 | 19.7358 | 22.9868 | 21.1114 | nan | 10.5617 | 17.3124 |
| DistillGPT2 | 1 | 19.3022 | 18.2603 | 16.5212 | nan | 8.6365 | 9.924 |
| BigBird | 1 | 187.569 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
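One common way to condense a speedup column like the ones above into a single number is a geometric mean over the models that ran. The sketch below assumes that a `0.0` speedup marks a failed run, because those entries line up with `nan` in the latency tables. That reading, and the helper name `geomean_speedup`, are assumptions rather than anything the dashboard states, and the dashboard may summarize its results differently.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean over successful runs.

    Entries that are None, nan, or <= 0.0 are treated as failed runs
    (an assumption about the tables) and excluded from the mean.
    Returns (geomean, n_successful, n_total).
    """
    ok = [s for s in speedups if s is not None and not math.isnan(s) and s > 0.0]
    if not ok:
        return float("nan"), 0, len(speedups)
    log_mean = sum(math.log(s) for s in ok) / len(ok)
    return math.exp(log_mean), len(ok), len(speedups)


# Inductor speedups for five huggingface models, copied from the table above.
inductor = {
    "YituTechConvBert": 3.3175,
    "CamemBert": 2.4625,
    "MT5ForConditionalGeneration": 2.2904,
    "BigBird": 0.0,
    "AllenaiLongformerBase": 0.0,
}

geomean, n_ok, n_total = geomean_speedup(list(inductor.values()))
print(f"geomean speedup {geomean:.3f}x over {n_ok}/{n_total} models")
```

Reporting the pass count next to the mean matters here: a backend that fails on many models can still show a high geomean over the few it runs.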

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | ghostnet_100 | 128 | 0.9995 | 0.9733 | 0.8279 | 1.2413 | 1.8602 | 1.8318 | | lcnet_050 | 128 | 0.9567 | 0.9493 | 0.7676 | 1.3713 | 1.6584 | 1.6348 | | xcit_large_24_p8_224 | 5 | 1.004 | 0.998 | 0.786 | 0.0 | 1.5164 | 1.3755 | | regnety_002 | 128 | 0.9758 | 1.0036 | 0.856 | 0.9692 | 1.504 | 1.3327 | | dm_nfnet_f0 | 128 | 0.9998 | 1.0007 | 0.0 | 1.1375 | 1.4732 | 1.4226 | | hrnet_w18 | 128 | 0.9998 | 0.9985 | 0.0 | 1.232 | 1.42 | 1.3794 | | volo_d1_224 | 64 | 0.9999 | 0.9956 | 0.8028 | 0.0 | 1.3865 | 1.3651 | | dla102 | 128 | 1.0 | 1.0008 | 0.0 | 1.2835 | 1.3838 | 1.3692 | | nfnet_l0 | 128 | 1.0001 | 0.788 | 0.0 | 1.1054 | 1.3716 | 1.3224 | | res2net50_14w_8s | 128 | 1.0 | 0.9992 | 0.0 | 1.2515 | 1.3554 | 1.3238 | | mobilenetv3_large_100 | 128 | 0.9656 | 0.9632 | 0.7645 | 1.2821 | 1.3368 | 1.3477 | | mobilenetv2_100 | 128 | 0.9667 | 0.9636 | 0.7061 | 1.288 | 1.3352 | 1.3494 | | crossvit_9_240 | 128 | 0.9999 | 0.9984 | 0.7557 | 1.0407 | 1.3277 | 1.3052 | | adv_inception_v3 | 128 | 1.0001 | 0.9987 | 0.0 | 1.1298 | 1.3276 | 1.3078 | | gluon_inception_v3 | 128 | 1.0001 | 0.9986 | 0.0 | 1.1292 | 1.3272 | 1.3091 | | inception_v3 | 128 | 1.0 | 0.998 | 0.0 | 1.1265 | 1.3264 | 1.3094 | | resnest101e | 64 | 0.9996 | 1.0029 | 0.0 | 1.1692 | 1.3168 | 1.2696 | | res2next50 | 128 | 0.9998 | 1.0006 | 0.0 | 1.1815 | 1.3114 | 1.2744 | | gmixer_24_224 | 128 | 0.9999 | 0.8344 | 0.0 | 1.0782 | 1.3113 | 1.2854 | | coat_lite_mini | 128 | 1.0 | 0.9999 | 0.8456 | 1.0932 | 1.3048 | 1.292 | | jx_nest_base | 32 | 1.0001 | 0.995 | 0.7366 | 0.0 | 1.2836 | 1.2544 | | botnet26t_256 | 128 | 0.9848 | 0.9846 
| 0.7895 | 0.0 | 1.2761 | 1.2717 | | fbnetv3_b | 128 | 0.9645 | 0.9609 | 0.7604 | 1.2365 | 1.2744 | 1.2934 | | selecsls42b | 128 | 0.9999 | 0.9989 | 0.8156 | 1.2158 | 1.2677 | 1.2537 | | mnasnet_100 | 128 | 0.9665 | 0.9625 | 0.7833 | 1.2559 | 1.2648 | 1.2817 | | tf_efficientnet_b0 | 128 | 0.9766 | 0.7838 | 0.0 | 1.1651 | 1.2598 | 1.2644 | | eca_botnext26ts_256 | 128 | 0.9864 | 0.7728 | 0.0 | 0.0 | 1.2546 | 1.2397 | | sebotnet33ts_256 | 64 | 0.976 | 0.8075 | 0.0 | 0.0 | 1.252 | 1.26 | | fbnetc_100 | 128 | 0.966 | 0.963 | 0.7905 | 1.2395 | 1.2513 | 1.2674 | | eca_halonext26ts | 128 | 0.9872 | 0.7789 | 0.0 | 0.0 | 1.2445 | 1.2216 | | cait_m36_384 | 4 | 0.9996 | 0.999 | 0.0 | 0.0 | 1.2433 | 1.2208 | | ese_vovnet19b_dw | 128 | 0.9791 | 0.9778 | 0.7438 | 1.1496 | 1.2378 | 1.2473 | | spnasnet_100 | 128 | 0.961 | 0.9577 | 0.7723 | 1.2245 | 1.2368 | 1.2547 | | res2net101_26w_4s | 64 | 0.9997 | 0.9966 | 0.7725 | 1.1161 | 1.23 | 1.1888 | | convit_base | 64 | 0.9996 | 0.9988 | 0.0 | 0.0 | 1.2242 | 1.2171 | | cspdarknet53 | 64 | 0.9573 | 0.954 | 0.7354 | 1.171 | 1.2198 | 1.2327 | | gmlp_s16_224 | 128 | 0.9999 | 0.9992 | 0.0 | 1.085 | 1.2167 | 1.2052 | | rexnet_100 | 128 | 0.9733 | 0.8152 | 0.0 | 1.1623 | 1.2119 | 1.22 | | pnasnet5large | 16 | 0.9996 | 0.9983 | 0.0 | 1.0901 | 1.2068 | 1.1916 | | pit_b_224 | 64 | 1.0003 | 0.9992 | 0.0 | 1.033 | 1.2038 | 1.1925 | | tinynet_a | 128 | 0.966 | 0.7749 | 0.6119 | 1.1489 | 1.1924 | 1.2021 | | dpn107 | 32 | 0.9584 | 0.9504 | 0.7793 | 1.0229 | 1.1877 | 1.2018 | | mobilevit_s | 64 | 0.9788 | 0.762 | 0.0 | 0.0 | 1.1726 | 1.1723 | | poolformer_m36 | 64 | 0.9999 | 0.9994 | 0.0 | 0.0 | 1.1666 | 1.1455 | | tf_mixnet_l | 128 | 0.9851 | 0.8896 | 0.0 | 1.0954 | 1.165 | 1.1663 | | repvgg_a2 | 128 | 0.9646 | 0.9629 | 0.8275 | 1.1371 | 1.1457 | 1.1478 | | mixnet_l | 128 | 0.9845 | 0.8859 | 0.0 | 1.0993 | 1.1455 | 1.1449 | | convnext_base | 64 | 0.9999 | 0.9986 | 0.0 | 0.0 | 1.1427 | 1.1097 | | swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9784 | 0.0 
| 0.0 | 1.1398 | 1.1319 | | twins_pcpvt_base | 64 | 1.0002 | 0.9986 | 0.7496 | 0.0 | 1.1307 | 1.1057 | | beit_base_patch16_224 | 64 | 0.9998 | 0.9804 | 0.0 | 0.0 | 1.115 | 1.1035 | | swsl_resnext101_32x16d | 32 | 0.9999 | 0.9992 | 0.0 | 1.109 | 1.11 | 1.0708 | | deit_base_distilled_patch16_224 | 64 | 1.0 | 0.999 | 0.7677 | 0.9806 | 1.0952 | 1.083 | | vit_base_patch16_224 | 64 | 0.9997 | 0.9987 | 0.7678 | 0.9512 | 1.0902 | 1.0739 | | gluon_xception65 | 32 | 0.9999 | 0.997 | 0.0 | 1.0812 | 1.0877 | 1.0753 | | mixer_b16_224 | 128 | 1.0001 | 1.0003 | 0.0 | 0.8941 | 1.0825 | 1.0722 | | convmixer_768_32 | 32 | 0.9999 | 1.0001 | 0.0 | 0.0 | 1.0778 | 1.0748 | | gernet_l | 128 | 0.974 | 0.9731 | 0.8233 | 1.0986 | 1.0742 | 1.072 | | visformer_small | 128 | 0.9998 | 1.0013 | 0.7977 | 0.0 | 1.0411 | 1.0084 | | resmlp_12_224 | 128 | 0.9998 | 0.9983 | 0.6947 | 1.2133 | 1.0108 | 0.9966 | | tnt_s_patch16_224 | 128 | 0.9997 | 0.9985 | 0.0 | 0.0 | 0.0 | 1.5454 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | 
pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | 
pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | 
aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | twins_pcpvt_base | 64 | 2.2703 | 10.8035 | 18.5147 | nan | 115.7371 | 112.6876 | | hrnet_w18 | 128 | 6.2167 | 25.1556 | nan | 669.1961 | 113.693 | 107.3021 | | swin_base_patch4_window7_224 | 64 | 2.7415 | 10.2335 | nan | nan | 77.704 | 76.2832 | | mobilevit_s | 64 | 1.7197 | 6.0036 | nan | nan | 76.7031 | 74.9245 | | pnasnet5large | 16 | 4.6831 | 18.3408 | nan | 359.6742 | 76.4432 | 71.2181 | | xcit_large_24_p8_224 | 5 | 2.8779 | 13.6844 | 26.0114 | nan | 76.3666 | 71.1389 | | cait_m36_384 | 4 | 3.1177 | 14.7064 | nan | nan | 65.403 | 60.3843 | | convnext_base | 64 | 1.3223 | 5.0904 | nan | nan | 63.8614 | 61.3581 | | resnest101e | 64 | 3.3981 | 12.95 | nan | 277.242 | 61.4838 | 57.6169 | | jx_nest_base | 32 | 1.7379 | 7.4118 | 12.9995 | nan | 54.5834 | 52.4333 | | res2net101_26w_4s | 64 | 3.1441 | 13.7228 | 23.6803 | 241.1547 | 52.7449 | 49.4097 | | eca_halonext26ts | 128 | 1.419 | 4.5093 | nan | nan | 51.2401 | 49.947 | | coat_lite_mini | 128 | 1.1277 | 4.2761 | 6.8366 | 98.5454 | 48.8688 | 47.5401 | | res2net50_14w_8s | 128 | 2.8031 | 12.1065 | nan | 261.7347 | 48.3082 | 46.7948 | | poolformer_m36 | 64 | 1.7463 | 6.4854 | nan | nan | 48.0215 | 47.6687 | | sebotnet33ts_256 | 64 | 1.669 | 5.2686 | nan | nan | 43.9032 | 43.0309 | | gmlp_s16_224 | 128 | 1.047 | 5.2116 | nan | 156.4858 | 40.2012 | 38.0525 | | dpn107 | 32 | 4.1808 | 11.8108 | 36.0856 | 184.8839 | 39.4372 | 37.2134 | | fbnetv3_b | 128 | 3.3613 | 9.8974 | 24.9013 | 249.2909 | 38.923 | 35.9412 | | gluon_xception65 | 32 | 1.985 | 8.8835 | nan | 154.7623 | 37.1721 | 34.6029 | | crossvit_9_240 | 128 | 1.5606 | 6.615 | 10.8164 | 173.0489 | 36.957 | 34.866 | | eca_botnext26ts_256 | 128 | 1.412 | 4.3425 | nan | nan | 36.7377 | 35.3211 | | volo_d1_224 | 64 | 1.3274 | 6.083 | 10.0022 | nan | 36.6907 | 35.5262 
| | tf_mixnet_l | 128 | 5.8586 | 11.4525 | nan | 154.9117 | 34.3258 | 31.0565 | | adv_inception_v3 | 128 | 1.6353 | 7.0018 | nan | 149.0446 | 34.1282 | 31.3658 | | ghostnet_100 | 128 | 3.095 | 8.4106 | 12.126 | 165.9901 | 33.6961 | 30.6639 | | gluon_inception_v3 | 128 | 1.6493 | 6.8111 | nan | 148.2294 | 33.4059 | 31.309 | | inception_v3 | 128 | 1.6587 | 6.9521 | nan | 151.5549 | 33.3694 | 31.3943 | | botnet26t_256 | 128 | 1.5082 | 4.1367 | 8.4049 | nan | 32.9775 | 31.9359 | | mixnet_l | 128 | 5.4903 | 11.1416 | nan | 151.1993 | 32.3336 | 29.425 | | dla102 | 128 | 1.8825 | 8.302 | nan | 181.7838 | 32.1479 | 29.2619 | | swsl_resnext101_32x16d | 32 | 1.7603 | 7.5763 | nan | 131.6982 | 31.1853 | 29.2609 | | gmixer_24_224 | 128 | 1.159 | 5.8575 | nan | 143.8761 | 30.9049 | 29.1822 | | dm_nfnet_f0 | 128 | 2.0409 | 6.1464 | nan | 156.4106 | 30.6019 | 28.5654 | | res2next50 | 128 | 1.6166 | 6.7418 | nan | 165.6631 | 28.207 | 26.0764 | | convit_base | 64 | 1.1393 | 4.6619 | nan | nan | 28.205 | 26.3049 | | rexnet_100 | 128 | 1.8931 | 6.2956 | nan | 144.5123 | 27.1619 | 24.9295 | | tinynet_a | 128 | 2.1813 | 6.8795 | 17.9192 | 143.082 | 26.392 | 24.3741 | | tf_efficientnet_b0 | 128 | 1.8625 | 5.7862 | nan | 127.2509 | 23.0203 | 21.7083 | | visformer_small | 128 | 1.0309 | 3.4918 | 5.3997 | nan | 22.8613 | 21.0854 | | mixer_b16_224 | 128 | 0.7495 | 2.6657 | nan | 71.301 | 22.8253 | 21.0477 | | cspdarknet53 | 64 | 2.3971 | 6.4294 | 16.9981 | 123.3593 | 22.7939 | 21.2386 | | convmixer_768_32 | 32 | 1.1908 | 5.1353 | nan | nan | 22.0254 | 20.497 | | nfnet_l0 | 128 | 1.8983 | 6.6577 | nan | 134.8562 | 21.969 | 20.0261 | | spnasnet_100 | 128 | 2.0815 | 5.8617 | 15.2523 | 109.035 | 21.6603 | 20.4581 | | fbnetc_100 | 128 | 2.0848 | 5.6339 | 15.5453 | 112.0608 | 21.5406 | 20.2234 | | resmlp_12_224 | 128 | 0.709 | 2.4268 | 3.9419 | 31.3307 | 21.4961 | 20.6781 | | mobilenetv3_large_100 | 128 | 1.5952 | 4.8104 | 11.779 | 122.432 | 20.2777 | 19.1933 | | deit_base_distilled_patch16_224 | 
64 | 0.9352 | 3.561 | 5.5944 | 71.3377 | 19.9577 | 18.3657 | | repvgg_a2 | 128 | 2.043 | 5.2852 | 14.0603 | 156.8582 | 19.9564 | 18.882 | | beit_base_patch16_224 | 64 | 1.2273 | 4.4368 | nan | nan | 19.7423 | 18.7951 | | pit_b_224 | 64 | 0.9261 | 3.8243 | nan | 92.553 | 19.6736 | 19.2929 | | vit_base_patch16_224 | 64 | 0.8993 | 3.7123 | 5.5439 | 80.1553 | 19.1383 | 17.9592 | | mobilenetv2_100 | 128 | 1.7463 | 4.8527 | 12.2064 | 96.898 | 18.9879 | 18.4937 | | mnasnet_100 | 128 | 1.7507 | 4.6109 | 12.2981 | 89.6036 | 18.4327 | 17.0969 | | gernet_l | 128 | 2.0165 | 5.2894 | 13.8987 | 87.6961 | 18.1267 | 16.7255 | | regnety_002 | 128 | 1.6199 | 4.6006 | 11.3338 | 93.4709 | 17.883 | 16.5902 | | selecsls42b | 128 | 0.939 | 3.048 | 4.9076 | 77.2936 | 15.8738 | 14.8904 | | lcnet_050 | 128 | 1.0597 | 3.1335 | 6.7006 | 67.9278 | 13.294 | 12.2485 | | ese_vovnet19b_dw | 128 | 1.0843 | 2.6729 | 6.0204 | 54.9306 | 13.2012 | 11.7112 | | tnt_s_patch16_224 | 128 | 1.7198 | 8.2892 | nan | nan | nan | 31.8239 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 1.6177 | 1.5612 | 1.6333 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.9898 | 1.351 | 1.5843 | | nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.7757 | 1.2908 | 1.4944 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2619 | 1.4738 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.2059 | 1.3819 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1917 | 1.1875 | 1.3423 | | 
mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1752 | 1.2828 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1133 | 1.1803 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | nan | 1.1107 | 1.3608 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1106 | 1.3328 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0528 | 1.0689 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.7593 | 1.0219 | 1.0956 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 1.2304 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.9926 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 1.2337 | 0.9925 | 1.0805 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 0.9853 | 1.1265 | | mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 1.4726 | 0.985 | 1.0538 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9799 | 1.0636 | | tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9769 | 1.1451 | | gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 1.0153 | 0.9766 | 0.9827 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | nan | 0.9764 | 1.0866 | | dla102 | 128 | 0.9831 | 0.917 | nan | 0.953 | 0.9632 | 1.0419 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | 0.3296 | nan | 0.9616 | 1.054 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.9576 | 0.9919 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 0.9519 | 1.0925 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.942 | 0.9938 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9408 | 1.0412 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9382 | 0.993 | | hrnet_w18 | 128 | 0.9954 | 
0.9252 | nan | 0.8647 | 0.9379 | 1.0122 | | jx_nest_base | 32 | 1.0002 | 0.8966 | 0.2864 | nan | 0.9348 | 1.0603 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9325 | 0.9919 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9285 | 1.0154 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.9152 | 0.9655 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0636 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9078 | 1.0156 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9069 | 1.0618 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9068 | 1.0516 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.8942 | 0.9056 | 0.9905 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9052 | 1.0666 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9049 | 0.9968 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9034 | 0.9939 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.899 | 1.0046 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | 0.9946 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8822 | 1.0206 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8617 | 1.0396 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8525 | 1.0753 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.9856 | 0.8441 | 1.0596 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.841 | 0.9709 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | 1.3763 | 0.8169 | 0.8253 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9926 | | convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8657 | 0.282 | 1.1222 | 0.6493 | 
0.869 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.657 | 0.5319 | 0.8172 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8623 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 365.111 | 364.9399 | nan | nan | 338.943 | 339.7685 | | hrnet_w18 | 128 | 416.8161 | 417.0336 | nan | 337.7725 | 293.4172 | 301.3698 | | pnasnet5large | 16 | 289.1341 | 289.1286 | nan | 265.0378 | 239.2994 | 242.4489 | | convnext_base | 64 | 262.882 | 263.2505 | nan | nan | 231.6086 | 237.83 | | tf_mixnet_l | 128 | 256.7006 | 284.2107 | nan | 230.8016 | 217.1257 | 216.9414 | | mixnet_l | 128 | 247.2735 | 274.796 | nan | 221.4112 | 212.4932 | 212.5998 | | swin_base_patch4_window7_224 | 64 | 237.1057 | 241.9696 | nan | nan | 208.1116 | 209.234 | | swsl_resnext101_32x16d | 32 | 219.4344 | 219.4514 | nan | 198.1393 | 198.0644 | 204.7625 | | dla102 | 128 | 269.5879 | 269.3262 | nan | 209.8383 | 194.6369 | 196.7843 | | resnest101e | 64 | 230.111 | 229.1924 | nan | 196.9169 | 175.3151 | 181.1929 | | cait_m36_384 | 4 | 216.6862 | 216.8107 | nan | nan | 174.1792 | 177.1459 | | inception_v3 | 128 | 226.4642 | 226.8107 | nan | 201.2808 | 170.7597 | 172.845 | | gluon_inception_v3 | 128 | 226.5121 | 226.7931 | nan | 200.7287 | 170.7548 | 173.0155 | | adv_inception_v3 | 128 | 226.2896 | 226.8108 | nan | 200.573 | 170.7204 | 173.1375 | | res2net50_14w_8s | 128 | 229.1721 | 229.127 | nan | 183.2535 | 169.0841 | 173.2397 | | gluon_xception65 | 32 | 182.7024 | 182.9733 | nan | 
168.8103 | 167.8831 | 169.7134 | | convit_base | 64 | 196.5363 | 196.4872 | nan | nan | 160.5023 | 161.2492 | | res2next50 | 128 | 206.7413 | 206.4319 | nan | 174.9524 | 157.6515 | 162.0066 | | dpn107 | 32 | 191.1034 | 192.2145 | 234.4799 | 178.6931 | 153.8919 | 152.1858 | | gernet_l | 128 | 165.1028 | 165.2651 | 193.6984 | 146.5195 | 149.7263 | 149.9212 | | poolformer_m36 | 64 | 174.6199 | 174.416 | nan | nan | 149.514 | 152.242 | | coat_lite_mini | 128 | 191.5054 | 191.5188 | 226.5125 | 175.1621 | 146.8065 | 148.1677 | | mixer_b16_224 | 128 | 157.975 | 157.9309 | nan | 176.4451 | 146.2882 | 147.5323 | | dm_nfnet_f0 | 128 | 206.165 | 206.185 | nan | 180.8669 | 139.6476 | 144.8109 | | eca_halonext26ts | 128 | 169.3729 | 214.784 | nan | nan | 134.2369 | 136.8683 | | pit_b_224 | 64 | 158.3989 | 158.5075 | nan | 153.3343 | 131.6085 | 132.7689 | | eca_botnext26ts_256 | 128 | 163.498 | 208.5446 | nan | nan | 128.5416 | 129.914 | | nfnet_l0 | 128 | 176.3515 | 222.957 | nan | 158.8824 | 128.5321 | 132.7612 | | gmlp_s16_224 | 128 | 152.0286 | 152.1582 | nan | 140.0478 | 124.8602 | 126.0571 | | res2net101_26w_4s | 64 | 151.8488 | 151.954 | 196.6195 | 135.8192 | 123.5891 | 127.3953 | | visformer_small | 128 | 128.4297 | 128.1445 | 160.8353 | nan | 123.3099 | 127.2151 | | fbnetv3_b | 128 | 162.4775 | 163.1144 | 206.1833 | 126.7442 | 123.1711 | 121.1677 | | twins_pcpvt_base | 64 | 137.0113 | 137.1728 | 183.0654 | nan | 121.5887 | 123.947 | | botnet26t_256 | 128 | 152.3126 | 152.3467 | 190.0334 | nan | 117.566 | 117.9954 | | beit_base_patch16_224 | 64 | 128.3552 | 130.9056 | nan | nan | 115.2667 | 116.332 | | gmixer_24_224 | 128 | 146.3097 | 175.2799 | nan | 135.6481 | 111.8038 | 113.8962 | | volo_d1_224 | 64 | 153.2646 | 153.8624 | 191.3647 | nan | 110.7935 | 112.2007 | | vit_base_patch16_224 | 64 | 119.0363 | 119.0414 | 155.0992 | 125.0718 | 109.5879 | 110.7269 | | deit_base_distilled_patch16_224 | 64 | 119.7569 | 119.811 | 156.0149 | 122.0218 | 109.4133 | 110.5072 | | 
repvgg_a2 | 128 | 127.3251 | 127.4064 | 146.4939 | 107.8391 | 107.2611 | 106.8466 | | tf_efficientnet_b0 | 128 | 134.031 | 167.0622 | nan | 112.2047 | 103.928 | 103.5066 | | cspdarknet53 | 64 | 130.3415 | 130.7844 | 169.5669 | 106.6907 | 102.3307 | 101.1297 | | mobilevit_s | 64 | 117.1035 | 150.2707 | nan | nan | 97.6771 | 97.8287 | | xcit_large_24_p8_224 | 5 | 134.6892 | 135.8473 | 173.067 | nan | 96.4114 | 98.5532 | | rexnet_100 | 128 | 119.167 | 142.2284 | nan | 99.8027 | 95.7025 | 95.1033 | | fbnetc_100 | 128 | 123.4115 | 123.7422 | 150.9785 | 96.1535 | 95.3699 | 94.0607 | | jx_nest_base | 32 | 121.4609 | 121.7775 | 164.5655 | nan | 94.7023 | 96.4987 | | sebotnet33ts_256 | 64 | 114.3971 | 138.4302 | nan | nan | 89.1472 | 88.5199 | | tinynet_a | 128 | 109.8563 | 137.008 | 173.7498 | 92.4521 | 89.1056 | 88.2338 | | spnasnet_100 | 128 | 106.0072 | 106.2683 | 132.1701 | 83.1326 | 82.4383 | 81.1595 | | ese_vovnet19b_dw | 128 | 99.4529 | 99.7012 | 131.0904 | 84.8819 | 78.7551 | 78.1399 | | mnasnet_100 | 128 | 98.6467 | 99.0272 | 121.8108 | 75.8945 | 75.498 | 74.4164 | | crossvit_9_240 | 128 | 98.5068 | 98.4958 | 130.3508 | 94.6172 | 74.2308 | 75.292 | | selecsls42b | 128 | 89.6255 | 89.6797 | 109.8696 | 73.7537 | 70.6715 | 71.491 | | mobilenetv2_100 | 128 | 97.5532 | 97.9393 | 133.7157 | 73.1942 | 70.6457 | 69.9328 | | resmlp_12_224 | 128 | 71.2186 | 71.2238 | 102.4712 | 58.6466 | 70.5195 | 71.4086 | | mobilenetv3_large_100 | 128 | 85.5689 | 85.7007 | 108.0917 | 64.3748 | 61.8048 | 61.302 | | ghostnet_100 | 128 | 114.5835 | 117.7202 | 138.5476 | 92.3419 | 61.6325 | 62.581 | | regnety_002 | 128 | 53.3285 | 52.1444 | 60.2863 | 52.6015 | 35.0371 | 39.394 | | lcnet_050 | 128 | 38.2558 | 38.6404 | 47.7955 | 26.7251 | 22.1028 | 22.4227 | | tnt_s_patch16_224 | 128 | 470.7253 | 471.2808 | nan | nan | nan | 304.4949 | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/afcKQQv.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/SiDMhDj.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/ASuYtdg.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 98%, 41/42  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 95%, 40/42  | 97%, 59/61  |
|     aot_cudagraphs     | 80%, 43/54 | 86%, 36/42  | 90%, 55/61  |
|    nvprims_nvfuser     | 56%, 30/54 |  10%, 4/42  | 52%, 32/61  |
|        inductor        | 83%, 45/54 | 90%, 38/42  | 92%, 56/61  |
| inductor_no_cudagraphs | 87%, 47/54 | 90%, 38/42  | 92%, 56/61  |
+------------------------+------------+-------------+-------------+
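The `NN%, passed/total` cells above can be reproduced from per-model accuracy statuses. The following is a minimal sketch, not the dashboard's actual code: the function name `format_passrate` is hypothetical, and it assumes that any status other than `pass` (for example `fail_to_run` or `fail_accuracy`) counts as a failure.

```python
def format_passrate(statuses):
    """Return a 'NN%, passed/total' string for a list of accuracy statuses.

    Assumption: only the exact status "pass" counts as passing.
    """
    total = len(statuses)
    passed = sum(1 for status in statuses if status == "pass")
    return f"{passed / total:.0%}, {passed}/{total}"


# Hypothetical status list shaped like the eager/torchbench cell: 52 of 54 pass.
example = ["pass"] * 52 + ["fail_to_run", "fail_accuracy"]
print(format_passrate(example))  # prints "96%, 52/54"
```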

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.24x    |    1.11x    |    1.00x    |
|    nvprims_nvfuser     |   1.01x    |    1.04x    |    1.09x    |
|        inductor        |   1.90x    |    1.82x    |    1.42x    |
| inductor_no_cudagraphs |   1.38x    |    1.57x    |    1.37x    |
+------------------------+------------+-------------+-------------+
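As an illustration of how a geometric mean speedup of this kind could be computed, here is a minimal sketch. It assumes that per-model speedup is eager latency divided by backend latency, and that models failing the accuracy check are dropped first, as the summary describes. The function name and the sample numbers are hypothetical.

```python
from statistics import geometric_mean


def geomean_speedup(eager_ms, backend_ms, accuracy):
    """Geometric mean of eager/backend latency ratios over passing models.

    Assumptions: latencies are positive milliseconds keyed by model name,
    and only models with accuracy status "pass" are included.
    """
    speedups = [
        eager_ms[name] / backend_ms[name]
        for name in eager_ms
        if name in backend_ms and accuracy.get(name) == "pass"
    ]
    if not speedups:
        return float("nan")  # nothing left after filtering
    return geometric_mean(speedups)


# Hypothetical data: model_c fails accuracy and is excluded.
eager = {"model_a": 100.0, "model_b": 200.0, "model_c": 50.0}
backend = {"model_a": 25.0, "model_b": 200.0, "model_c": 100.0}
status = {"model_a": "pass", "model_b": "pass", "model_c": "fail_accuracy"}
print(f"{geomean_speedup(eager, backend, status):.2f}x")
```

The geometric mean is used rather than the arithmetic mean so that a 4x speedup and a 1x speedup average to 2x, and a 2x speedup and a 0.5x slowdown cancel out.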

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.04    |    2.79     |    2.31     |
|       aot_eager        |    6.62    |    10.06    |    8.74     |
|     aot_cudagraphs     |    9.85    |    17.73    |    16.18    |
|    nvprims_nvfuser     |   63.52    |   113.44    |   148.51    |
|        inductor        |   32.87    |    35.84    |    43.12    |
| inductor_no_cudagraphs |   32.63    |    31.50    |    41.23    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.83x    |    0.89x    |    0.88x    |
|     aot_cudagraphs     |   0.41x    |    0.38x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.01x    |    0.86x    |
|        inductor        |   0.83x    |    0.88x    |    0.95x    |
| inductor_no_cudagraphs |   0.95x    |    1.05x    |    1.06x    |
+------------------------+------------+-------------+-------------+
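One way to read these ratios: assuming the compression ratio is the native pytorch peak memory divided by the backend's peak memory (an assumption consistent with "higher is better", not something stated above), a ratio below 1.0 means the backend used more memory than native pytorch. A minimal sketch with a hypothetical function name and hypothetical byte counts:

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Peak memory compression ratio; values above 1.0 mean less memory used.

    Assumption: ratio = eager peak memory / backend peak memory.
    """
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical measurement: the backend peaks at 3x the eager footprint.
ratio = compression_ratio(eager_peak_bytes=4 * 2**30, backend_peak_bytes=12 * 2**30)
print(f"{ratio:.2f}x")
```

Under that assumption, the roughly 0.33x to 0.41x values for aot_cudagraphs would correspond to about 2.5x to 3x the native peak memory.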

Warnings

We flag models where:

- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings
~~~
+-------------+-----------------------+----------+------------------------+
| suite       | name                  | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  | hf_BigBird            | 0.0      | 0.0                    |
| torchbench  | hf_Longformer         | 0.0      | 0.0                    |
| torchbench  | hf_GPT2_large         | 0.0      | 1.8602                 |
| torchbench  | dlrm                  | 0.0      | 0.0                    |
| torchbench  | tacotron2             | 0.0      | 0.8863                 |
| torchbench  | moco                  | 0.0      | 0.0                    |
| huggingface | BigBird               | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase | 0.0      | 0.0                    |
| timm_models | convnext_base         | 0.6642   | 0.6591                 |
| timm_models | eca_halonext26ts      | 0.0      | 0.0                    |
+-------------+-----------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-------------------+----------+------------------------+
| suite       | name              | inductor | inductor_no_cudagraphs |
+-------------+-------------------+----------+------------------------+
| torchbench  | yolov3            | 409.4508 | 402.7862               |
| torchbench  | timm_efficientdet | 144.3823 | 142.0056               |
| torchbench  | hf_T5_large       | 128.7946 | 123.3967               |
| timm_models | hrnet_w18         | 142.5323 | 131.6701               |
| timm_models | twins_pcpvt_base  | 126.8506 | 125.7099               |
+-------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+----------------------------------+----------+------------------------+
| suite       | name                             | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench  | timm_vision_transformer_large    | 0.879    | 0.9541                 |
| torchbench  | timm_resnest                     | 0.8759   | 0.9953                 |
| torchbench  | densenet121                      | 0.8753   | 1.0051                 |
| torchbench  | squeezenet1_1                    | 0.8735   | 1.0608                 |
| torchbench  | hf_Bert                          | 0.8735   | 0.942                  |
| torchbench  | shufflenet_v2_x1_0               | 0.8692   | 0.9802                 |
| torchbench  | resnet50                         | 0.8658   | 0.885                  |
| torchbench  | fastNLP_Bert                     | 0.8657   | 1.0681                 |
| torchbench  | Background_Matting               | 0.8561   | 1.0426                 |
| torchbench  | hf_T5_large                      | 0.8541   | 0.8541                 |
| torchbench  | hf_DistilBert                    | 0.8384   | 0.9049                 |
| torchbench  | hf_Bart                          | 0.8232   | 1.0097                 |
| torchbench  | alexnet                          | 0.7973   | 1.0079                 |
| torchbench  | mobilenet_v3_large               | 0.791    | 0.8143                 |
| torchbench  | timm_vovnet                      | 0.7799   | 0.8875                 |
| torchbench  | pytorch_stargan                  | 0.7783   | 0.8847                 |
| torchbench  | resnext50_32x4d                  | 0.7644   | 0.7753                 |
| torchbench  | vgg16                            | 0.7633   | 1.0588                 |
| torchbench  | mnasnet1_0                       | 0.7541   | 0.7741                 |
| torchbench  | drq                              | 0.752    | 0.9256                 |
| torchbench  | soft_actor_critic                | 0.7295   | 1.0368                 |
| torchbench  | LearningToPaint                  | 0.7295   | 0.925                  |
| torchbench  | timm_vision_transformer          | 0.7151   | 0.7249                 |
| torchbench  | resnet18                         | 0.6102   | 0.6257                 |
| torchbench  | hf_Reformer                      | 0.5851   | 1.0017                 |
| torchbench  | lennard_jones                    | 0.564    | 0.9991                 |
| torchbench  | nvidia_deeprecommender           | 0.5596   | 0.5596                 |
| torchbench  | functorch_dp_cifar10             | 0.4478   | 0.4688                 |
| torchbench  | pytorch_struct                   | 0.4235   | 0.4353                 |
| torchbench  | dcgan                            | 0.2123   | 0.2137                 |
| torchbench  | tacotron2                        | nan      | 0.4112                 |
| huggingface | MegatronBertForQuestionAnswering | 0.893    | 1.0179                 |
| huggingface | MegatronBertForCausalLM          | 0.8919   | 1.0276                 |
| huggingface | PLBartForConditionalGeneration   | 0.8843   | 1.0284                 |
| huggingface | DistilBertForMaskedLM            | 0.8803   | 0.948                  |
| huggingface | MT5ForConditionalGeneration      | 0.8751   | 0.919                  |
| huggingface | Speech2Text2ForCausalLM          | 0.8691   | 0.9801                 |
| huggingface | ElectraForCausalLM               | 0.856    | 0.9327                 |
| huggingface | PLBartForCausalLM                | 0.8549   | 0.9361                 |
| huggingface | BlenderbotSmallForCausalLM       | 0.846    | 0.9426                 |
| huggingface | CamemBert                        | 0.8061   | 0.9309                 |
| huggingface | XGLMForCausalLM                  | 0.8055   | 0.9905                 |
| huggingface | DistillGPT2                      | 0.8046   | 1.024                  |
| huggingface | YituTechConvBert                 | 0.791    | 0.9314                 |
| huggingface | M2M100ForConditionalGeneration   | 0.752    | 1.0322                 |
| huggingface | MobileBertForMaskedLM            | 0.6698   | 0.9454                 |
| huggingface | MobileBertForQuestionAnswering   | 0.6085   | 0.8221                 |
| huggingface | DebertaForMaskedLM               | 0.409    | 1.0674                 |
| huggingface | DebertaForQuestionAnswering      | 0.3071   | 1.1616                 |
| timm_models | res2net101_26w_4s                | 0.8977   | 0.973                  |
| timm_models | gluon_xception65                 | 0.8975   | 0.9763                 |
| timm_models | gluon_inception_v3               | 0.8975   | 1.0248                 |
| timm_models | inception_v3                     | 0.8975   | 1.0248                 |
| timm_models | adv_inception_v3                 | 0.8975   | 1.0248                 |
| timm_models | fbnetc_100                       | 0.8973   | 0.9876                 |
| timm_models | hrnet_w18                        | 0.8969   | 1.0032                 |
| timm_models | selecsls42b                      | 0.8926   | 0.9897                 |
| timm_models | vit_base_patch16_224             | 0.8916   | 0.8968                 |
| timm_models | deit_base_distilled_patch16_224  | 0.8911   | 0.8962                 |
| timm_models | convnext_base                    | 0.885    | 0.9865                 |
| timm_models | spnasnet_100                     | 0.8795   | 0.9819                 |
| timm_models | res2net50_14w_8s                 | 0.877    | 0.9738                 |
| timm_models | res2next50                       | 0.8719   | 0.9671                 |
| timm_models | mnasnet_100                      | 0.871    | 0.9804                 |
| timm_models | mixnet_l                         | 0.8701   | 1.0089                 |
| timm_models | gernet_l                         | 0.8619   | 0.9858                 |
| timm_models | cspdarknet53                     | 0.8607   | 1.0102                 |
| timm_models | botnet26t_256                    | 0.8503   | 0.9434                 |
| timm_models | lcnet_050                        | 0.8449   | 0.9432                 |
| timm_models | regnety_002                      | 0.8371   | 1.0078                 |
| timm_models | crossvit_9_240                   | 0.8174   | 1.0976                 |
| timm_models | resmlp_12_224                    | 0.8092   | 0.8236                 |
| timm_models | coat_lite_mini                   | 0.8033   | 1.0359                 |
| timm_models | swin_base_patch4_window7_224     | 0.7566   | 0.9257                 |
| timm_models | sebotnet33ts_256                 | 0.7449   | 0.8293                 |
| timm_models | jx_nest_base                     | 0.6707   | 0.8617                 |
| timm_models | repvgg_a2                        | 0.5534   | 0.8298                 |
+-------------+----------------------------------+----------+------------------------+
~~~
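The flagging rules can be sketched as a simple filter over per-model rows. This is an illustrative sketch only: the helper `flag_rows` and the constant names are hypothetical, and two behaviours are assumptions inferred from the tables rather than stated rules: a row is flagged when either inductor column crosses the threshold, and `nan` entries never trigger a flag on their own.

```python
import math

# Thresholds taken from the warning criteria listed in this report.
SPEEDUP_MIN = 0.95
COMPILE_LATENCY_MAX_SEC = 120.0
COMPRESSION_MIN = 0.9

COLUMNS = ("inductor", "inductor_no_cudagraphs")


def flag_rows(rows, is_bad, columns=COLUMNS):
    """Return the rows where any non-NaN column value satisfies is_bad."""
    flagged = []
    for row in rows:
        values = [row[column] for column in columns]
        if any(not math.isnan(value) and is_bad(value) for value in values):
            flagged.append(row)
    return flagged


# Speedup rows copied from the tables in this report.
speedups = [
    {"name": "hf_GPT2_large", "inductor": 0.0, "inductor_no_cudagraphs": 1.8602},
    {"name": "tacotron2", "inductor": 0.0, "inductor_no_cudagraphs": 0.8863},
    {"name": "densenet121", "inductor": 6.6286, "inductor_no_cudagraphs": 1.3184},
]
slow = flag_rows(speedups, lambda value: value < SPEEDUP_MIN)
print([row["name"] for row in slow])  # prints ['hf_GPT2_large', 'tacotron2']
```

The same helper covers the other two checks by swapping the predicate, for example `lambda value: value > COMPILE_LATENCY_MAX_SEC` or `lambda value: value < COMPRESSION_MIN`.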

Metrics over time

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/RcWEaWw.png)

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/Z1Hq4wV.png)

Accuracy Regressions

For each relevant compiler, we compare the two most recent reports (those that actually ran the compiler) to find models where previously successful accuracy tests now fail.

No accuracy regressions found.
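The regression check described here amounts to a comparison of two status maps. A minimal sketch, assuming each report is available as a dict from model name to accuracy status; the function name and the sample data are hypothetical, and models absent from the previous report are not treated as regressions.

```python
def find_accuracy_regressions(previous, current):
    """Return sorted model names that passed previously but do not pass now.

    Assumption: both arguments map model name -> status string, where
    "pass" is the only passing status.
    """
    return sorted(
        name
        for name, status in current.items()
        if previous.get(name) == "pass" and status != "pass"
    )


# Hypothetical reports for a single compiler.
previous_report = {"model_a": "pass", "model_b": "pass", "model_c": "fail_to_run"}
current_report = {"model_a": "pass", "model_b": "fail_accuracy", "model_c": "fail_to_run"}
print(find_accuracy_regressions(previous_report, current_report))  # prints ['model_b']
```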

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0007 | 0.9201 | 2.4762 | 0.7292 | 6.6286 | 1.3184 | | functorch_dp_cifar10 | 64 | 1.0011 | 0.9483 | 2.4456 | 0.0 | 4.906 | 1.3337 | | timm_efficientdet | 1 | 0.9854 | 0.8043 | 2.092 | 0.0 | 4.6683 | 1.5283 | | resnext50_32x4d | 8 | 1.001 | 0.9531 | 1.8575 | 0.7433 | 3.888 | 1.2605 | | BERT_pytorch | 16 | 1.0094 | 0.8336 | 1.5464 | 0.768 | 3.5373 | 2.3432 | | timm_vision_transformer | 8 | 1.0036 | 0.8522 | 1.9738 | 0.5984 | 3.3298 | 1.5809 | | hf_T5_large | 2 | 1.0215 | 0.8692 | 0.0 | 0.0 | 3.0354 | 2.1326 | | mobilenet_v3_large | 32 | 1.0031 | 1.0004 | 1.4991 | 0.7689 | 3.0238 | 1.4052 | | resnet18 | 16 | 1.0033 | 0.999 | 1.7155 | 0.8001 | 2.9759 | 1.2426 | | drq | 1 | 0.9935 | 0.8232 | 1.9496 | 0.5967 | 2.9519 | 1.1791 | | dcgan | 32 | 0.9809 | 0.9242 | 1.6465 | 0.7051 | 2.8585 | 1.0417 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.996 | 0.9801 | 1.4697 | 0.0 | 2.6951 | 1.5769 | | mnasnet1_0 | 32 | 0.9979 | 1.0211 | 1.2623 | 0.7689 | 2.6215 | 1.3024 | | squeezenet1_1 | 32 | 0.9951 | 0.9647 | 1.6465 | 0.7213 | 2.455 | 1.2919 | | hf_Albert | 8 | 1.0019 | 0.9563 | 0.7736 | 0.0 | 2.387 | 2.322 | | timm_efficientnet | 32 | 0.9605 | 0.8061 | 1.0713 | 0.6806 | 2.1054 | 1.2908 | | lennard_jones | 1000 | 0.9651 | 0.778 | 1.2778 | 0.5616 | 2.1029 | 1.0539 | | hf_Bert | 4 | 1.0337 | 0.8611 | 0.9275 | 0.0 | 2.0869 | 1.8233 | | pytorch_struct | 200 | 0.9907 | 0.7536 | 1.0434 | 0.6022 | 2.0869 | 1.2732 | | resnet152 | 32 | 1.0003 | 0.9958 | 1.3204 | 0.0 | 2.0263 | 1.2961 | | hf_GPT2 | 4 | 1.022 | 0.9873 | 0.8301 | 0.2889 | 1.9444 | 1.9065 | | timm_resnest | 32 | 
1.0044 | 1.0228 | 0.8351 | 0.9631 | 1.9129 | 1.687 | | hf_T5 | 8 | 1.0002 | 0.9248 | 0.0 | 1.3535 | 1.8798 | 1.8883 | | LearningToPaint | 96 | 1.0007 | 1.0118 | 1.148 | 0.8382 | 1.8693 | 1.3046 | | hf_Bart | 4 | 1.0122 | 0.8315 | 0.8979 | 0.0 | 1.7973 | 1.7345 | | resnet50 | 32 | 1.0006 | 1.0182 | 1.0442 | 0.8065 | 1.772 | 1.3493 | | soft_actor_critic | 256 | 0.9935 | 0.7689 | 1.3689 | 0.5396 | 1.737 | 1.0492 | | shufflenet_v2_x1_0 | 128 | 0.9997 | 1.0185 | 0.9595 | 0.8612 | 1.7082 | 1.439 | | attention_is_all_you_need_pytorch | 256 | 1.008 | 0.9106 | 0.8423 | 0.0 | 1.5935 | 1.5143 | | mobilenet_v2 | 96 | 1.0002 | 0.9893 | 0.7601 | 1.0166 | 1.5598 | 1.5165 | | fastNLP_Bert | 6 | 0.9982 | 0.9011 | 0.7709 | 0.0 | 1.5573 | 1.4756 | | speech_transformer | 32 | 1.0021 | 0.8264 | 1.7921 | 0.6603 | 1.5366 | 1.5537 | | hf_DistilBert | 8 | 1.0006 | 0.971 | 0.7431 | 0.3658 | 1.517 | 1.4911 | | timm_nfnet | 128 | 0.9996 | 0.9993 | 0.879 | 0.9194 | 1.5008 | 1.4297 | | pytorch_stargan | 16 | 0.9978 | 1.1003 | 1.0442 | 0.0 | 1.4482 | 1.3983 | | pytorch_unet | 1 | 0.9996 | 0.2118 | 0.0 | 0.0 | 1.3999 | 1.3653 | | timm_regnet | 32 | 0.9781 | 0.9391 | 0.8889 | 0.7832 | 1.3265 | 1.2351 | | timm_vovnet | 32 | 0.9224 | 0.8872 | 0.8517 | 0.7979 | 1.2979 | 1.1442 | | vgg16 | 64 | 0.9997 | 0.9967 | 0.8564 | 0.973 | 1.282 | 1.2649 | | Super_SloMo | 6 | 0.9989 | 0.177 | 0.0 | 0.0 | 1.2387 | 1.2016 | | alexnet | 128 | 0.9985 | 0.9975 | 0.814 | 0.9278 | 1.2114 | 1.2085 | | hf_Reformer | 4 | 0.9977 | 1.0021 | 0.9922 | 0.6472 | 1.1792 | 1.1765 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9902 | 0.0 | 0.0 | 1.1532 | 1.1321 | | Background_Matting | 4 | 1.0 | 0.1454 | 0.0 | 0.0 | 1.1406 | 1.1249 | | yolov3 | 16 | 0.9999 | 0.9905 | 0.8059 | 0.0 | 1.0885 | 1.0651 | | tts_angular | 64 | 0.9492 | 0.9452 | 0.9906 | 0.968 | 1.037 | 1.0098 | | demucs | 4 | 1.0005 | 0.9996 | 1.0005 | 1.0016 | 0.9976 | 0.9996 | | nvidia_deeprecommender | 256 | 0.9988 | 0.9962 | 0.6967 | 1.0076 | 0.9891 | 1.0305 | | 
hf_BigBird | 2 | 0.9756 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_GPT2_large | 4 | 1.0025 | 0.9904 | 0.0 | 0.0 | 0.0 | 1.8602 | | dlrm | 2048 | 1.0306 | 1.1429 | 0.0 | 1.0491 | 0.0 | 0.0 | | tacotron2 | 64 | 0.9856 | 0.7648 | 1.0072 | 0.6039 | 0.0 | 0.8863 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | 
pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | 0.0000 | fail_to_run | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass 
| | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_accuracy | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 3.0685 | 8.4964 | 11.9317 | nan | 409.4508 | 402.7862 | | timm_efficientdet | 1 | 20.3891 | 37.6489 | 76.9929 | nan | 144.3823 | 142.0056 | | hf_T5_large | 2 | 14.7449 | 40.2166 | nan | nan | 128.7946 | 123.3967 | | timm_vision_transformer_large | 8 | 3.0418 | 15.434 | nan | nan | 74.0612 | 71.8288 | | resnet152 | 32 | 2.7221 | 13.547 | 23.0998 | nan | 52.5548 | 
50.2677 | | densenet121 | 4 | 2.4509 | 11.8647 | 19.7536 | 240.3714 | 51.5753 | 50.9066 | | timm_resnest | 32 | 0.6412 | 2.5213 | 3.7813 | 67.0877 | 38.4295 | 38.158 | | attention_is_all_you_need_pytorch | 256 | 1.4375 | 7.1616 | 11.3428 | nan | 35.7883 | 34.6901 | | timm_vision_transformer | 8 | 1.0234 | 4.5556 | 7.0696 | 86.2447 | 35.2121 | 35.0009 | | BERT_pytorch | 16 | 1.8215 | 7.6397 | 11.4304 | 107.0575 | 33.2883 | 32.7388 | | hf_Bart | 4 | 2.0476 | 8.8781 | 13.8902 | nan | 33.0667 | 32.2375 | | timm_nfnet | 128 | 2.2209 | 7.2207 | 10.5981 | 163.643 | 32.3889 | 31.8998 | | fastNLP_Bert | 6 | 1.9027 | 7.1258 | 11.9471 | nan | 30.4157 | 27.1976 | | hf_T5 | 8 | 2.6587 | 9.0798 | nan | 94.4606 | 30.1767 | 28.9232 | | timm_regnet | 32 | 2.4938 | 7.9987 | 19.661 | 144.509 | 28.6543 | 28.0101 | | pytorch_stargan | 16 | 0.4611 | 1.9831 | 2.9243 | nan | 27.9265 | 27.9611 | | speech_transformer | 32 | 2.0364 | 8.7525 | 36.6519 | 189.5573 | 27.1388 | 26.0481 | | timm_efficientnet | 32 | 1.9871 | 6.7442 | 15.6147 | 148.6047 | 27.1069 | 26.6897 | | mobilenet_v3_large | 32 | 1.0488 | 4.8447 | 7.2447 | 120.9114 | 26.3332 | 25.4009 | | pytorch_struct | 200 | 0.2867 | 0.8764 | 1.6661 | 7.0818 | 23.6313 | 19.4693 | | functorch_dp_cifar10 | 64 | 0.348 | 1.3957 | 2.0848 | nan | 23.0406 | 22.5397 | | Super_SloMo | 6 | 1.1628 | 7.2716 | nan | nan | 21.7568 | 20.7208 | | hf_Bert | 4 | 1.8935 | 7.0467 | 10.1784 | nan | 21.3414 | 21.109 | | mnasnet1_0 | 32 | 0.9866 | 4.4063 | 6.431 | 87.7778 | 21.3158 | 21.4666 | | resnet50 | 32 | 0.9787 | 4.597 | 6.8623 | 99.7438 | 20.682 | 19.8614 | | shufflenet_v2_x1_0 | 128 | 1.1895 | 5.0476 | 7.51 | 102.2219 | 20.5539 | 19.9739 | | Background_Matting | 4 | 1.024 | 8.7125 | nan | nan | 20.5193 | 19.355 | | hf_Albert | 8 | 1.6313 | 6.5834 | 10.1392 | nan | 20.5161 | 19.7656 | | resnext50_32x4d | 8 | 1.0251 | 4.5222 | 6.8085 | 84.5711 | 20.4493 | 19.9728 | | hf_GPT2 | 4 | 1.7149 | 6.56 | 9.6719 | 84.3361 | 20.3476 | 19.8511 | | mobilenet_v2 | 96 | 
1.0359 | 4.5445 | 6.9645 | 115.099 | 20.0205 | 19.0979 | | timm_vovnet | 32 | 1.606 | 4.3947 | 9.8423 | 70.8863 | 19.7495 | 19.5458 | | hf_Reformer | 4 | 1.6824 | 3.1205 | 5.3485 | 16.9324 | 18.9374 | 16.2977 | | hf_DistilBert | 8 | 0.8595 | 3.6247 | 6.1153 | 56.1085 | 13.9883 | 13.5692 | | resnet18 | 16 | 0.476 | 1.8069 | 2.711 | 37.8496 | 11.7165 | 11.1859 | | dcgan | 32 | 0.1817 | 0.4432 | 0.6606 | 5.0595 | 10.6237 | 9.8456 | | pytorch_unet | 1 | 0.5189 | 3.0207 | nan | nan | 9.8214 | 9.8059 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4739 | 1.9938 | 2.8034 | nan | 9.16 | 8.912 | | LearningToPaint | 96 | 0.5355 | 1.9213 | 2.8493 | 46.513 | 8.2851 | 7.8636 | | squeezenet1_1 | 32 | 0.2678 | 0.9312 | 1.3928 | 6.6845 | 4.8307 | 4.5792 | | vgg16 | 64 | 0.1988 | 0.6661 | 1.1411 | 4.8612 | 4.2962 | 3.9838 | | drq | 1 | 0.3215 | 0.6474 | 0.932 | 6.1606 | 4.108 | 3.7124 | | nvidia_deeprecommender | 256 | 0.2214 | 0.5168 | 0.8184 | 5.7613 | 3.6703 | 3.4022 | | soft_actor_critic | 256 | 0.21 | 0.3548 | 0.5428 | 3.2611 | 3.5879 | 3.0247 | | alexnet | 128 | 0.1704 | 0.445 | 0.7412 | 4.8164 | 3.4945 | 3.2228 | | lennard_jones | 1000 | 0.1595 | 0.3491 | 0.5406 | 3.0125 | 2.2011 | 2.0388 | | tts_angular | 64 | 0.1982 | 0.2509 | 0.366 | 1.4858 | 1.9465 | 1.7318 | | demucs | 4 | 0.3535 | 0.3605 | 0.3569 | 0.3723 | 0.2643 | 0.2695 | | hf_GPT2_large | 4 | 6.2111 | 19.8804 | nan | nan | nan | 54.5622 | | tacotron2 | 64 | 5.6086 | 18.9452 | 34.2264 | 88.1117 | nan | 45.9464 | | dlrm | 2048 | 0.4702 | 0.8449 | nan | 4.4375 | nan | nan | | hf_BigBird | 2 | 3.9766 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2717 | 0.4638 | 1.2042 | 1.2318 | | hf_Albert | 8 | 1.0001 | 0.936 | 0.3268 | nan | 1.1576 | 1.4693 | | speech_transformer | 32 | 0.9942 | 0.9812 | 0.3344 | 1.1938 | 1.0935 | 1.0972 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 1.0606 | 1.1512 | | timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3557 | 0.4816 | 1.0334 | 1.1302 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3514 | nan | 1.025 | 1.1759 | | timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3085 | nan | 0.9991 | 1.0312 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | BERT_pytorch | 16 | 1.0003 | 0.8825 | 0.3998 | 1.1121 | 0.9743 | 1.1227 | | hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.38 | 1.1204 | 0.9649 | 1.1241 | | Super_SloMo | 6 | 1.0024 | 0.8284 | nan | nan | 0.9647 | 1.2945 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8754 | 0.4228 | nan | 0.9506 | 1.0224 | | timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.8027 | 0.9347 | 1.0307 | | hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.0304 | 0.9309 | 1.252 | | resnet152 | 32 | 0.9937 | 0.8956 | 0.3632 | nan | 0.9128 | 0.9398 | | pytorch_unet | 1 | 0.9968 | 0.7229 | nan | nan | 0.9113 | 1.0853 | | yolov3 | 16 | 0.9908 | 0.8381 | 0.3536 | nan | 0.9063 | 1.0466 | | timm_vision_transformer_large | 8 | 0.9974 | 0.8358 | nan | nan | 0.879 | 0.9541 | | timm_resnest | 32 | 0.9868 | 0.8711 | 0.348 | 0.8451 | 0.8759 | 0.9953 | | densenet121 | 4 | 0.9857 | 0.8678 | 0.3673 | 0.8452 | 0.8753 | 1.0051 | | squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 
0.8714 | 0.8735 | 1.0608 | | hf_Bert | 4 | 1.0 | 0.8759 | 0.3903 | nan | 0.8735 | 0.942 | | shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8692 | 0.9802 | | resnet50 | 32 | 0.9907 | 0.8629 | 0.3558 | 0.7806 | 0.8658 | 0.885 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8657 | 1.0681 | | Background_Matting | 4 | 1.0138 | 0.6522 | nan | nan | 0.8561 | 1.0426 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 | | hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3414 | 1.0708 | 0.8384 | 0.9049 | | hf_Bart | 4 | 1.0002 | 0.8307 | 0.3635 | nan | 0.8232 | 1.0097 | | alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.775 | 0.7973 | 1.0079 | | mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3447 | 0.7921 | 0.791 | 0.8143 | | timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3408 | 0.7755 | 0.7799 | 0.8875 | | pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4252 | nan | 0.7783 | 0.8847 | | resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.389 | 0.81 | 0.7644 | 0.7753 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.3776 | 0.734 | 0.7633 | 1.0588 | | mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3408 | 0.8226 | 0.7541 | 0.7741 | | drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 | | LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3826 | 0.6701 | 0.7295 | 0.925 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3915 | 1.0881 | 0.7151 | 0.7249 | | resnet18 | 16 | 0.9779 | 0.7727 | 0.3943 | 0.7314 | 0.6102 | 0.6257 | | hf_Reformer | 4 | 0.9996 | 0.9996 | 0.6037 | 0.9999 | 0.5851 | 1.0017 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | 0.4478 | 0.4688 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 | | dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.7838 | 0.2123 | 0.2137 | | hf_GPT2_large | 4 | 0.9956 | 
0.8732 | nan | nan | nan | 1.1499 | | tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3993 | nan | 0.4112 | | dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | nan | nan | | hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | timm_vision_transformer_large | 8 | 184.2114 | 186.0847 | nan | nan | 159.759 | 162.3859 | | Background_Matting | 4 | 133.3824 | 918.1168 | nan | nan | 117.2172 | 118.5089 | | hf_T5 | 8 | 174.0705 | 188.4067 | nan | 128.8906 | 92.6268 | 92.323 | | hf_T5_large | 2 | 215.0241 | 263.8511 | nan | nan | 88.4799 | 110.9625 | | timm_nfnet | 128 | 131.9367 | 131.0709 | 149.023 | 142.4796 | 87.2242 | 91.6452 | | hf_Reformer | 4 | 82.2001 | 82.8921 | 82.8427 | 127.0545 | 69.8827 | 69.749 | | Super_SloMo | 6 | 79.2526 | 448.969 | nan | nan | 64.5284 | 66.019 | | yolov3 | 16 | 68.5032 | 69.1748 | 85.2157 | nan | 62.9122 | 64.2597 | | demucs | 4 | 57.8479 | 57.2677 | 57.238 | 56.8078 | 56.9312 | 57.0243 | | timm_regnet | 32 | 73.2817 | 75.9854 | 80.565 | 92.6238 | 54.8144 | 62.2873 | | vgg16 | 64 | 66.6272 | 66.5717 | 77.1515 | 67.8544 | 52.0254 | 52.3651 | | resnet152 | 32 | 90.1157 | 92.6682 | 76.3196 | nan | 45.7902 | 71.8312 | | speech_transformer | 32 | 62.0333 | 72.1109 | 36.9661 | 106.7291 | 40.9786 | 40.851 | | fastNLP_Bert | 6 | 55.6207 | 61.5234 | 73.5695 | nan | 37.572 | 37.776 | | 
timm_efficientdet | 1 | 167.2292 | 198.9665 | 76.1624 | nan | 35.7143 | 109.2127 | | attention_is_all_you_need_pytorch | 256 | 53.2757 | 59.3003 | 63.1863 | nan | 33.379 | 35.0189 | | hf_Bart | 4 | 55.8657 | 68.7654 | 65.2176 | nan | 33.1078 | 34.8957 | | mobilenet_v2 | 96 | 48.7713 | 49.3075 | 64.3179 | 48.1231 | 31.3723 | 32.2507 | | hf_Albert | 8 | 68.1224 | 71.5049 | 88.2802 | nan | 28.6997 | 29.3614 | | pytorch_unet | 1 | 39.9287 | 188.4139 | nan | nan | 28.5166 | 29.2065 | | hf_GPT2 | 4 | 48.0558 | 50.272 | 60.0582 | 171.8425 | 25.3572 | 25.914 | | timm_vovnet | 32 | 34.768 | 35.8253 | 37.1346 | 40.8744 | 24.7218 | 28.4415 | | shufflenet_v2_x1_0 | 128 | 43.3934 | 39.9893 | 41.5253 | 50.1843 | 24.3148 | 28.8237 | | timm_efficientnet | 32 | 48.735 | 57.5763 | 43.1506 | 69.197 | 22.3918 | 37.4258 | | hf_Bert | 4 | 40.8462 | 48.1595 | 44.0785 | nan | 21.0172 | 24.3149 | | hf_DistilBert | 8 | 31.0986 | 32.1472 | 42.0053 | 85.0114 | 20.5113 | 20.8441 | | resnet50 | 32 | 33.0504 | 33.3632 | 32.2283 | 41.7405 | 19.3771 | 25.6509 | | BERT_pytorch | 16 | 53.4911 | 64.8912 | 34.9376 | 71.602 | 16.3485 | 24.3391 | | timm_resnest | 32 | 24.2716 | 23.9449 | 29.356 | 25.5003 | 12.8416 | 14.8268 | | mobilenet_v3_large | 32 | 36.8668 | 35.5899 | 23.9181 | 47.4652 | 11.9936 | 25.6414 | | mnasnet1_0 | 32 | 31.1862 | 29.2924 | 22.9533 | 39.1588 | 11.507 | 24.3416 | | densenet121 | 4 | 72.5515 | 79.8214 | 30.2885 | 103.1966 | 11.4138 | 58.5567 | | pytorch_stargan | 16 | 15.9101 | 14.5034 | 15.5135 | nan | 10.9274 | 11.398 | | nvidia_deeprecommender | 256 | 10.3759 | 10.398 | 14.8871 | 10.3084 | 10.4685 | 10.0597 | | timm_vision_transformer | 8 | 28.8374 | 34.4925 | 16.7824 | 49.7815 | 9.4632 | 19.8347 | | resnext50_32x4d | 8 | 28.982 | 30.3759 | 15.4783 | 40.3374 | 8.4997 | 24.084 | | LearningToPaint | 96 | 15.7477 | 15.9035 | 12.7326 | 17.9568 | 8.1222 | 11.8049 | | alexnet | 128 | 9.7877 | 9.8137 | 12.0157 | 10.6268 | 8.1126 | 8.1203 | | pytorch_CycleGAN_and_pix2pix | 1 | 
17.6916 | 18.2184 | 11.6899 | nan | 6.6798 | 11.6526 | | tts_angular | 64 | 6.9358 | 7.2195 | 6.6263 | 6.5118 | 6.3103 | 6.5059 | | squeezenet1_1 | 32 | 14.6371 | 15.3096 | 10.1886 | 20.9072 | 6.2145 | 11.6627 | | resnet18 | 16 | 13.1564 | 12.941 | 8.4177 | 16.4799 | 4.7932 | 10.5696 | | functorch_dp_cifar10 | 64 | 13.7511 | 14.7874 | 5.7112 | nan | 2.9579 | 11.1526 | | pytorch_struct | 200 | 4.5243 | 6.2355 | 4.6431 | 7.6205 | 2.2599 | 3.6844 | | drq | 1 | 3.9648 | 4.765 | 2.0543 | 6.5735 | 1.3289 | 3.4736 | | dcgan | 32 | 3.1442 | 3.4359 | 1.8529 | 4.4941 | 1.1222 | 2.9261 | | soft_actor_critic | 256 | 1.418 | 1.9735 | 1.0737 | 2.806 | 0.9805 | 1.4545 | | lennard_jones | 1000 | 1.4572 | 1.9047 | 1.1911 | 4.2284 | 0.7404 | 1.4856 | | tacotron2 | 64 | 3044.8594 | 3960.0475 | 3026.1666 | 5016.1422 | nan | 3471.0613 | | hf_GPT2_large | 4 | 210.3664 | 211.716 | nan | nan | nan | 112.8747 | | dlrm | 2048 | 521.2748 | 460.7541 | nan | 503.3826 | nan | nan | | hf_BigBird | 2 | 192.9789 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | YituTechConvBert | 1 | 1.0197 | 0.842 | 2.4001 | 0.0 | 5.4568 | 1.6357 | | MobileBertForQuestionAnswering | 64 | 1.0176 | 0.8313 | 1.5753 | 0.0 | 4.5264 | 1.8146 | | MobileBertForMaskedLM | 32 | 1.0201 | 0.8255 | 1.9299 | 0.0 | 4.4378 | 1.8148 | | CamemBert | 1 | 1.0418 | 0.868 | 1.823 | 0.0 | 3.6545 | 1.8074 | | MT5ForConditionalGeneration | 8 | 1.0178 | 0.863 | 1.5472 | 0.8654 | 3.649 | 2.4785 | | DistillGPT2 | 1 | 1.0388 | 0.8853 | 1.2748 | 0.0 | 2.7317 | 2.0479 | | GPT2ForSequenceClassification | 4 | 1.0004 | 0.9769 | 0.0 | 0.4987 | 2.32 | 2.2761 | | M2M100ForConditionalGeneration | 8 | 1.0375 | 0.9204 | 1.2396 | 0.7821 | 2.302 | 1.8802 | | ElectraForQuestionAnswering | 64 | 1.0004 | 0.9788 | 0.7697 | 0.0 | 2.1281 | 2.0672 | | MegatronBertForQuestionAnswering | 16 | 1.0324 | 0.8731 | 1.1585 | 0.0 | 2.0336 | 1.7982 | | PLBartForConditionalGeneration | 16 | 1.0087 | 0.833 | 1.0513 | 0.0 | 1.9874 | 1.7761 | | PegasusForConditionalGeneration | 16 | 1.0102 | 0.8252 | 0.9263 | 0.6164 | 1.97 | 1.5653 | | MegatronBertForCausalLM | 16 | 1.0333 | 0.8611 | 0.9854 | 0.0 | 1.9007 | 1.774 | | T5Small | 1 | 1.0281 | 0.9032 | 1.1709 | 0.8416 | 1.8916 | 1.4173 | | ElectraForCausalLM | 32 | 1.0002 | 0.941 | 0.7186 | 0.0 | 1.833 | 1.8295 | | LayoutLMForSequenceClassification | 16 | 1.0005 | 0.9808 | 0.7748 | 0.0 | 1.8282 | 1.8073 | | XGLMForCausalLM | 8 | 1.0142 | 0.8294 | 0.9389 | 0.0 | 1.782 | 1.6191 | | MBartForConditionalGeneration | 16 | 1.0145 | 0.8471 | 0.9358 | 0.635 | 1.7234 | 1.5944 | | AlbertForQuestionAnswering | 4 | 1.0002 | 0.8859 | 0.0 | 0.0 | 1.667 | 1.6595 
| | LayoutLMForMaskedLM | 16 | 1.0007 | 0.9711 | 0.7566 | 0.0 | 1.6529 | 1.6337 | | AlbertForMaskedLM | 4 | 1.0002 | 0.8851 | 0.0 | 0.0 | 1.6437 | 1.6479 | | T5ForConditionalGeneration | 4 | 1.0035 | 0.9002 | 0.7579 | 1.159 | 1.6096 | 1.5914 | | Speech2Text2ForCausalLM | 128 | 0.9992 | 0.9359 | 0.7257 | 0.8065 | 1.5866 | 1.5903 | | OPTForCausalLM | 32 | 1.0064 | 0.9302 | 0.7789 | 0.3318 | 1.5689 | 1.5569 | | BertForQuestionAnswering | 128 | 1.0 | 0.9831 | 0.7783 | 0.0 | 1.4967 | 1.4728 | | DistilBertForQuestionAnswering | 64 | 1.0002 | 0.9677 | 0.7421 | 0.3404 | 1.4927 | 1.4472 | | RobertaForQuestionAnswering | 128 | 1.0001 | 0.9844 | 0.7796 | 0.0 | 1.4906 | 1.4796 | | BartForConditionalGeneration | 2 | 1.0042 | 0.9687 | 0.0 | 0.0 | 1.461 | 1.4308 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0081 | 0.8877 | 0.7361 | 0.0 | 1.4461 | 1.4415 | | BartForCausalLM | 4 | 1.0003 | 0.9698 | 0.7581 | 0.0 | 1.4452 | 1.4442 | | RobertaForCausalLM | 64 | 1.0006 | 0.9592 | 0.7536 | 0.0 | 1.4407 | 1.43 | | BertForMaskedLM | 64 | 1.0008 | 0.9567 | 0.7407 | 0.0 | 1.3498 | 1.338 | | DebertaForMaskedLM | 4 | 0.917 | 0.7449 | 0.7917 | 0.0 | 1.3029 | 1.1371 | | PLBartForCausalLM | 32 | 1.0091 | 0.9403 | 0.8014 | 0.8426 | 1.2938 | 1.291 | | BlenderbotSmallForCausalLM | 64 | 1.002 | 0.9288 | 0.717 | 0.0 | 1.2689 | 1.2836 | | DistilBertForMaskedLM | 64 | 1.0005 | 0.9511 | 0.7099 | 0.4276 | 1.2624 | 1.2651 | | TrOCRForCausalLM | 32 | 1.0018 | 0.9482 | 0.7581 | 0.0 | 1.257 | 1.1926 | | MBartForCausalLM | 32 | 1.0016 | 0.9403 | 0.7581 | 0.8541 | 1.2045 | 1.1999 | | PegasusForCausalLM | 32 | 0.9998 | 0.9511 | 0.7506 | 0.856 | 1.1881 | 1.1971 | | DebertaForQuestionAnswering | 8 | 0.9909 | 0.7704 | 0.7237 | 0.0 | 1.1472 | 1.2154 | | BigBird | 1 | 0.9772 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass | | YituTechConvBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | 
fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ ~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 4.9585 | 11.4091 | 37.3542 | nan | 99.9838 | 40.0945 |
| DebertaForMaskedLM | 4 | 4.9712 | 11.2798 | 37.9751 | nan | 98.6562 | 39.1755 |
| XGLMForCausalLM | 8 | 3.1546 | 13.4625 | 28.4342 | nan | 78.4341 | 75.9032 |
| MobileBertForMaskedLM | 32 | 10.1369 | 32.6243 | 61.3695 | nan | 72.3937 | 70.5013 |
| MobileBertForQuestionAnswering | 64 | 10.1873 | 32.8743 | 59.0247 | nan | 71.791 | 70.8518 |
| M2M100ForConditionalGeneration | 8 | 4.0808 | 16.1999 | 28.6255 | 310.2042 | 70.2783 | 61.9564 |
| PegasusForConditionalGeneration | 16 | 3.7957 | 16.9744 | 27.8716 | 370.018 | 56.5913 | 52.6172 |
| MBartForConditionalGeneration | 16 | 3.9552 | 17.5144 | 29.4495 | 407.9875 | 56.237 | 54.4364 |
| BartForConditionalGeneration | 2 | 3.8799 | 17.183 | nan | nan | 54.4241 | 53.7068 |
| YituTechConvBert | 1 | 2.8786 | 10.9668 | 16.7336 | nan | 51.0179 | 48.8929 |
| MegatronBertForCausalLM | 16 | 3.9585 | 14.6148 | 22.7757 | nan | 44.1343 | 43.1976 |
| MegatronBertForQuestionAnswering | 16 | 4.1441 | 14.9002 | 23.3754 | nan | 42.782 | 42.1179 |
| MT5ForConditionalGeneration | 8 | 4.0606 | 13.0367 | 21.5418 | 154.4871 | 39.369 | 38.0927 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.5018 | 11.4113 | 18.4481 | nan | 37.1132 | 35.7655 |
| T5Small | 1 | 2.6407 | 8.8132 | 13.4137 | 98.4442 | 32.4872 | 31.1805 |
| T5ForConditionalGeneration | 4 | 2.6464 | 8.8966 | 13.5949 | 96.2244 | 31.4137 | 30.1122 |
| PLBartForConditionalGeneration | 16 | 2.1042 | 8.9714 | 13.6651 | nan | 31.3048 | 30.537 |
| LayoutLMForSequenceClassification | 16 | 2.3199 | 7.7231 | 11.5171 | nan | 28.6872 | 28.2579 |
| ElectraForCausalLM | 32 | 2.0157 | 7.2714 | 11.5953 | nan | 28.0226 | 26.2221 |
| PegasusForCausalLM | 32 | 1.5465 | 6.6371 | 10.5308 | 109.6778 | 25.912 | 24.8945 |
| MBartForCausalLM | 32 | 1.4831 | 6.5671 | 10.0414 | 114.0127 | 24.4232 | 23.848 |
| TrOCRForCausalLM | 32 | 1.5075 | 6.5008 | 9.7879 | nan | 24.3083 | 22.676 |
| OPTForCausalLM | 32 | 1.5456 | 6.6795 | 11.6407 | 104.5929 | 23.569 | 22.4384 |
| LayoutLMForMaskedLM | 16 | 2.2935 | 7.6261 | 11.5225 | nan | 23.4003 | 22.7657 |
| RobertaForCausalLM | 64 | 1.8999 | 7.2838 | 10.7236 | nan | 23.2878 | 22.1357 |
| BartForCausalLM | 4 | 1.5488 | 6.7712 | 9.9351 | nan | 23.2727 | 22.4386 |
| BertForMaskedLM | 64 | 1.8474 | 7.4501 | 10.5083 | nan | 22.8736 | 21.9793 |
| ElectraForQuestionAnswering | 64 | 1.9293 | 7.2696 | 10.8111 | nan | 22.2883 | 21.4111 |
| BertForQuestionAnswering | 128 | 1.9365 | 7.0691 | 10.5371 | nan | 21.7074 | 21.1842 |
| RobertaForQuestionAnswering | 128 | 1.9526 | 7.0643 | 10.7115 | nan | 20.8503 | 20.263 |
| CamemBert | 1 | 1.9479 | 7.0513 | 10.1827 | nan | 20.5315 | 20.2418 |
| AlbertForMaskedLM | 4 | 1.6154 | 6.6714 | nan | nan | 20.0449 | 18.882 |
| GPT2ForSequenceClassification | 4 | 1.7548 | 6.6675 | nan | 88.2821 | 19.9714 | 19.2929 |
| AlbertForQuestionAnswering | 4 | 1.8166 | 6.8368 | nan | nan | 19.4582 | 18.3472 |
| BlenderbotSmallForCausalLM | 64 | 1.0381 | 4.5717 | 6.5145 | nan | 17.2682 | 17.002 |
| Speech2Text2ForCausalLM | 128 | 0.8938 | 3.4583 | 5.7457 | 50.773 | 16.4178 | 14.7939 |
| PLBartForCausalLM | 32 | 0.8598 | 3.6024 | 5.0834 | 67.0067 | 15.3821 | 15.1554 |
| DistillGPT2 | 1 | 0.9375 | 3.4179 | 4.8496 | nan | 13.7631 | 13.5574 |
| DistilBertForMaskedLM | 64 | 0.7991 | 3.6217 | 5.9745 | 59.0386 | 13.1817 | 12.8907 |
| DistilBertForQuestionAnswering | 64 | 0.8553 | 3.4743 | 6.2487 | 56.2889 | 12.5466 | 12.1871 |
| BigBird | 1 | 4.0116 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0992 | 1.5169 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | nan | 1.2229 | 1.0775 | 1.1712 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0568 | 1.1144 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9837 | 1.1976 |
| PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | 1.1296 | 0.9708 | 1.0342 |
| T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9662 | 1.1856 |
| T5Small | 1 | 1.0 | 0.8935 | 0.3618 | 0.9978 | 0.965 | 1.1391 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9593 | 1.1105 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9481 | 0.9848 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | 0.3996 | 1.1057 | 0.9417 | 1.0114 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3787 | nan | 0.9293 | 0.9793 |
| RobertaForCausalLM | 64 | 1.0 | 0.8994 | 0.3787 | nan | 0.9289 | 0.9789 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.347 | 1.1114 | 0.9267 | 1.0655 |
| OPTForCausalLM | 32 | 1.0003 | 0.8678 | 0.3726 | 1.0333 | 0.9251 | 1.0063 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | 0.4002 | 0.9984 | 0.9218 | 1.0986 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | 0.3997 | nan | 0.921 | 0.9877 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.4377 | 1.1462 | 0.9159 | 1.0993 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0179 |
| MegatronBertForCausalLM | 16 | 1.0001 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0276 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | 0.4146 | nan | 0.8843 | 1.0284 |
| DistilBertForMaskedLM | 64 | 0.9999 | 0.8599 | 0.3635 | 1.0791 | 0.8803 | 0.948 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | 0.4067 | 0.919 | 0.8751 | 0.919 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.3532 | 1.0386 | 0.8691 | 0.9801 |
| ElectraForCausalLM | 32 | 0.9977 | 0.848 | 0.3928 | nan | 0.856 | 0.9327 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3978 | 0.9947 | 0.8549 | 0.9361 |
| BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.846 | 0.9426 |
| CamemBert | 1 | 0.999 | 0.8143 | 0.416 | nan | 0.8061 | 0.9309 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9164 | 0.4336 | nan | 0.8055 | 0.9905 |
| DistillGPT2 | 1 | 0.9975 | 0.8033 | 0.402 | nan | 0.8046 | 1.024 |
| YituTechConvBert | 1 | 0.9718 | 0.868 | 0.4315 | nan | 0.791 | 0.9314 |
| M2M100ForConditionalGeneration | 8 | 0.9967 | 0.9511 | 0.4275 | 1.0565 | 0.752 | 1.0322 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | 0.3466 | nan | 0.6698 | 0.9454 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | 0.3107 | nan | 0.6085 | 0.8221 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9825 | 0.3624 | nan | 0.409 | 1.0674 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3252 | nan | 0.3071 | 1.1616 |
| BigBird | 1 | 0.974 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForMaskedLM | 4 | 266.6604 | 301.3427 | nan | nan | 162.9413 | 162.059 |
| AlbertForQuestionAnswering | 4 | 264.4808 | 298.8341 | nan | nan | 159.1088 | 159.6977 |
| BartForConditionalGeneration | 2 | 135.271 | 141.3413 | nan | nan | 93.3207 | 95.6594 |
| BartForCausalLM | 4 | 112.015 | 115.6561 | 148.1633 | nan | 77.6567 | 77.5993 |
| BlenderbotSmallForConditionalGeneration | 64 | 109.2552 | 126.8971 | 150.4612 | nan | 76.8695 | 76.6595 |
| RobertaForQuestionAnswering | 128 | 111.2768 | 112.8522 | 142.6823 | nan | 74.5204 | 75.1308 |
| BertForQuestionAnswering | 128 | 110.3618 | 112.5407 | 142.1248 | nan | 74.0406 | 74.9912 |
| LayoutLMForMaskedLM | 16 | 112.0183 | 115.4454 | 148.1512 | nan | 67.7407 | 68.5935 |
| DebertaForQuestionAnswering | 8 | 75.7588 | 97.376 | 104.0467 | nan | 65.7431 | 61.8012 |
| MBartForConditionalGeneration | 16 | 101.9575 | 124.7547 | 114.2629 | 168.824 | 64.6872 | 68.4081 |
| PegasusForConditionalGeneration | 16 | 101.8724 | 148.7831 | 113.2127 | 170.6775 | 64.636 | 70.6954 |
| T5ForConditionalGeneration | 4 | 101.3426 | 113.0934 | 134.6547 | 87.2756 | 62.8467 | 63.8368 |
| PegasusForCausalLM | 32 | 68.8716 | 72.381 | 91.6837 | 81.1486 | 58.6214 | 58.8773 |
| TrOCRForCausalLM | 32 | 69.867 | 73.4382 | 92.2609 | nan | 58.4622 | 58.4164 |
| MBartForCausalLM | 32 | 69.6871 | 74.4545 | 92.2879 | 81.5412 | 58.2847 | 58.3666 |
| BertForMaskedLM | 64 | 75.449 | 79.0134 | 102.0566 | nan | 56.0654 | 56.3933 |
| RobertaForCausalLM | 64 | 80.5091 | 83.7896 | 106.9013 | nan | 55.9586 | 56.1973 |
| ElectraForQuestionAnswering | 64 | 114.4074 | 117.1781 | 150.2274 | nan | 53.8591 | 55.3516 |
| LayoutLMForSequenceClassification | 16 | 97.0086 | 99.0108 | 125.4142 | nan | 53.2153 | 53.7945 |
| XGLMForCausalLM | 8 | 86.0073 | 106.6089 | 94.051 | nan | 51.4787 | 57.8038 |
| MobileBertForQuestionAnswering | 64 | 178.1963 | 268.3399 | 122.9527 | nan | 50.5206 | 103.7317 |
| M2M100ForConditionalGeneration | 8 | 103.3453 | 128.7306 | 88.8353 | 161.3089 | 49.6261 | 60.6081 |
| DebertaForMaskedLM | 4 | 66.8287 | 100.0106 | 79.4089 | nan | 49.4011 | 55.7076 |
| ElectraForCausalLM | 32 | 87.487 | 92.6804 | 123.1597 | nan | 47.6311 | 47.6928 |
| BlenderbotSmallForCausalLM | 64 | 58.5713 | 63.8285 | 81.5128 | nan | 46.3904 | 45.9091 |
| MegatronBertForCausalLM | 16 | 78.2756 | 95.2561 | 83.8187 | nan | 45.7131 | 47.9207 |
| MegatronBertForQuestionAnswering | 16 | 80.5915 | 95.2697 | 77.7844 | nan | 42.0621 | 47.4418 |
| MobileBertForMaskedLM | 32 | 177.8445 | 211.778 | 108.5119 | nan | 41.1744 | 101.6235 |
| GPT2ForSequenceClassification | 4 | 90.4981 | 94.385 | nan | 182.2956 | 39.0829 | 39.7366 |
| T5Small | 1 | 63.4815 | 72.1299 | 53.0899 | 74.5183 | 38.4374 | 45.3175 |
| DistilBertForMaskedLM | 64 | 45.1635 | 47.6381 | 63.8263 | 106.7402 | 35.7986 | 35.7315 |
| OPTForCausalLM | 32 | 53.9204 | 58.4298 | 70.0095 | 164.9181 | 34.6779 | 35.1003 |
| PLBartForCausalLM | 32 | 39.1681 | 41.9381 | 49.2851 | 46.6411 | 30.5813 | 30.6012 |
| PLBartForConditionalGeneration | 16 | 64.6545 | 66.8269 | 53.5284 | nan | 29.5795 | 33.2547 |
| MT5ForConditionalGeneration | 8 | 104.0639 | 122.3635 | 57.735 | 102.7905 | 25.768 | 37.2256 |
| DistilBertForQuestionAnswering | 64 | 30.5263 | 31.5702 | 41.1354 | 89.9873 | 20.4118 | 21.0319 |
| Speech2Text2ForCausalLM | 128 | 30.2995 | 32.6285 | 42.3597 | 37.9215 | 19.4295 | 19.2431 |
| YituTechConvBert | 1 | 63.5991 | 76.0779 | 27.1782 | nan | 14.5443 | 40.8721 |
| CamemBert | 1 | 37.024 | 45.0666 | 22.0036 | nan | 11.0809 | 22.6693 |
| DistillGPT2 | 1 | 19.6298 | 24.4241 | 16.1095 | nan | 7.9468 | 10.4823 |
| BigBird | 1 | 189.3597 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
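
The absolute-latency table above reports per-backend latency in milliseconds, so a backend's speedup over eager is the ratio of the two columns. The following is a minimal sketch of that arithmetic, assuming the table has been pasted with one row per line; `parse_table`, `speedups`, and `geomean` are hypothetical helper names for illustration and are not part of the benchmark harness. The sample rows are copied from the huggingface absolute-latency table.

```python
import math


def parse_table(text):
    """Parse a pipe-delimited ASCII table into a list of {column: cell} dicts.

    Border lines (starting with '+') and blank lines are ignored.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def speedups(rows, baseline="eager", backend="inductor"):
    """Return {model: baseline latency / backend latency}, skipping nan rows."""
    out = {}
    for row in rows:
        base, new = float(row[baseline]), float(row[backend])
        if math.isnan(base) or math.isnan(new) or new == 0.0:
            continue  # the backend failed to run or produced no measurement
        out[row["name"]] = base / new
    return out


def geomean(values):
    """Geometric mean, the usual way to aggregate ratios across models."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


SAMPLE = """
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
| DistillGPT2 | 1 | 19.6298 | 24.4241 | 16.1095 | nan | 7.9468 | 10.4823 |
| CamemBert | 1 | 37.024 | 45.0666 | 22.0036 | nan | 11.0809 | 22.6693 |
| BigBird | 1 | 189.3597 | nan | nan | nan | nan | nan |
"""

rows = parse_table(SAMPLE)
ratios = speedups(rows)  # BigBird is dropped because inductor is nan
summary = geomean(ratios.values())
```

Rows where either latency is `nan` are excluded rather than counted as a speedup of zero, so the aggregate reflects only models that both backends actually ran.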

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| xcit_large_24_p8_224 | 5 | 1.003 | 0.0 | 0.0 | 0.0 | 2.1892 | 1.8395 |
| tnt_s_patch16_224 | 128 | 0.9998 | 0.9983 | 0.0 | 0.0 | 2.1287 | 2.0915 |
| ghostnet_100 | 128 | 1.0047 | 0.983 | 0.894 | 1.0143 | 2.1126 | 1.7537 |
| regnety_002 | 128 | 0.9783 | 0.9414 | 1.128 | 0.8548 | 2.11 | 1.4392 |
| lcnet_050 | 128 | 0.9696 | 0.9472 | 0.8533 | 1.0346 | 2.0175 | 1.615 |
| twins_pcpvt_base | 64 | 1.0053 | 0.9219 | 0.9192 | 0.0 | 1.7636 | 1.6748 |
| res2net101_26w_4s | 64 | 1.0198 | 1.0091 | 0.9527 | 0.0 | 1.6736 | 1.3351 |
| coat_lite_mini | 128 | 1.0001 | 0.9944 | 0.8469 | 1.1512 | 1.6547 | 1.6363 |
| hrnet_w18 | 128 | 1.0077 | 1.0303 | 0.8741 | 0.0 | 1.6055 | 1.4709 |
| volo_d1_224 | 64 | 0.9999 | 0.9944 | 0.8459 | 0.0 | 1.6003 | 1.5649 |
| dla102 | 128 | 1.0 | 0.9932 | 0.8355 | 1.3139 | 1.5804 | 1.5491 |
| gmlp_s16_224 | 128 | 0.9997 | 0.9913 | 0.7868 | 1.0112 | 1.5742 | 1.5298 |
| gmixer_24_224 | 128 | 1.0 | 0.8801 | 0.7227 | 0.9216 | 1.5589 | 1.4927 |
| nfnet_l0 | 128 | 0.9986 | 0.8099 | 0.7113 | 0.8475 | 1.5553 | 1.4563 |
| resnest101e | 64 | 0.9996 | 1.0069 | 0.8094 | 0.0 | 1.5209 | 1.4298 |
| cait_m36_384 | 4 | 1.0002 | 0.8885 | 0.0 | 0.0 | 1.5061 | 1.4554 |
| gluon_inception_v3 | 128 | 1.0 | 0.9966 | 0.8545 | 1.1428 | 1.505 | 1.4712 |
| adv_inception_v3 | 128 | 0.9998 | 0.9959 | 0.8542 | 1.1424 | 1.5028 | 1.4711 |
| inception_v3 | 128 | 1.0 | 0.9961 | 0.8538 | 1.1422 | 1.5007 | 1.4679 |
| dm_nfnet_f0 | 128 | 0.999 | 0.9989 | 0.8804 | 0.9226 | 1.5002 | 1.4301 |
| swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9606 | 0.0 | 0.0 | 1.499 | 1.4687 |
| res2net50_14w_8s | 128 | 0.9999 | 0.993 | 0.8113 | 1.0076 | 1.4749 | 1.4315 |
| mobilenetv3_large_100 | 128 | 0.9551 | 0.945 | 0.7828 | 0.9856 | 1.4524 | 1.4281 |
| crossvit_9_240 | 128 | 1.0001 | 0.9943 | 0.838 | 0.9191 | 1.4489 | 1.4198 |
| fbnetv3_b | 128 | 0.9521 | 0.9491 | 0.7743 | 0.0 | 1.4424 | 1.4001 |
| selecsls42b | 128 | 0.9996 | 0.9955 | 0.8427 | 1.282 | 1.4403 | 1.4107 |
| mnasnet_100 | 128 | 0.9536 | 0.9447 | 0.7894 | 1.205 | 1.4327 | 1.4544 |
| res2next50 | 128 | 0.9995 | 0.9959 | 0.8334 | 1.1359 | 1.4114 | 1.3461 |
| mobilenetv2_100 | 128 | 0.9511 | 0.9414 | 0.7227 | 1.1316 | 1.4062 | 1.4211 |
| jx_nest_base | 32 | 0.9998 | 0.9934 | 0.7961 | 0.0 | 1.4032 | 1.3577 |
| mobilevit_s | 64 | 0.9732 | 0.8142 | 0.6562 | 0.0 | 1.3899 | 1.3811 |
| ese_vovnet19b_dw | 128 | 0.97 | 0.9643 | 0.7685 | 1.1287 | 1.3789 | 1.3789 |
| resmlp_12_224 | 128 | 1.0 | 0.999 | 0.7826 | 1.4881 | 1.3788 | 1.3656 |
| spnasnet_100 | 128 | 0.9463 | 0.9379 | 0.7777 | 1.0953 | 1.3695 | 1.3937 |
| convit_base | 64 | 1.0 | 0.9965 | 0.8333 | 1.2357 | 1.365 | 1.4013 |
| fbnetc_100 | 128 | 0.9514 | 0.9441 | 0.7931 | 1.15 | 1.354 | 1.3754 |
| tf_efficientnet_b0 | 128 | 0.9655 | 0.8079 | 0.6673 | 0.9497 | 1.3492 | 1.3562 |
| pit_b_224 | 64 | 0.9997 | 0.9957 | 0.8222 | 0.9733 | 1.348 | 1.3436 |
| botnet26t_256 | 128 | 0.9795 | 0.9745 | 0.8124 | 1.2774 | 1.3272 | 1.3311 |
| poolformer_m36 | 64 | 0.9998 | 0.9981 | 0.8071 | 0.0 | 1.3249 | 1.2969 |
| pnasnet5large | 16 | 1.0085 | 1.0329 | 0.8483 | 0.0 | 1.3151 | 1.2725 |
| cspdarknet53 | 64 | 0.9421 | 0.9341 | 0.7565 | 1.133 | 1.3093 | 1.3215 |
| mixer_b16_224 | 128 | 1.0 | 0.9978 | 0.803 | 0.8975 | 1.2887 | 1.2775 |
| deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9914 | 0.7972 | 0.9759 | 1.284 | 1.2651 |
| beit_base_patch16_224 | 64 | 0.9999 | 0.9786 | 0.0 | 0.0 | 1.2809 | 1.2659 |
| rexnet_100 | 128 | 0.9651 | 0.8503 | 0.6918 | 0.0 | 1.2792 | 1.2779 |
| eca_botnext26ts_256 | 128 | 0.9811 | 0.8114 | 0.6718 | 1.0723 | 1.2779 | 1.2708 |
| tinynet_a | 128 | 0.9683 | 0.8085 | 0.6546 | 0.7822 | 1.2697 | 1.2628 |
| visformer_small | 128 | 1.0003 | 1.0017 | 0.841 | 0.0 | 1.2281 | 1.1759 |
| sebotnet33ts_256 | 64 | 0.9663 | 0.8367 | 0.6802 | 0.9678 | 1.1974 | 1.201 |
| vit_base_patch16_224 | 64 | 1.0 | 0.9939 | 0.8351 | 0.9114 | 1.196 | 1.1798 |
| tf_mixnet_l | 128 | 0.9807 | 0.9093 | 0.7948 | 0.0 | 1.1794 | 1.1743 |
| mixnet_l | 128 | 0.9796 | 0.9051 | 0.7897 | 0.0 | 1.1627 | 1.1564 |
| gluon_xception65 | 32 | 0.9997 | 0.9888 | 0.7549 | 0.0 | 1.1539 | 1.1248 |
| dpn107 | 32 | 0.9607 | 0.9323 | 0.7522 | 0.0 | 1.1469 | 1.1788 |
| swsl_resnext101_32x16d | 32 | 0.9994 | 0.9808 | 0.8095 | 0.0 | 1.1308 | 1.0566 |
| repvgg_a2 | 128 | 0.9431 | 0.9355 | 0.7933 | 1.0695 | 1.1038 | 1.1182 |
| gernet_l | 128 | 0.9467 | 0.9377 | 0.7653 | 1.062 | 1.0666 | 1.077 |
| convmixer_768_32 | 32 | 0.9998 | 0.9975 | 0.9232 | 0.0 | 1.0559 | 1.0491 |
| convnext_base | 64 | 0.9993 | 0.9947 | 0.8001 | 0.0 | 0.6642 | 0.6591 |
| eca_halonext26ts | 128 | 0.9807 | 0.8157 | 0.6792 | 0.0 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| cait_m36_384 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_accuracy |
| gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 6.931 | 30.7401 | 58.6979 | nan | 142.5323 | 131.6701 |
| twins_pcpvt_base | 64 | 2.9738 | 14.6442 | 25.7315 | nan | 126.8506 | 125.7099 |
| pnasnet5large | 16 | 5.2804 | 22.3685 | 41.3626 | nan | 92.4475 | 87.4784 |
| xcit_large_24_p8_224 | 5 | 3.4258 | nan | nan | nan | 87.3087 | 83.5833 |
| swin_base_patch4_window7_224 | 64 | 3.3882 | 12.8961 | nan | nan | 81.9176 | 80.2635 |
| cait_m36_384 | 4 | 3.7808 | 19.7953 | nan | nan | 76.9766 | 72.9069 |
| resnest101e | 64 | 3.6615 | 17.6271 | 27.7617 | nan | 75.1648 | 74.6052 |
| convnext_base | 64 | 1.5798 | 6.98 | 11.3547 | nan | 73.1379 | 72.8893 |
| mobilevit_s | 64 | 2.0122 | 7.6871 | 15.6413 | nan | 68.5941 | 66.8229 |
| res2net101_26w_4s | 64 | 3.6303 | 16.6415 | 27.957 | nan | 63.4994 | 59.575 |
| jx_nest_base | 32 | 2.0035 | 9.1173 | 16.022 | nan | 63.1295 | 61.3127 |
| coat_lite_mini | 128 | 1.2908 | 5.2779 | 8.3825 | 115.6852 | 59.3284 | 57.9552 |
| res2net50_14w_8s | 128 | 3.1229 | 14.6313 | 24.629 | 345.9114 | 57.5831 | 54.2875 |
| poolformer_m36 | 64 | 1.8805 | 7.3353 | 12.1435 | nan | 52.5525 | 49.5236 |
| sebotnet33ts_256 | 64 | 1.9237 | 6.1743 | 13.2933 | 154.5429 | 48.5149 | 46.47 |
| dpn107 | 32 | 4.5908 | 13.8556 | 39.4827 | nan | 46.1434 | 43.5194 |
| gluon_xception65 | 32 | 2.2854 | 11.0813 | 18.1623 | nan | 45.6181 | 42.1045 |
| fbnetv3_b | 128 | 3.4992 | 12.1504 | 28.4002 | nan | 45.4094 | 42.6809 |
| gmlp_s16_224 | 128 | 1.3999 | 7.4855 | 12.1747 | 203.0761 | 45.2984 | 43.7263 |
| tnt_s_patch16_224 | 128 | 2.066 | 10.6506 | nan | nan | 42.793 | 38.5401 |
| volo_d1_224 | 64 | 1.482 | 7.497 | 12.2298 | nan | 42.348 | 41.2258 |
| crossvit_9_240 | 128 | 1.817 | 8.4684 | 13.4886 | 204.5446 | 41.7992 | 40.0393 |
| eca_botnext26ts_256 | 128 | 1.4173 | 4.9213 | 10.4289 | 121.5053 | 40.564 | 38.9199 |
| gluon_inception_v3 | 128 | 1.8871 | 8.4286 | 13.4629 | 185.543 | 38.7639 | 36.4388 |
| inception_v3 | 128 | 1.7863 | 8.4812 | 13.462 | 184.9059 | 38.7212 | 36.2625 |
| adv_inception_v3 | 128 | 1.9318 | 8.4168 | 13.4298 | 189.824 | 38.6589 | 36.6151 |
| tf_mixnet_l | 128 | 6.0526 | 13.0653 | 27.0128 | nan | 38.1865 | 35.8228 |
| dla102 | 128 | 2.0724 | 9.7057 | 16.0251 | 243.8107 | 37.9528 | 35.6838 |
| ghostnet_100 | 128 | 3.3094 | 9.9326 | 14.8319 | 194.7976 | 37.7097 | 37.3868 |
| mixnet_l | 128 | 5.6976 | 12.8775 | 28.234 | nan | 37.3645 | 34.7884 |
| swsl_resnext101_32x16d | 32 | 2.0848 | 9.3048 | 14.6905 | nan | 36.5758 | 35.1714 |
| gmixer_24_224 | 128 | 1.6331 | 8.2084 | 13.5465 | 186.3879 | 36.1414 | 34.2949 |
| botnet26t_256 | 128 | 1.4703 | 4.4026 | 9.3454 | 91.3544 | 34.4281 | 33.4412 |
| dm_nfnet_f0 | 128 | 2.1912 | 7.8902 | 11.1043 | 158.8081 | 34.0332 | 31.8129 |
| res2next50 | 128 | 1.8036 | 8.0888 | 12.9439 | 198.2322 | 32.6142 | 30.731 |
| tinynet_a | 128 | 2.3532 | 8.1809 | 19.8612 | 195.0232 | 31.6431 | 29.3387 |
| rexnet_100 | 128 | 2.0816 | 7.4863 | 17.3374 | nan | 31.5572 | 29.6037 |
| convit_base | 64 | 1.3122 | 6.0107 | 10.0807 | 147.3174 | 30.9871 | 29.3949 |
| cspdarknet53 | 64 | 2.5838 | 7.4509 | 18.9471 | 146.7427 | 27.1061 | 25.2952 |
| tf_efficientnet_b0 | 128 | 2.0106 | 6.9416 | 16.429 | 182.0044 | 26.9815 | 25.5845 |
| convmixer_768_32 | 32 | 1.4065 | 6.4721 | 9.9094 | nan | 25.8546 | 25.2263 |
| fbnetc_100 | 128 | 2.3014 | 6.8053 | 17.4548 | 136.8819 | 25.7144 | 24.2387 |
| mixer_b16_224 | 128 | 0.9793 | 3.8537 | 5.9731 | 86.5881 | 25.3005 | 24.5996 |
| mobilenetv3_large_100 | 128 | 1.7463 | 5.7444 | 13.4289 | 143.0498 | 25.2761 | 22.675 |
| spnasnet_100 | 128 | 2.2206 | 6.6813 | 16.9706 | 135.4312 | 25.2591 | 23.9768 |
| visformer_small | 128 | 1.106 | 4.0464 | 6.3061 | nan | 25.1474 | 23.894 |
| deit_base_distilled_patch16_224 | 64 | 1.1488 | 4.8066 | 7.3554 | 89.0701 | 24.7207 | 23.3586 |
| pit_b_224 | 64 | 1.1353 | 5.2141 | 8.3164 | 109.3932 | 24.6725 | 23.8192 |
| nfnet_l0 | 128 | 2.0483 | 7.3006 | 10.8474 | 148.4647 | 24.2357 | 23.5787 |
| vit_base_patch16_224 | 64 | 1.1285 | 4.6221 | 7.21 | 86.5021 | 24.1113 | 23.3259 |
| resmlp_12_224 | 128 | 0.8091 | 3.0904 | 4.8725 | 50.9869 | 23.6155 | 23.4652 |
| beit_base_patch16_224 | 64 | 1.4768 | 5.4295 | nan | nan | 23.0677 | 22.4542 |
| repvgg_a2 | 128 | 2.2835 | 6.1529 | 15.7903 | 186.6594 | 22.306 | 21.4525 |
| mobilenetv2_100 | 128 | 1.9074 | 5.6638 | 13.281 | 113.967 | 21.8558 | 21.5107 |
| regnety_002 | 128 | 1.7572 | 5.6279 | 13.2906 | 117.269 | 21.2603 | 20.5616 |
| mnasnet_100 | 128 | 1.8697 | 5.5091 | 13.3609 | 108.3648 | 21.1641 | 20.0444 |
| gernet_l | 128 | 2.2389 | 6.1854 | 15.9676 | 112.1682 | 21.0165 | 19.9186 |
| selecsls42b | 128 | 0.9489 | 3.7853 | 6.2302 | 88.4748 | 18.7873 | 17.4627 |
| lcnet_050 | 128 | 1.1382 | 3.4543 | 7.485 | 78.1401 | 15.1857 | 14.5235 |
| ese_vovnet19b_dw | 128 | 1.15 | 3.1606 | 6.7522 | 66.4355 | 14.5221 | 13.6098 |
| eca_halonext26ts | 128 | 1.4558 | 5.0927 | 11.1443 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| tinynet_a | 128 | 0.9889 | 0.7884 | 0.2767 | 0.4726 | 1.3706 | 1.5063 |
| gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 |
| gmlp_s16_224 | 128 | 0.9938 | 0.9715 | 0.3561 | 1.3557 | 1.284 | 1.2997 |
| tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.548 | 1.1886 | 1.3558 |
| mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2733 | nan | 1.1831 | 1.3111 |
| pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3633 | nan | 1.1605 | 1.2933 |
| rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.1474 | 1.3179 |
| eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2669 | 0.476 | 1.1067 | 1.2643 |
| poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1021 | 1.1167 |
| tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0828 | 1.1492 |
| resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 1.0592 | 1.1461 |
| mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 1.0587 | 1.152 |
| convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0528 | 1.1534 |
| volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.0378 | 1.1389 |
| dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 1.0332 | 1.1293 |
| nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2681 | 0.3766 | 1.0332 | 1.1822 |
| beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 |
| pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9907 | 1.2248 |
| fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.9862 | 1.0421 |
| convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9746 | 0.9788 |
| twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3402 | nan | 0.9699 | 1.0818 |
| visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9622 | 1.0521 |
| dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9309 | 0.9556 | 1.031 |
| ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.9489 | 1.0707 |
| tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2875 | nan | 0.9363 | 1.0878 |
| xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9318 | 0.9931 |
| mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.9307 | 1.0268 |
| cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.929 | 0.9804 |
| ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.9181 | 1.0684 |
| swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | nan | 0.9113 | 0.981 |
| mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.9088 | 0.9818 |
| dpn107 | 32 | 0.997 | 0.9097 | 0.3531 | nan | 0.9072 | 0.9966 |
| res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8977 | 0.973 |
| gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8975 | 0.9763 |
| gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 |
| inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8975 | 1.0248 |
| adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8975 | 1.0248 |
| fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.8973 | 0.9876 |
| hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.8969 | 1.0032 |
| selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8765 | 0.8926 | 0.9897 |
| vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 1.222 | 0.8916 | 0.8968 |
| deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.3591 | 1.2167 | 0.8911 | 0.8962 |
| convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.885 | 0.9865 |
| spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8795 | 0.9819 |
| res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.877 | 0.9738 |
| res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.8719 | 0.9671 |
| mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.871 | 0.9804 |
| mixnet_l | 128 | 0.9902 | 0.8441 | 0.2716 | nan | 0.8701 | 1.0089 |
| gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8161 | 0.8619 | 0.9858 |
| cspdarknet53 | 64 | 0.9915 | 0.8405 | 0.3241 | 0.8386 | 0.8607 | 1.0102 |
| botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.8503 | 0.9434 |
| lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3361 | 0.8188 | 0.8449 | 0.9432 |
| regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.8371 | 1.0078 |
| crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.8174 | 1.0976 |
| resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.8092 | 0.8236 |
| coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.8033 | 1.0359 |
| swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 |
| sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5513 | 0.7449 | 0.8293 |
| jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6707 | 0.8617 |
| repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5534 | 0.8298 |
| eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.2673 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms) ~~~
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| convmixer_768_32 | 32 | 296.9945 | 297.4343 | 321.4081 | nan | 281.092 | 282.7367 |
| hrnet_w18 | 128 | 297.8396 | 290.3868 | 350.2986 | nan | 187.874 | 202.6381 |
| convnext_base | 64 | 121.341 | 121.9435 | 151.5422 | nan | 182.6722 | 183.8139 |
| tnt_s_patch16_224 | 128 | 363.5377 | 363.7751 | nan | nan | 170.5356 | 173.5391 |
| pnasnet5large | 16 | 219.8939 | 210.8888 | 258.6478 | nan | 169.2702 | 173.4501 |
| tf_mixnet_l | 128 | 194.8724 | 210.3754 | 240.6587 | nan | 162.0399 | 162.8161 |
| mixnet_l | 128 | 186.683 | 201.8745 | 231.3633 | nan | 157.3799 | 158.1411 |
| convit_base | 64 | 181.2498 | 181.8345 | 217.5178 | 146.4827 | 132.8497 | 129.2316 |
| pit_b_224 | 64 | 154.9332 | 155.4261 | 188.3511 | 159.0107 | 114.8505 | 115.2567 |
| dla102 | 128 | 178.3358 | 179.8599 | 213.8806 | 135.8186 | 112.9657 | 115.0798 |
| poolformer_m36 | 64 | 148.6818 | 148.8301 | 184.2216 | nan | 112.1766 | 114.7884 |
| cait_m36_384 | 4 | 166.0674 | 189.273 | nan | nan | 110.4874 | 113.9671 |
| resnest101e | 64 | 163.2773 | 173.4498 | 199.7942 | nan | 107.5042 | 113.8459 |
| gluon_inception_v3 | 128 | 161.189 | 161.8985 | 188.6543 | 140.8407 | 107.1383 | 109.5614 |
| inception_v3 | 128 | 160.8887 | 161.4577 | 188.1659 | 140.4689 | 107.0772 | 109.3848 |
| adv_inception_v3 | 128 | 160.9372 | 162.0016 | 188.695 | 140.9753 | 107.0632 | 109.5254 |
| beit_base_patch16_224 | 64 | 135.0013 | 137.9937 | nan | nan | 105.5047 | 106.676 |
| swsl_resnext101_32x16d | 32 | 117.8315 | 120.0715 | 145.8304 | nan | 104.0101 | 111.5879 |
| vit_base_patch16_224 | 64 | 120.7583 | 121.1663 | 144.342 | 132.2261 | 100.7066 | 102.0855 |
| res2net50_14w_8s | 128 | 145.9321 | 146.5905 | 180.0589 | 144.4657 | 99.7081 | 104.9877 |
| swin_base_patch4_window7_224 | 64 | 147.177 | 153.4413 | nan | nan | 98.3562 | 100.3591 |
| res2next50 | 128 | 138.1382 | 138.5115 | 166.313 | 121.9903 | 97.8966 | 102.7908 |
| dpn107 | 32 | 118.2478 | 115.5224 | 142.5973 | nan | 93.007 | 92.0145 |
| mixer_b16_224 | 128 | 118.6522 | 118.6356 | 147.7478 | 131.9468 | 91.9443 | 93.0647 |
| dm_nfnet_f0 | 128 | 131.1278 | 130.9349 | 148.9232 | 141.9711 | 87.221 | 91.8363 |
| gmlp_s16_224 | 128 | 136.2306 | 137.3141 | 173.1269 | 134.6209 | 86.5007 | 89.0014 |
| eca_botnext26ts_256 | 128 | 112.0124 | 135.2536 | 163.691 | 102.4387 | 86.0635 | 86.4367 |
| gluon_xception65 | 32 | 97.9529 | 98.7335 | 129.6594 | nan | 84.9418 | 86.8637 |
| fbnetv3_b | 128 | 120.9923 | 125.7692 | 148.6435 | nan | 84.8281 | 86.4507 |
| jx_nest_base | 32 | 118.9289 | 119.6368 | 149.311 | nan | 84.7742 | 87.6161 |
| volo_d1_224 | 64 | 134.6694 | 134.9775 | 159.0741 | nan | 84.1126 | 85.8951 |
| visformer_small | 128 | 98.083 | 97.8843 | 116.6173 | nan | 79.7432 | 83.362 |
| botnet26t_256 | 128 | 106.0198 | 106.3347 | 127.745 | 81.1725 | 78.1684 | 77.9361 |
| res2net101_26w_4s | 64 | 127.6017 | 120.964 | 126.4756 | nan | 77.8318 | 94.6271 |
| gmixer_24_224 | 128 | 119.8805 | 136.279 | 166.401 | 130.039 | 76.9061 | 80.2235 |
| crossvit_9_240 | 128 | 109.3072 | 109.7657 | 130.3191 | 118.6545 | 75.5394 | 76.9574 |
| twins_pcpvt_base | 64 | 125.0295 | 135.6559 | 137.823 | nan | 73.9324 | 78.3689 |
| deit_base_distilled_patch16_224 | 64 | 94.2363 | 94.9786 | 118.1687 | 96.4618 | 73.3593 | 74.4074 |
| gernet_l | 128 | 79.7897 | 80.4731 | 98.8127 | 71.0728 | 70.9133 | 70.1418 |
| coat_lite_mini | 128 | 115.8724 | 116.5157 | 136.8266 | 100.549 | 70.12 | 70.872 |
| cspdarknet53 | 64 | 96.1418 | 96.6984 | 119.6801 | 79.8584 | 69.0712 | 68.2867 |
| rexnet_100 | 128 | 90.8068 | 103.0944 | 127.0571 | nan | 68.6523 | 68.6545 |
| repvgg_a2 | 128 | 79.5925 | 80.2347 | 94.7347 | 70.2912 | 68.2308 | 67.2124 |
| nfnet_l0 | 128 | 105.5165 | 130.6724 | 148.3881 | 124.5357 | 68.0984 | 72.6384 |
| sebotnet33ts_256 | 64 | 83.2758 | 95.9662 | 118.327 | 82.9965 | 67.1227 | 66.8513 |
| tf_efficientnet_b0 | 128 | 90.5188 | 108.2452 | 131.2481 | 91.9538 | 64.8267 | 64.3374 |
| mobilevit_s | 64 | 90.0666 | 107.4926 | 133.3245 | nan | 63.0709 | 63.3077 |
| fbnetc_100 | 128 | 87.9808 | 88.7171 | 105.5111 | 72.7112 | 61.9021 | 60.9224 |
| xcit_large_24_p8_224 | 5 | 124.838 | nan | nan | nan | 60.2389 | 72.8432 |
| tinynet_a | 128 | 78.7391 | 90.2587 | 110.5799 | 93.0059 | 58.0636 | 58.2381 |
| spnasnet_100 | 128 | 76.5876 | 77.2466 | 93.1281 | 66.1509 | 52.8896 | 51.9946 |
| resmlp_12_224 | 128 | 68.1829 | 68.2369 | 87.2823 | 45.8592 | 49.5315 | 49.9044 |
| ese_vovnet19b_dw | 128 | 67.878 | 68.3525 | 85.8418 | 58.3454 | 47.8939 | 47.7877 |
| mnasnet_100 | 128 | 70.0336 | 70.7316 | 84.5374 | 55.4273 | 46.6834 | 45.9789 |
| ghostnet_100 | 128 | 96.2671 | 98.0873 | 107.538 | 95.1889 | 46.0747 | 57.1455 |
| mobilenetv2_100 | 128 | 67.529 | 68.3136 | 89.0367 | 56.7875 | 45.7369 | 45.1937 |
| selecsls42b | 128 | 62.7482 | 63.0691 | 74.4788 | 48.9765 | 43.5806 | 44.5191 |
| mobilenetv3_large_100 | 128 | 65.777 | 66.5782 | 80.4633 | 63.8538 | 43.3699 | 44.0962 |
| regnety_002 | 128 | 54.0596 | 55.7532 | 46.9872 | 61.0882 | 
25.4154 | 37.1898 | | lcnet_050 | 128 | 33.9255 | 34.839 | 38.6885 | 31.679 | 16.3399 | 20.6194 | | eca_halonext26ts | 128 | 116.0655 | 139.5215 | 167.9839 | nan | nan | nan | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/0DXVoJJ.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/9FeaMaO.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/PkQuR3o.png)

anijain2305 commented 1 year ago

(One-off) Performance Dashboard for amp precision

We changed the batch sizes and sequence lengths of HF models to more accurately represent these models. This dashboard run is a one-off experiment to get the new speedups.

Executive Summary

We evaluate different backends across three benchmark suites - torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) Experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
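As a rough illustration of the filtering described above, the sketch below drops models that fail the accuracy check and takes the geometric mean of the remaining per-model speedups. The function name `geomean_speedup`, the dictionary layout, and the choice to skip 0.0 speedups are assumptions made for this example, not the dashboard's actual code. The sample values come from the inductor column of the huggingface amp tables in this report.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups, skipping models that fail accuracy.

    speedups: dict of model name -> speedup over eager (hypothetical layout)
    accuracy: dict of model name -> status string such as "pass"
    """
    kept = [
        s
        for name, s in speedups.items()
        # Assumption: a 0.0 speedup marks a run that did not complete.
        if accuracy.get(name) == "pass" and s > 0
    ]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


speedups = {
    "GPT2ForSequenceClassification": 2.2334,
    "OPTForCausalLM": 2.0051,
    "MT5ForConditionalGeneration": 2.0936,
}
accuracy = {
    "GPT2ForSequenceClassification": "pass",
    "OPTForCausalLM": "pass",
    "MT5ForConditionalGeneration": "fail_accuracy",
}

# MT5ForConditionalGeneration is excluded because it fails the accuracy check.
result = geomean_speedup(speedups, accuracy)
print(f"{result:.2f}x")
```

A geometric mean is the natural choice for ratios like these, since a 2x speedup and a 0.5x slowdown average out to 1x.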

Passrate

+------------------------+-------------+
|        Compiler        | huggingface |
+------------------------+-------------+
|         eager          | 93%, 43/46  |
|        inductor        | 83%, 38/46  |
| inductor_no_cudagraphs | 85%, 39/46  |
+------------------------+-------------+
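Each cell in the pass-rate table reads as a rounded percentage followed by passed/total. A minimal sketch of that formatting is shown below. The helper name `passrate` and the list-of-statuses input are hypothetical, and counting only the `pass` status as a success is an assumption.

```python
def passrate(statuses):
    """Format a pass rate as '<percent>%, <passed>/<total>'."""
    passed = sum(1 for s in statuses if s == "pass")
    total = len(statuses)
    return f"{round(100 * passed / total)}%, {passed}/{total}"


# Hypothetical status list shaped like the inductor row: 38 of 46 models pass.
inductor_statuses = ["pass"] * 38 + ["fail_to_run"] * 6 + ["fail_accuracy"] * 2
cell = passrate(inductor_statuses)
print(cell)
```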

Geometric mean speedup

+------------------------+-------------+
|        Compiler        | huggingface |
+------------------------+-------------+
|         eager          |    1.00x    |
|        inductor        |    1.56x    |
| inductor_no_cudagraphs |    1.51x    |
+------------------------+-------------+

Mean compilation time (seconds)

+------------------------+-------------+
|        Compiler        | huggingface |
+------------------------+-------------+
|         eager          |    2.98     |
|        inductor        |    38.38    |
| inductor_no_cudagraphs |    33.29    |
+------------------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+-------------+
|        Compiler        | huggingface |
+------------------------+-------------+
|         eager          |    1.00x    |
|        inductor        |    0.92x    |
| inductor_no_cudagraphs |    1.07x    |
+------------------------+-------------+
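To make the "higher is better" reading concrete, the sketch below assumes the compression ratio is the eager peak memory divided by the compiled peak memory. That definition is our assumption for illustration; the report itself does not spell out the formula. Under it, a ratio below 1.0 means the compiled run uses more peak memory than eager.

```python
def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Assumed definition: eager peak memory / compiled peak memory."""
    return eager_peak_bytes / compiled_peak_bytes


# Hypothetical peaks: eager uses 10 GiB, the compiled run uses 12.5 GiB.
GIB = 1024 ** 3
ratio = compression_ratio(10 * GIB, 12.5 * GIB)

# A ratio below 1.0 means the compiled run needs more memory than eager.
print(f"{ratio:.2f}x")
```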

Summary Statistics Diff

Warnings

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_amp_332

Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_amp_119

Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_amp_332

Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_amp_119

Accuracy regressions
~~~
+------------------------+-----------------------------+-------------+---------------+
| compiler               | name                        | prev_status | cur_status    |
+------------------------+-----------------------------+-------------+---------------+
| inductor               | MT5ForConditionalGeneration | pass        | fail_accuracy |
| inductor_no_cudagraphs | MT5ForConditionalGeneration | pass        | fail_accuracy |
+------------------------+-----------------------------+-------------+---------------+
~~~

No regressions found.
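The comparison between two reports can be sketched as a diff over per-model statuses. This is a simplified illustration only: the function name `accuracy_regressions` and the dict-based report layout are hypothetical, and the real script reads the report directories named above rather than in-memory dicts.

```python
def accuracy_regressions(prev, cur):
    """Return (name, prev_status, cur_status) for models that passed before and no longer pass."""
    rows = []
    for name, cur_status in sorted(cur.items()):
        prev_status = prev.get(name)
        # Only models that previously passed count as regressions.
        if prev_status == "pass" and cur_status != "pass":
            rows.append((name, prev_status, cur_status))
    return rows


# Hypothetical status maps for one compiler in the previous and current reports.
prev_report = {"MT5ForConditionalGeneration": "pass", "OPTForCausalLM": "pass"}
cur_report = {"MT5ForConditionalGeneration": "fail_accuracy", "OPTForCausalLM": "pass"}

regressions = accuracy_regressions(prev_report, cur_report)
for name, before, after in regressions:
    print(f"{name}: {before} -> {after}")
```

Models that were already failing in the previous report are deliberately left out, which matches the "previously unflagged" wording above.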

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+----------+------------------------+ | name | bs | eager | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 1.0001 | 2.2334 | 2.1974 | | MobileBertForMaskedLM | 64 | 1.0236 | 2.1551 | 1.4366 | | ElectraForQuestionAnswering | 64 | 1.0004 | 2.1143 | 2.0666 | | MT5ForConditionalGeneration | 16 | 1.024 | 2.0936 | 1.7912 | | OPTForCausalLM | 2 | 1.0001 | 2.0051 | 1.9961 | | LayoutLMForSequenceClassification | 16 | 1.0001 | 1.857 | 1.8093 | | XLNetLMHeadModel | 8 | 1.0011 | 1.8511 | 1.8522 | | RobertaForQuestionAnswering | 16 | 0.9999 | 1.8314 | 1.7871 | | BertForQuestionAnswering | 16 | 1.0003 | 1.8287 | 1.7809 | | ElectraForCausalLM | 32 | 0.9995 | 1.8285 | 1.8308 | | RobertaForCausalLM | 16 | 1.0004 | 1.7058 | 1.6888 | | AlbertForQuestionAnswering | 4 | 1.0001 | 1.6742 | 1.6633 | | PLBartForConditionalGeneration | 4 | 1.0015 | 1.6672 | 1.6566 | | AlbertForMaskedLM | 4 | 1.0001 | 1.6617 | 1.6543 | | DistillGPT2 | 16 | 0.9998 | 1.6615 | 1.6903 | | MobileBertForQuestionAnswering | 128 | 1.0256 | 1.6529 | 1.3969 | | LayoutLMForMaskedLM | 16 | 1.0003 | 1.6446 | 1.6348 | | MegatronBertForQuestionAnswering | 8 | 0.9993 | 1.6254 | 1.5893 | | PLBartForCausalLM | 8 | 1.0005 | 1.6219 | 1.6424 | | BertForMaskedLM | 16 | 1.0003 | 1.6196 | 1.5997 | | T5ForConditionalGeneration | 4 | 1.0008 | 1.613 | 1.6745 | | T5Small | 4 | 1.0062 | 1.6046 | 1.6729 | | MegatronBertForCausalLM | 4 | 1.0018 | 1.5734 | 1.519 | | CamemBert | 16 | 1.0005 | 1.5561 | 1.5482 | | DistilBertForQuestionAnswering | 256 | 0.9999 | 1.5072 | 1.4981 | | YituTechConvBert | 16 | 1.0006 | 1.474 | 1.4493 | | MBartForConditionalGeneration | 2 | 1.0003 | 1.446 | 1.4147 | | MBartForCausalLM | 4 | 1.0003 | 1.4458 | 1.4496 | | BartForConditionalGeneration | 2 | 1.0012 | 1.4373 | 1.4043 | | BartForCausalLM | 4 | 1.0008 | 
1.4363 | 1.436 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0009 | 1.3922 | 1.4002 | | M2M100ForConditionalGeneration | 16 | 1.0497 | 1.3702 | 1.3683 | | Speech2Text2ForCausalLM | 256 | 0.9987 | 1.3367 | 1.3792 | | XGLMForCausalLM | 8 | 1.0148 | 1.3316 | 1.2233 | | TrOCRForCausalLM | 32 | 1.0 | 1.2875 | 1.288 | | PegasusForConditionalGeneration | 32 | 0.9993 | 1.2588 | 1.2463 | | DistilBertForMaskedLM | 128 | 0.9999 | 1.2535 | 1.27 | | BlenderbotSmallForCausalLM | 64 | 0.9991 | 1.243 | 1.2565 | | Reformer | 16 | 1.0 | 1.1808 | 1.1958 | | PegasusForCausalLM | 32 | 0.9991 | 1.1657 | 1.1691 | | DebertaForQuestionAnswering | 8 | 0.9957 | 1.1404 | 1.2817 | | DebertaForMaskedLM | 4 | 0.9034 | 1.1283 | 1.0111 | | DebertaV2ForMaskedLM | 1 | 0.8888 | 1.0459 | 0.7934 | | DebertaV2ForQuestionAnswering | 2 | 0.9394 | 1.0005 | 0.9044 | | BlenderbotForCausalLM | 4 | 1.002 | 0.0 | 1.1318 | | AllenaiLongformerBase | 4 | 1.0003 | 0.0 | 0.0 | +-----------------------------------------+-----+--------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------+---------------+------------------------+ | name | bs | eager | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------------+---------------+------------------------+ | AlbertForMaskedLM | 1 | pass | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | | OPTForCausalLM | 1 | pass | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | pass | 
| T5Small | 1 | pass | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | pass | | YituTechConvBert | 1 | pass | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | | BartForCausalLM | 1 | pass | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | | CamemBert | 1 | pass | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | | DebertaV2ForQuestionAnswering | 1 | pass | fail_to_run | pass | | MBartForConditionalGeneration | 1 | pass | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | fail_to_run | fail_to_run | | DebertaV2ForMaskedLM | 1 | fail_to_run | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | pass | fail_accuracy | fail_accuracy | | MT5ForConditionalGeneration | 1 | pass | fail_accuracy | fail_accuracy | | BlenderbotForCausalLM | 0 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------------+----+-------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ 
+-----------------------------------------+-----+---------+----------+------------------------+ | name | bs | eager | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+---------+----------+------------------------+ | XLNetLMHeadModel | 8 | 4.9613 | 187.0385 | 134.3723 | | DebertaV2ForQuestionAnswering | 2 | 8.5691 | 178.346 | 53.6695 | | DebertaV2ForMaskedLM | 1 | 8.6028 | 176.9551 | 53.9938 | | DebertaForQuestionAnswering | 8 | 4.9051 | 95.9909 | 37.3566 | | DebertaForMaskedLM | 4 | 4.8628 | 95.2405 | 36.503 | | XGLMForCausalLM | 8 | 3.106 | 77.0533 | 74.3686 | | MobileBertForQuestionAnswering | 128 | 10.1267 | 74.5319 | 72.6252 | | MobileBertForMaskedLM | 64 | 10.0809 | 71.7365 | 69.847 | | M2M100ForConditionalGeneration | 16 | 3.9241 | 63.4812 | 54.3245 | | MT5ForConditionalGeneration | 16 | 4.2024 | 56.9742 | 55.4957 | | PegasusForConditionalGeneration | 32 | 3.717 | 55.1608 | 50.6406 | | BartForConditionalGeneration | 2 | 3.9296 | 54.4663 | 52.9711 | | MBartForConditionalGeneration | 2 | 3.9132 | 53.4826 | 52.749 | | MegatronBertForQuestionAnswering | 8 | 4.0669 | 43.877 | 41.393 | | MegatronBertForCausalLM | 4 | 4.1484 | 42.9774 | 42.0441 | | YituTechConvBert | 16 | 2.7723 | 42.0768 | 40.9871 | | BlenderbotSmallForConditionalGeneration | 64 | 2.5376 | 36.2448 | 35.4186 | | T5ForConditionalGeneration | 4 | 2.7409 | 30.4526 | 29.759 | | T5Small | 4 | 2.7395 | 30.0375 | 29.4543 | | PLBartForConditionalGeneration | 4 | 2.0915 | 29.7522 | 28.5098 | | LayoutLMForSequenceClassification | 16 | 1.9478 | 28.3268 | 27.702 | | PegasusForCausalLM | 32 | 1.5507 | 24.9132 | 23.2426 | | ElectraForCausalLM | 32 | 1.9471 | 24.5517 | 24.0295 | | MBartForCausalLM | 4 | 1.5598 | 23.3789 | 22.7771 | | BartForCausalLM | 4 | 1.5526 | 22.5939 | 21.8705 | | LayoutLMForMaskedLM | 16 | 1.9556 | 22.551 | 22.0081 | | BertForMaskedLM | 16 | 1.9575 | 22.378 | 21.405 | | RobertaForCausalLM | 16 | 1.9702 | 22.0967 | 21.5283 | | TrOCRForCausalLM | 32 | 1.4398 
| 21.9883 | 20.9558 | | ElectraForQuestionAnswering | 64 | 1.9535 | 21.8811 | 21.1699 | | BertForQuestionAnswering | 16 | 1.9803 | 21.3653 | 20.48 | | CamemBert | 16 | 1.9428 | 21.0836 | 20.3612 | | RobertaForQuestionAnswering | 16 | 1.9726 | 20.3014 | 19.3736 | | OPTForCausalLM | 2 | 1.5329 | 20.0407 | 19.4769 | | GPT2ForSequenceClassification | 4 | 1.7074 | 19.2432 | 18.8297 | | AlbertForMaskedLM | 4 | 1.6901 | 18.7573 | 18.3081 | | AlbertForQuestionAnswering | 4 | 1.7003 | 18.365 | 17.9981 | | BlenderbotSmallForCausalLM | 64 | 1.0481 | 16.8901 | 16.6689 | | Reformer | 16 | 1.5553 | 16.6821 | 14.1788 | | Speech2Text2ForCausalLM | 256 | 0.8644 | 15.0604 | 13.0639 | | PLBartForCausalLM | 8 | 0.8719 | 14.8384 | 14.8013 | | DistillGPT2 | 16 | 0.9185 | 13.229 | 13.1179 | | DistilBertForMaskedLM | 128 | 0.8169 | 12.4229 | 12.133 | | DistilBertForQuestionAnswering | 256 | 0.8509 | 11.8337 | 11.3955 | | BlenderbotForCausalLM | 4 | 2.8764 | nan | 43.2461 | | AllenaiLongformerBase | 4 | 4.8228 | nan | nan | +-----------------------------------------+-----+---------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+----------+------------------------+ | name | bs | eager | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+----------+------------------------+ | OPTForCausalLM | 2 | 1.0004 | 1.1425 | 1.2019 | | AlbertForQuestionAnswering | 4 | 1.0 | 1.1305 | 1.559 | | AlbertForMaskedLM | 4 | 0.9999 | 1.0992 | 1.5169 | | GPT2ForSequenceClassification | 4 | 1.0001 | 1.0775 | 1.1712 | | MBartForCausalLM | 4 | 1.0 | 1.0747 | 1.1342 | | BartForCausalLM | 4 | 1.0 | 1.0568 | 1.1144 | | XLNetLMHeadModel | 8 | 0.9999 | 1.0303 | 1.0303 | | ElectraForQuestionAnswering | 64 | 1.0016 | 1.017 | 1.0704 | | MBartForConditionalGeneration | 2 | 1.0 | 1.0078 | 1.2186 | | LayoutLMForSequenceClassification | 16 | 1.004 | 1.0044 | 1.0277 | | PegasusForConditionalGeneration 
| 32 | 0.9979 | 1.0039 | 1.1394 | | RobertaForQuestionAnswering | 16 | 1.004 | 1.0036 | 1.0618 | | BertForQuestionAnswering | 16 | 1.004 | 1.0029 | 1.0617 | | TrOCRForCausalLM | 32 | 0.9998 | 0.9842 | 1.0414 | | BartForConditionalGeneration | 2 | 1.0 | 0.9839 | 1.1976 | | DistilBertForQuestionAnswering | 256 | 1.0112 | 0.9806 | 1.0864 | | DistillGPT2 | 16 | 1.0 | 0.9755 | 1.0618 | | PegasusForCausalLM | 32 | 0.9749 | 0.9708 | 1.0342 | | T5Small | 4 | 0.9998 | 0.9662 | 1.1856 | | T5ForConditionalGeneration | 4 | 0.9998 | 0.9662 | 1.1856 | | PLBartForConditionalGeneration | 4 | 1.0 | 0.9656 | 1.0849 | | YituTechConvBert | 16 | 0.9999 | 0.9596 | 1.0043 | | BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.9593 | 1.1105 | | MegatronBertForQuestionAnswering | 8 | 1.0006 | 0.9562 | 1.0239 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9481 | 0.9848 | | BertForMaskedLM | 16 | 1.0001 | 0.9481 | 0.9849 | | RobertaForCausalLM | 16 | 1.0 | 0.9475 | 0.9847 | | CamemBert | 16 | 1.0 | 0.9446 | 0.983 | | MT5ForConditionalGeneration | 16 | 1.0015 | 0.9203 | 1.0032 | | PLBartForCausalLM | 8 | 0.9999 | 0.9162 | 0.9889 | | MegatronBertForCausalLM | 4 | 1.0 | 0.9121 | 1.0221 | | DistilBertForMaskedLM | 128 | 1.0 | 0.8716 | 0.9439 | | Speech2Text2ForCausalLM | 256 | 0.9668 | 0.8672 | 0.9793 | | ElectraForCausalLM | 32 | 0.9999 | 0.8601 | 0.9327 | | M2M100ForConditionalGeneration | 16 | 1.0002 | 0.8468 | 1.0449 | | BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.846 | 0.9426 | | XGLMForCausalLM | 8 | 0.9917 | 0.7979 | 0.9823 | | MobileBertForMaskedLM | 64 | 0.9999 | 0.6698 | 0.9649 | | DebertaV2ForMaskedLM | 1 | 0.9982 | 0.6117 | 0.9912 | | MobileBertForQuestionAnswering | 128 | 1.0159 | 0.5988 | 0.8126 | | Reformer | 16 | 0.9738 | 0.5813 | 0.9765 | | DebertaV2ForQuestionAnswering | 2 | 0.9795 | 0.5266 | 0.9885 | | DebertaForMaskedLM | 4 | 0.9982 | 0.409 | 1.0674 | | DebertaForQuestionAnswering | 8 | 0.9543 | 0.3071 | 1.1616 | | BlenderbotForCausalLM | 4 | 1.0002 | nan | 0.9343 | | 
AllenaiLongformerBase | 4 | 0.9984 | nan | nan | +-----------------------------------------+-----+--------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+----------+----------+------------------------+ | name | bs | eager | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+----------+----------+------------------------+ | Reformer | 16 | 299.454 | 254.0543 | 250.4981 | | AlbertForMaskedLM | 4 | 267.1382 | 160.7691 | 161.4885 | | AlbertForQuestionAnswering | 4 | 265.1046 | 158.4007 | 159.2928 | | XLNetLMHeadModel | 8 | 275.0372 | 148.0387 | 148.166 | | PegasusForConditionalGeneration | 32 | 139.0743 | 110.5775 | 111.7656 | | TrOCRForCausalLM | 32 | 136.8119 | 106.3152 | 106.1294 | | DebertaV2ForQuestionAnswering | 2 | 101.3922 | 95.1388 | 105.0077 | | BartForConditionalGeneration | 2 | 133.765 | 93.5861 | 95.5794 | | MBartForConditionalGeneration | 2 | 133.8285 | 92.6951 | 94.941 | | YituTechConvBert | 16 | 133.4038 | 90.2775 | 91.7192 | | MegatronBertForQuestionAnswering | 8 | 140.8927 | 86.7515 | 88.5807 | | MobileBertForQuestionAnswering | 128 | 136.5729 | 83.4223 | 99.0028 | | BartForCausalLM | 4 | 111.779 | 77.7521 | 77.7712 | | MBartForCausalLM | 4 | 111.9142 | 77.2773 | 77.0949 | | DebertaV2ForMaskedLM | 1 | 91.0149 | 77.1094 | 102.5667 | | BlenderbotSmallForConditionalGeneration | 64 | 106.7314 | 76.9502 | 76.6337 | | CamemBert | 16 | 117.9447 | 75.7728 | 76.0605 | | M2M100ForConditionalGeneration | 16 | 89.6193 | 69.5873 | 69.936 | | PLBartForCausalLM | 8 | 111.9812 | 69.2046 | 68.2971 | | PLBartForConditionalGeneration | 4 | 115.0229 | 69.1518 | 69.603 | | DistilBertForQuestionAnswering | 256 | 102.7278 | 68.1873 | 68.6147 | | LayoutLMForMaskedLM | 16 | 111.8698 | 68.0072 | 68.4826 | | BertForMaskedLM | 16 | 109.4799 | 67.5049 | 68.3219 | | DistilBertForMaskedLM | 128 | 84.1116 | 67.0676 | 66.1619 | | RobertaForCausalLM | 16 | 114.4949 | 67.0637 | 
67.6437 | | DebertaForQuestionAnswering | 8 | 74.8492 | 65.6254 | 58.0744 | | MobileBertForMaskedLM | 64 | 135.0426 | 64.6374 | 97.3842 | | T5Small | 4 | 101.0668 | 62.9117 | 60.3998 | | T5ForConditionalGeneration | 4 | 100.9635 | 62.9039 | 60.3231 | | DistillGPT2 | 16 | 104.1086 | 62.6267 | 61.577 | | OPTForCausalLM | 2 | 122.6294 | 61.2126 | 61.5297 | | PegasusForCausalLM | 32 | 68.7375 | 58.6923 | 58.6298 | | ElectraForQuestionAnswering | 64 | 114.2575 | 54.063 | 55.2486 | | MegatronBertForCausalLM | 4 | 85.0683 | 53.9654 | 56.1493 | | LayoutLMForSequenceClassification | 16 | 96.8657 | 52.232 | 53.5407 | | RobertaForQuestionAnswering | 16 | 94.8538 | 51.8078 | 53.1693 | | BertForQuestionAnswering | 16 | 94.483 | 51.7719 | 53.0635 | | XGLMForCausalLM | 8 | 67.8391 | 51.5611 | 56.6399 | | ElectraForCausalLM | 32 | 87.2095 | 47.5046 | 47.4558 | | DebertaForMaskedLM | 4 | 58.9701 | 46.9701 | 52.6506 | | BlenderbotSmallForCausalLM | 64 | 57.9597 | 46.4331 | 45.9489 | | GPT2ForSequenceClassification | 4 | 87.0995 | 39.0091 | 39.6415 | | Speech2Text2ForCausalLM | 256 | 52.712 | 39.0005 | 38.1518 | | MT5ForConditionalGeneration | 16 | 78.5297 | 38.6327 | 45.9931 | | BlenderbotForCausalLM | 4 | 89.4332 | nan | 78.9196 | | AllenaiLongformerBase | 4 | 182.177 | nan | nan | +-----------------------------------------+-----+----------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/Z1RvHAv.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites - torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) Experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 95%, 53/56 | 98%, 42/43  | 98%, 60/61  |
|       aot_eager        | 91%, 51/56 | 95%, 41/43  | 98%, 60/61  |
|     aot_cudagraphs     | 73%, 41/56 | 72%, 31/43  | 46%, 28/61  |
|    nvprims_nvfuser     | 75%, 42/56 | 60%, 26/43  | 67%, 41/61  |
|        inductor        | 84%, 47/56 | 91%, 39/43  | 93%, 57/61  |
| inductor_no_cudagraphs | 91%, 51/56 | 91%, 39/43  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.02x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.12x    |    1.04x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.02x    |    1.14x    |
|        inductor        |   1.52x    |    1.30x    |    1.24x    |
| inductor_no_cudagraphs |   1.23x    |    1.23x    |    1.24x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.85    |    2.37     |    2.02     |
|       aot_eager        |    5.54    |    7.49     |    7.17     |
|     aot_cudagraphs     |    7.51    |    14.10    |    13.21    |
|    nvprims_nvfuser     |   65.08    |    98.89    |   147.85    |
|        inductor        |   30.45    |    30.45    |    37.25    |
| inductor_no_cudagraphs |   29.58    |    26.36    |    35.77    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.91x    |    1.00x    |    0.95x    |
|        inductor        |   0.83x    |    0.71x    |    0.98x    |
| inductor_no_cudagraphs |   0.97x    |    0.97x    |    1.09x    |
+------------------------+------------+-------------+-------------+

Warnings

We flag models where:
- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings
~~~
+-------------+------------------------+----------+------------------------+
| suite       | name                   | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench  | nvidia_deeprecommender | 0.9033   | 0.9646                 |
| torchbench  | dlrm                   | 0.0      | 1.048                  |
| torchbench  | hf_GPT2_large          | 0.0      | 1.475                  |
| torchbench  | hf_T5                  | 0.0      | 1.5749                 |
| torchbench  | tacotron2              | 0.0      | 0.9272                 |
| torchbench  | hf_BigBird             | 0.0      | 0.0                    |
| torchbench  | hf_Longformer          | 0.0      | 0.0                    |
| torchbench  | moco                   | 0.0      | 0.0                    |
| huggingface | BigBird                | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase  | 0.0      | 0.0                    |
| timm_models | tnt_s_patch16_224      | 0.0      | 1.5448                 |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+------------+-------------------+----------+------------------------+
| suite      | name              | inductor | inductor_no_cudagraphs |
+------------+-------------------+----------+------------------------+
| torchbench | yolov3            | 367.6707 | 365.1366               |
| torchbench | timm_efficientdet | 130.8691 | 126.8008               |
+------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0022                 |
| torchbench  | hf_Albert                               | 0.8836   | 1.2215                 |
| torchbench  | mobilenet_v3_large                      | 0.8829   | 0.896                  |
| torchbench  | hf_T5_large                             | 0.8738   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8621   | 1.031                  |
| torchbench  | densenet121                             | 0.857    | 1.0006                 |
| torchbench  | resnet50                                | 0.8566   | 0.9343                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.1229                 |
| torchbench  | hf_Bart                                 | 0.8325   | 1.1284                 |
| torchbench  | resnext50_32x4d                         | 0.8303   | 0.8352                 |
| torchbench  | BERT_pytorch                            | 0.826    | 1.0815                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7609   | 0.9526                 |
| torchbench  | timm_vision_transformer                 | 0.7507   | 0.8214                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8335                 |
| torchbench  | LearningToPaint                         | 0.7133   | 0.913                  |
| torchbench  | hf_Bert                                 | 0.7061   | 1.0275                 |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | vgg16                                   | 0.6637   | 0.9553                 |
| torchbench  | hf_DistilBert                           | 0.6595   | 0.9466                 |
| torchbench  | hf_Reformer                             | 0.577    | 1.0026                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4867   | 0.6781                 |
| torchbench  | pytorch_struct                          | 0.4222   | 0.4335                 |
| torchbench  | functorch_dp_cifar10                    | 0.4061   | 0.4214                 |
| torchbench  | dcgan                                   | 0.2564   | 0.2576                 |
| torchbench  | dlrm                                    | nan      | 0.7306                 |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4039                 |
| huggingface | T5Small                                 | 0.8564   | 1.0758                 |
| huggingface | PegasusForConditionalGeneration         | 0.8436   | 1.0204                 |
| huggingface | AlbertForMaskedLM                       | 0.842    | 1.3737                 |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.1049                 |
| huggingface | DistillGPT2                             | 0.8169   | 0.9378                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | YituTechConvBert                        | 0.7967   | 0.8799                 |
| huggingface | ElectraForCausalLM                      | 0.7929   | 0.9036                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.9692                 |
| huggingface | BartForConditionalGeneration            | 0.7734   | 0.9515                 |
| huggingface | MT5ForConditionalGeneration             | 0.7628   | 0.9397                 |
| huggingface | M2M100ForConditionalGeneration          | 0.7616   | 1.0075                 |
| huggingface | GoogleFnet                              | 0.7589   | 0.969                  |
| huggingface | MegatronBertForQuestionAnswering        | 0.7528   | 0.9645                 |
| huggingface | CamemBert                               | 0.7485   | 0.9186                 |
| huggingface | PLBartForCausalLM                       | 0.7381   | 0.9055                 |
| huggingface | PLBartForConditionalGeneration          | 0.724    | 0.9375                 |
| huggingface | MBartForConditionalGeneration           | 0.7209   | 0.9059                 |
| huggingface | LayoutLMForSequenceClassification       | 0.7189   | 1.0294                 |
| huggingface | MegatronBertForCausalLM                 | 0.7161   | 0.9247                 |
| huggingface | BartForCausalLM                         | 0.7149   | 0.9466                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.7147   | 0.8647                 |
| huggingface | ElectraForQuestionAnswering             | 0.7054   | 1.0297                 |
| huggingface | DistilBertForQuestionAnswering          | 0.6981   | 0.9303                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977   | 0.946                  |
| huggingface | LayoutLMForMaskedLM                     | 0.695    | 0.9772                 |
| huggingface | MBartForCausalLM                        | 0.6836   | 0.8978                 |
| huggingface | TrOCRForCausalLM                        | 0.6827   | 0.8876                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.6775   | 0.9179                 |
| huggingface | OPTForCausalLM                          | 0.6761   | 0.8845                 |
| huggingface | DistilBertForMaskedLM                   | 0.6531   | 0.9124                 |
| huggingface | BertForMaskedLM                         | 0.6385   | 0.8992                 |
| huggingface | RobertaForCausalLM                      | 0.6375   | 0.8974                 |
| huggingface | BertForQuestionAnswering                | 0.6329   | 0.8939                 |
| huggingface | RobertaForQuestionAnswering             | 0.6329   | 0.8939                 |
| huggingface | MobileBertForMaskedLM                   | 0.5256   | 0.7111                 |
| huggingface | MobileBertForQuestionAnswering          | 0.4536   | 0.5968                 |
| huggingface | DebertaForMaskedLM                      | 0.3862   | 1.0347                 |
| huggingface | DebertaForQuestionAnswering             | 0.2902   | 1.1339                 |
| timm_models | selecsls42b                             | 0.899    | 1.0046                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8931   | 0.9946                 |
| timm_models | res2net50_14w_8s                        | 0.8821   | 1.0206                 |
| timm_models | regnety_002                             | 0.8617   | 1.0396                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.9622                 |
| timm_models | pit_b_224                               | 0.8525   | 1.0754                 |
| timm_models | coat_lite_mini                          | 0.8441   | 1.0596                 |
| timm_models | sebotnet33ts_256                        | 0.841    | 0.9709                 |
| timm_models | resmlp_12_224                           | 0.8169   | 0.8253                 |
| timm_models | gernet_l                                | 0.7928   | 0.9926                 |
| timm_models | convit_base                             | 0.7463   | 0.9008                 |
| timm_models | crossvit_9_240                          | 0.6493   | 0.869                  |
| timm_models | repvgg_a2                               | 0.5319   | 0.8172                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.8623                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~

Metrics over time

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/jqw9Yra.png)

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/BGdkvPT.png)
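The two plots summarize each report as a pass rate and a geometric-mean speedup. As a rough sketch of how such summaries could be derived from one backend's column of a report, the snippet below uses two hypothetical helpers, `geomean_speedup` and `pass_rate`. It assumes that `0.0` and `nan` speedups mark failed runs and are excluded from the geomean, and that any status starting with `pass` counts as passing. Neither assumption is confirmed by the dashboard itself.

```python
import math
from typing import Iterable, Sequence


def geomean_speedup(speedups: Iterable[float]) -> float:
    """Geometric mean of the positive speedups.

    Assumption: 0.0 and nan entries mark failed runs and are skipped.
    """
    valid = [s for s in speedups if not math.isnan(s) and s > 0]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


def pass_rate(statuses: Sequence[str]) -> float:
    """Fraction of models whose accuracy status starts with 'pass'.

    Assumption: 'pass_due_to_skip' counts as passing.
    """
    if not statuses:
        return float("nan")
    passed = sum(1 for s in statuses if s.startswith("pass"))
    return passed / len(statuses)


if __name__ == "__main__":
    # A few inductor entries copied from the torchbench tables in this report.
    inductor_speedups = [5.1109, 1.8653, 0.0]  # densenet121, resnet18, hf_T5
    inductor_statuses = ["pass", "fail_to_run", "pass_due_to_skip", "fail_accuracy"]
    print(f"geomean speedup: {geomean_speedup(inductor_speedups):.4f}")
    print(f"pass rate: {pass_rate(inductor_statuses):.0%}")
```

Skipping failed entries keeps a single `0.0` from collapsing the geometric mean to zero. If the dashboard treats failures differently, the filter in `geomean_speedup` is the only line that needs to change.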

Accuracy Regressions

For each relevant compiler, we compare the two most recent reports in which that compiler actually ran, and list the models whose accuracy test passed in the earlier report but fails in the later one. No accuracy regressions found.
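The comparison described above can be sketched as follows. This is a minimal illustration and not the dashboard's actual code: the `Report` shape, the `find_accuracy_regressions` helper, and the model names in the example are hypothetical, and treating `pass_due_to_skip` as passing is an assumption.

```python
from typing import Dict, List

# Hypothetical report shape: {compiler: {model: accuracy status}}.
# The status strings mirror the ones used in the Accuracy tables.
Report = Dict[str, Dict[str, str]]

# Assumption: a skipped accuracy check counts as passing.
PASSING = {"pass", "pass_due_to_skip"}


def find_accuracy_regressions(previous: Report, current: Report) -> Dict[str, List[str]]:
    """Return, per compiler, the models that passed before and do not pass now."""
    regressions: Dict[str, List[str]] = {}
    for compiler, current_statuses in current.items():
        previous_statuses = previous.get(compiler)
        if previous_statuses is None:
            # The compiler did not run in the earlier report, so there is no baseline.
            continue
        regressed = sorted(
            model
            for model, status in current_statuses.items()
            if previous_statuses.get(model) in PASSING and status not in PASSING
        )
        if regressed:
            regressions[compiler] = regressed
    return regressions


if __name__ == "__main__":
    # Made-up data for illustration only; not taken from any real report.
    previous = {"inductor": {"model_a": "pass", "model_b": "fail_to_run", "model_c": "pass"}}
    current = {"inductor": {"model_a": "pass", "model_b": "fail_to_run", "model_c": "fail_accuracy"}}
    print(find_accuracy_regressions(previous, current) or "No accuracy regressions found.")
```

A model that already failed in the earlier report (`model_b` here) is not reported, because only pass-to-fail transitions count as regressions.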

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0033 | 1.0243 | 2.263 | 0.7818 | 5.1109 | 1.2832 | | timm_efficientdet | 1 | 0.9833 | 0.8904 | 1.8273 | 0.7564 | 4.3275 | 1.5114 | | functorch_dp_cifar10 | 64 | 1.0081 | 1.0247 | 2.0382 | 0.0 | 3.6424 | 1.2047 | | timm_vision_transformer | 8 | 1.0044 | 0.9431 | 1.662 | 0.6528 | 2.6691 | 1.389 | | BERT_pytorch | 16 | 1.0135 | 0.8851 | 1.1157 | 0.954 | 2.1549 | 2.0891 | | resnext50_32x4d | 8 | 1.0009 | 1.1079 | 1.2231 | 0.8065 | 2.066 | 1.1972 | | drq | 1 | 1.0012 | 0.874 | 1.6267 | 0.7358 | 2.0486 | 1.0776 | | mobilenet_v3_large | 32 | 1.0037 | 1.1171 | 1.045 | 0.8505 | 2.0111 | 1.3418 | | dcgan | 32 | 0.9892 | 1.0326 | 1.2603 | 0.8013 | 1.94 | 1.002 | | hf_T5_large | 2 | 1.0239 | 0.9048 | 0.0 | 0.0 | 1.9379 | 1.6708 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9937 | 1.025 | 1.2218 | 0.8593 | 1.8833 | 1.4851 | | resnet18 | 16 | 1.0034 | 1.1176 | 1.1748 | 0.8788 | 1.8653 | 1.2478 | | pytorch_struct | 200 | 0.9964 | 0.7549 | 0.8851 | 0.7785 | 1.8226 | 1.158 | | lennard_jones | 1000 | 0.964 | 0.8697 | 1.0422 | 0.6935 | 1.8112 | 0.9603 | | squeezenet1_1 | 32 | 0.999 | 1.0189 | 1.0535 | 0.8732 | 1.7235 | 1.2624 | | hf_Albert | 8 | 1.0007 | 0.9959 | 0.7528 | 1.5565 | 1.6458 | 1.6383 | | hf_GPT2 | 4 | 1.0115 | 0.9783 | 0.7428 | 0.402 | 1.5698 | 1.5065 | | shufflenet_v2_x1_0 | 128 | 1.0004 | 1.0643 | 0.8148 | 0.9019 | 1.5309 | 1.4076 | | timm_resnest | 32 | 0.9996 | 1.0031 | 0.8046 | 1.1635 | 1.5166 | 1.4534 | | timm_nfnet | 128 | 0.9997 | 1.0002 | 0.0 | 1.1345 | 1.4735 | 1.4201 | | mnasnet1_0 | 32 | 0.9973 | 1.1099 | 0.8665 | 0.9203 | 1.4691 | 1.2478 | | 
soft_actor_critic | 256 | 0.9912 | 0.819 | 1.1013 | 0.702 | 1.4328 | 0.9676 | | mobilenet_v2 | 96 | 0.9998 | 0.9992 | 0.7302 | 1.3367 | 1.4308 | 1.4077 | | fastNLP_Bert | 6 | 0.9985 | 0.9692 | 0.7532 | 1.1565 | 1.4216 | 1.3958 | | speech_transformer | 32 | 1.0019 | 0.8989 | 1.4069 | 0.7629 | 1.4137 | 1.4206 | | mobilenet_v2_quantized_qat | 96 | 1.0014 | 0.9793 | 0.0 | 1.4599 | 1.4042 | 1.408 | | resnet50_quantized_qat | 32 | 1.0011 | 0.9728 | 0.0 | 1.148 | 1.3619 | 1.3631 | | timm_efficientnet | 32 | 0.9543 | 0.8033 | 0.7006 | 0.8161 | 1.3358 | 1.1905 | | LearningToPaint | 96 | 1.0008 | 1.0506 | 0.8676 | 0.9763 | 1.317 | 1.2088 | | resnet152 | 32 | 1.0023 | 1.069 | 0.81 | 0.8975 | 1.3091 | 1.203 | | pytorch_stargan | 16 | 0.9993 | 1.0776 | 0.9388 | 0.0 | 1.2673 | 1.2268 | | hf_Bert | 4 | 1.0297 | 1.0017 | 0.733 | 0.8526 | 1.2164 | 1.1904 | | hf_Bart | 4 | 1.0122 | 0.9759 | 0.7436 | 0.8637 | 1.209 | 1.207 | | pytorch_unet | 1 | 1.0001 | 0.2827 | 0.0 | 0.0 | 1.2074 | 1.1942 | | resnet50 | 32 | 0.9991 | 0.9926 | 0.7598 | 0.9816 | 1.2023 | 1.169 | | hf_DistilBert | 8 | 1.0003 | 0.9567 | 0.6871 | 0.5159 | 1.1735 | 1.1811 | | vgg16 | 64 | 0.9999 | 0.9983 | 0.8587 | 0.9977 | 1.1733 | 1.1676 | | alexnet | 128 | 0.9995 | 0.9975 | 0.8034 | 1.0064 | 1.1617 | 1.1639 | | Super_SloMo | 6 | 0.9996 | 0.2432 | 0.0 | 0.2487 | 1.1487 | 1.1326 | | hf_Reformer | 4 | 0.9982 | 1.0021 | 0.9912 | 0.7338 | 1.1321 | 1.1412 | | timm_regnet | 32 | 0.9656 | 0.9615 | 0.7794 | 1.0981 | 1.1306 | 1.0899 | | yolov3 | 16 | 1.0002 | 0.9947 | 0.791 | 1.1518 | 1.0924 | 1.0783 | | Background_Matting | 4 | 1.0002 | 0.1926 | 0.0 | 0.0 | 1.0817 | 1.0733 | | attention_is_all_you_need_pytorch | 256 | 1.0002 | 0.9703 | 0.757 | 0.9558 | 1.0654 | 1.0513 | | timm_vision_transformer_large | 8 | 1.0001 | 0.9939 | 0.0 | 0.0 | 1.0468 | 1.0325 | | tts_angular | 64 | 0.9901 | 0.9611 | 0.9927 | 0.972 | 1.0136 | 1.0101 | | timm_vovnet | 32 | 0.9113 | 0.9037 | 0.7127 | 0.9026 | 1.0064 | 1.0288 | | demucs | 4 | 1.0002 | 
0.9992 | 1.0005 | 0.9995 | 0.9992 | 1.0003 | | nvidia_deeprecommender | 256 | 0.9995 | 0.9627 | 0.5848 | 0.9759 | 0.9033 | 0.9646 | | dlrm | 2048 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.048 | | hf_GPT2_large | 4 | 0.9999 | 0.9803 | 0.0 | 0.0 | 0.0 | 1.475 | | hf_T5 | 8 | 0.9997 | 0.9477 | 0.0 | 1.1691 | 0.0 | 1.5749 | | tacotron2 | 64 | 0.9819 | 0.8715 | 0.0 | 0.7723 | 0.0 | 0.9272 | | hf_BigBird | 2 | 0.9821 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass 
| pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | resnet152 | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | pass | pass | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | fail_to_run | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | hf_Albert | 2 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass 
| pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | pass | pass | 0.0000 | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | 0.0000 | fail_accuracy | fail_accuracy | fail_accuracy | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 3.03 | 7.1394 | 10.2538 | 116.9277 | 367.6707 | 365.1366 | | timm_efficientdet | 1 | 19.623 
| 33.1364 | 66.8026 | 488.6501 | 130.8691 | 126.8008 | | hf_T5_large | 2 | 14.1664 | 34.3923 | nan | nan | 112.567 | 111.4834 | | timm_vision_transformer_large | 8 | 2.5323 | 11.3254 | nan | nan | 54.104 | 52.1028 | | resnet152 | 32 | 2.4927 | 11.009 | 17.8597 | 191.2202 | 44.1625 | 41.6233 | | densenet121 | 4 | 2.1745 | 9.9688 | 15.7574 | 167.0773 | 44.0329 | 42.7987 | | attention_is_all_you_need_pytorch | 256 | 1.2174 | 5.6129 | 8.9467 | 112.5204 | 42.9197 | 41.8631 | | timm_resnest | 32 | 0.5908 | 2.0409 | 3.0736 | 59.7818 | 40.442 | 38.7649 | | timm_vision_transformer | 8 | 0.833 | 3.3828 | 5.0803 | 72.7308 | 31.2301 | 30.4972 | | timm_nfnet | 128 | 2.0669 | 6.1874 | nan | 155.4066 | 29.8935 | 28.3322 | | hf_Bart | 4 | 1.684 | 6.8645 | 10.5241 | 120.0428 | 29.3632 | 28.3382 | | mobilenet_v2_quantized_qat | 96 | 1.3294 | 7.4437 | nan | 180.8806 | 28.3847 | 28.1299 | | resnet50_quantized_qat | 32 | 1.2298 | 7.0649 | nan | 168.1425 | 28.1922 | 28.1628 | | BERT_pytorch | 16 | 1.5471 | 6.0392 | 9.0476 | 81.3306 | 27.2319 | 27.187 | | fastNLP_Bert | 6 | 1.5476 | 5.4979 | 8.6635 | 83.1036 | 26.7412 | 25.2428 | | speech_transformer | 32 | 1.6365 | 7.0398 | 26.6333 | 134.1765 | 24.533 | 22.7907 | | timm_regnet | 32 | 2.2716 | 6.6273 | 17.1212 | 110.9387 | 24.1519 | 23.0537 | | mobilenet_v3_large | 32 | 0.9124 | 3.9052 | 5.8051 | 101.1538 | 23.6143 | 22.8111 | | pytorch_stargan | 16 | 0.4481 | 1.6989 | 2.4593 | nan | 23.2963 | 24.2148 | | timm_efficientnet | 32 | 1.7705 | 5.5896 | 13.7786 | 109.2714 | 23.2926 | 22.8452 | | pytorch_struct | 200 | 0.2817 | 0.6308 | 1.1767 | 4.2436 | 22.5655 | 21.7685 | | hf_Bert | 4 | 1.5802 | 5.2467 | 7.7618 | 83.7327 | 19.3026 | 18.5036 | | Super_SloMo | 6 | 1.0847 | 6.6233 | nan | 56.5804 | 18.9826 | 18.1727 | | mnasnet1_0 | 32 | 0.839 | 3.6368 | 5.314 | 70.2739 | 18.3489 | 18.6202 | | hf_Reformer | 4 | 1.5962 | 2.6544 | 4.9053 | 14.987 | 18.2909 | 15.5461 | | timm_vovnet | 32 | 1.5198 | 3.8565 | 8.8929 | 57.0786 | 18.0133 | 17.3124 | | 
shufflenet_v2_x1_0 | 128 | 0.9772 | 4.3215 | 6.3045 | 86.5996 | 17.9279 | 17.1845 | | resnet50 | 32 | 0.9161 | 3.7339 | 5.5744 | 78.1275 | 17.8483 | 17.4419 | | hf_GPT2 | 4 | 1.5071 | 4.9815 | 7.6062 | 64.2341 | 17.8148 | 16.9264 | | resnext50_32x4d | 8 | 0.908 | 3.8272 | 5.5881 | 64.8387 | 17.5041 | 16.694 | | hf_Albert | 8 | 1.2689 | 4.664 | 7.351 | 113.6915 | 17.2302 | 16.8024 | | Background_Matting | 4 | 0.7532 | 7.7985 | nan | nan | 16.8559 | 16.0203 | | mobilenet_v2 | 96 | 0.8273 | 3.7498 | 5.9868 | 97.1732 | 16.5429 | 16.3861 | | functorch_dp_cifar10 | 64 | 0.3188 | 1.1095 | 1.7829 | nan | 15.0116 | 15.1136 | | hf_DistilBert | 8 | 0.6578 | 2.6352 | 4.7367 | 44.775 | 11.9286 | 11.3756 | | resnet18 | 16 | 0.4417 | 1.5049 | 2.1576 | 29.9042 | 11.042 | 10.1848 | | pytorch_unet | 1 | 0.4582 | 2.6343 | nan | nan | 8.5066 | 8.4558 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.409 | 1.604 | 2.3073 | 31.0404 | 8.1424 | 7.993 | | LearningToPaint | 96 | 0.4444 | 1.5634 | 2.3989 | 38.9928 | 7.4861 | 6.5689 | | dcgan | 32 | 0.1735 | 0.365 | 0.5733 | 4.3724 | 5.9785 | 5.7327 | | drq | 1 | 0.3072 | 0.5156 | 0.7691 | 4.5227 | 4.0342 | 3.4847 | | squeezenet1_1 | 32 | 0.2339 | 0.6476 | 1.0717 | 4.3582 | 3.978 | 3.737 | | soft_actor_critic | 256 | 0.2101 | 0.3091 | 0.5206 | 1.6129 | 3.4603 | 2.7222 | | vgg16 | 64 | 0.1793 | 0.4816 | 0.8886 | 3.0957 | 3.4444 | 3.1995 | | nvidia_deeprecommender | 256 | 0.2041 | 0.3908 | 0.6318 | 4.5073 | 3.4005 | 3.0695 | | alexnet | 128 | 0.1477 | 0.3166 | 0.5731 | 2.9158 | 3.1142 | 2.6927 | | lennard_jones | 1000 | 0.1414 | 0.2468 | 0.3929 | 1.3723 | 2.0295 | 1.774 | | tts_angular | 64 | 0.1715 | 0.2153 | 0.3455 | 1.0773 | 1.9416 | 1.7838 | | demucs | 4 | 0.3003 | 0.3015 | 0.3003 | 0.3028 | 0.2096 | 0.2095 | | hf_GPT2_large | 4 | 5.2428 | 15.5995 | nan | nan | nan | 44.6558 | | tacotron2 | 64 | 4.7643 | 13.9678 | nan | 42.774 | nan | 42.2314 | | hf_T5 | 8 | 2.4745 | 7.8787 | nan | 76.5157 | nan | 27.2094 | | dlrm | 2048 | nan | nan | nan | nan | 
nan | 2.9248 | | hf_BigBird | 2 | 3.5025 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.582 | 1.582 | | resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.2258 | 1.4874 | 1.4877 | | timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2635 | 0.988 | 1.3107 | 1.3923 | | mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1743 | 1.2832 | | timm_efficientdet | 1 | 1.0111 | 0.823 | 0.289 | 1.1368 | 1.1162 | 1.1442 | | Super_SloMo | 6 | 1.0024 | 0.902 | nan | 0.9455 | 1.1137 | 1.3409 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 1.0823 | 1.1864 | | speech_transformer | 32 | 0.9982 | 0.9772 | 0.2737 | 1.1209 | 1.0376 | 1.0454 | | timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.7594 | 1.0219 | 1.0958 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 0.9821 | 1.0223 | | hf_GPT2 | 4 | 1.0 | 0.906 | 0.3702 | 1.1242 | 0.9703 | 1.1698 | | timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9404 | 1.0831 | | Background_Matting | 4 | 0.9998 | 0.8154 | nan | nan | 0.9342 | 1.0395 | | yolov3 | 16 | 0.9957 | 0.844 | 0.3341 | 0.8549 | 0.9236 | 1.1052 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 
0.9162 | 0.392 | 0.8945 | 0.9183 | 0.9986 | | pytorch_unet | 1 | 0.9985 | 0.8222 | nan | nan | 0.9117 | 1.105 | | resnet152 | 32 | 0.9975 | 0.9153 | 0.3424 | 0.8736 | 0.9067 | 0.9672 | | pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 | | timm_resnest | 32 | 0.9935 | 0.88 | 0.3236 | 0.7926 | 0.8982 | 1.0022 | | hf_Albert | 8 | 1.0 | 0.949 | 0.2846 | 1.062 | 0.8836 | 1.2215 | | mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3279 | 0.8098 | 0.8829 | 0.896 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8738 | 0.922 | | timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 | | densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.857 | 1.0006 | | resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8566 | 0.9343 | | mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8531 | 0.8659 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3385 | 1.2124 | 0.8354 | 1.1229 | | hf_Bart | 4 | 1.0 | 0.8779 | 0.3388 | 1.0865 | 0.8325 | 1.1284 | | resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8196 | 0.8303 | 0.8352 | | BERT_pytorch | 16 | 1.0 | 0.8995 | 0.3503 | 1.1284 | 0.826 | 1.0815 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 | | timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7609 | 0.9526 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3306 | 1.0652 | 0.7507 | 0.8214 | | soft_actor_critic | 256 | 0.9998 | 0.9638 | 0.4356 | 0.9637 | 0.75 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8335 | | LearningToPaint | 96 | 0.9442 | 0.716 | 0.3385 | 0.6507 | 0.7133 | 0.913 | | hf_Bert | 4 | 1.0 | 0.9011 | 0.3525 | 1.0004 | 0.7061 | 1.0275 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6902 | 0.7049 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9553 | | hf_DistilBert | 8 | 1.0 | 0.9042 | 0.3212 | 1.0228 | 0.6595 | 0.9466 | | hf_Reformer | 4 | 0.9999 | 0.9996 | 0.5934 | 0.9995 | 0.577 | 1.0026 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 
| 0.9995 | 0.5646 | 0.9989 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9676 | 0.4867 | 0.6781 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4222 | 0.4335 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4061 | 0.4214 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.2564 | 0.2576 | | hf_GPT2_large | 4 | 1.0 | 0.8833 | nan | nan | nan | 1.1831 | | tacotron2 | 64 | 0.9903 | 1.0926 | nan | 1.114 | nan | 1.1613 | | hf_T5 | 8 | 1.0 | 0.9415 | nan | 0.9432 | nan | 1.1507 | | dlrm | 2048 | nan | nan | nan | nan | nan | 0.7306 | | hf_BigBird | 2 | 0.907 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | timm_vision_transformer_large | 8 | 197.4191 | 198.8644 | nan | nan | 189.0686 | 191.1463 | | Background_Matting | 4 | 186.2605 | 967.0787 | nan | nan | 172.2964 | 173.715 | | timm_nfnet | 128 | 205.7675 | 205.9949 | nan | 181.6992 | 139.5623 | 144.756 | | hf_T5_large | 2 | 190.281 | 214.6148 | nan | nan | 118.2186 | 122.0592 | | mobilenet_v2_quantized_qat | 96 | 146.8292 | 151.142 | nan | 100.9128 | 105.2412 | 105.0999 | | Super_SloMo | 6 | 117.4759 | 482.2122 | nan | 471.7983 | 102.3455 | 103.4706 | | yolov3 | 16 | 102.048 | 102.4473 | 129.1563 | 88.62 | 93.5148 | 94.5516 | | vgg16 | 64 | 
106.1451 | 106.1679 | 123.5529 | 106.2768 | 90.6065 | 91.0218 | | timm_regnet | 32 | 101.7089 | 101.6763 | 125.3839 | 88.7742 | 86.8458 | 90.0405 | | demucs | 4 | 77.5814 | 77.7687 | 77.5399 | 77.5985 | 77.7705 | 77.5694 | | resnet152 | 32 | 90.4763 | 85.4235 | 113.126 | 101.6596 | 73.6223 | 76.0897 | | hf_Reformer | 4 | 83.2578 | 83.0745 | 84.1505 | 113.2567 | 73.451 | 72.7982 | | resnet50_quantized_qat | 32 | 93.6861 | 96.0437 | nan | 81.528 | 69.6771 | 68.6344 | | attention_is_all_you_need_pytorch | 256 | 71.9535 | 73.9585 | 95.4352 | 75.1951 | 67.7388 | 68.5049 | | mobilenet_v2 | 96 | 71.2652 | 71.3449 | 97.5965 | 53.2635 | 49.8452 | 50.6675 | | pytorch_unet | 1 | 58.4169 | 206.627 | nan | nan | 48.4191 | 48.9639 | | hf_Bart | 4 | 54.4045 | 56.5979 | 74.4211 | 62.8877 | 45.7442 | 45.8791 | | hf_Albert | 8 | 74.9764 | 75.3414 | 99.8056 | 48.2598 | 45.6286 | 45.6781 | | fastNLP_Bert | 6 | 59.5822 | 61.8963 | 79.2684 | 51.3777 | 42.195 | 42.7574 | | speech_transformer | 32 | 49.3072 | 59.8413 | 35.5929 | 70.8486 | 39.7222 | 35.8781 | | timm_vovnet | 32 | 42.1781 | 42.6064 | 54.0008 | 42.592 | 38.2002 | 37.694 | | hf_GPT2 | 4 | 50.3774 | 51.0753 | 68.2263 | 124.7687 | 33.5795 | 33.396 | | hf_DistilBert | 8 | 38.8299 | 40.5414 | 56.5597 | 75.3031 | 33.0871 | 32.8825 | | hf_Bert | 4 | 37.9865 | 39.1605 | 53.4334 | 45.4847 | 32.4566 | 32.9753 | | timm_efficientdet | 1 | 140.984 | 153.9809 | 76.4181 | 186.4872 | 32.3692 | 94.3437 | | timm_efficientnet | 32 | 44.3179 | 55.9238 | 61.1327 | 52.0426 | 32.1713 | 36.463 | | resnet50 | 32 | 38.7145 | 38.8686 | 50.9664 | 39.3985 | 32.1241 | 33.0876 | | shufflenet_v2_x1_0 | 128 | 36.991 | 35.0723 | 45.8095 | 43.0224 | 24.4102 | 26.397 | | BERT_pytorch | 16 | 46.6255 | 62.6265 | 42.0364 | 51.0158 | 22.7401 | 23.0293 | | timm_resnest | 32 | 31.5602 | 31.5416 | 39.2934 | 27.1315 | 20.8029 | 21.7292 | | mnasnet1_0 | 32 | 29.6966 | 26.9879 | 33.1361 | 30.8906 | 19.4303 | 23.7434 | | pytorch_stargan | 16 | 24.174 | 22.4611 | 26.0504 
| nan | 19.0789 | 19.6583 | | mobilenet_v3_large | 32 | 31.5778 | 28.4827 | 30.5181 | 39.5074 | 16.0792 | 24.2444 | | densenet121 | 4 | 64.2916 | 64.2707 | 29.0432 | 84.481 | 14.1145 | 52.3057 | | resnext50_32x4d | 8 | 26.4009 | 24.5575 | 22.133 | 33.4804 | 13.4385 | 22.9064 | | LearningToPaint | 96 | 15.6825 | 14.8179 | 18.1378 | 16.1528 | 12.4732 | 13.0103 | | alexnet | 128 | 12.3971 | 12.4475 | 15.4456 | 12.3631 | 10.6694 | 10.7001 | | nvidia_deeprecommender | 256 | 8.5233 | 8.8471 | 14.5632 | 8.73 | 9.4137 | 8.8355 | | timm_vision_transformer | 8 | 23.4338 | 25.6266 | 15.9693 | 36.6049 | 9.3888 | 17.9272 | | tts_angular | 64 | 9.3574 | 9.6501 | 9.2643 | 9.7015 | 9.3169 | 9.2088 | | pytorch_CycleGAN_and_pix2pix | 1 | 16.4722 | 16.3832 | 13.5778 | 19.4573 | 9.2124 | 11.4414 | | squeezenet1_1 | 32 | 12.604 | 12.6038 | 12.07 | 14.5528 | 7.3222 | 10.0911 | | resnet18 | 16 | 11.955 | 10.9777 | 10.3279 | 13.877 | 6.5845 | 9.8657 | | functorch_dp_cifar10 | 64 | 11.5488 | 11.2929 | 6.2927 | nan | 3.2148 | 9.6721 | | pytorch_struct | 200 | 3.8161 | 4.9852 | 4.2604 | 4.9076 | 2.1027 | 3.333 | | drq | 1 | 3.007 | 3.3675 | 1.8979 | 5.0316 | 1.8966 | 2.8164 | | dcgan | 32 | 2.6983 | 2.6049 | 2.116 | 3.399 | 1.35 | 2.6351 | | soft_actor_critic | 256 | 0.9897 | 1.2979 | 0.9561 | 1.4826 | 0.7576 | 1.2632 | | lennard_jones | 1000 | 1.1188 | 1.2652 | 1.0828 | 1.5772 | 0.6459 | 1.2043 | | tacotron2 | 64 | 2823.0305 | 3104.347 | nan | 3494.5548 | nan | 3067.7252 | | dlrm | 2048 | nan | nan | nan | nan | nan | 486.2294 | | hf_GPT2_large | 4 | 240.853 | 245.2214 | nan | nan | nan | 163.069 | | hf_T5 | 8 | 182.8177 | 192.7472 | nan | 156.4829 | nan | 116.2336 | | hf_BigBird | 2 | 220.492 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

huggingface suite with float32 precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | YituTechConvBert | 1 | 1.0269 | 0.9347 | 1.8341 | 0.0 | 3.6429 | 1.4518 | | CamemBert | 1 | 1.0519 | 0.9721 | 1.3366 | 0.0 | 2.4269 | 1.5271 | | MT5ForConditionalGeneration | 8 | 0.9846 | 0.9136 | 1.1973 | 1.0013 | 2.308 | 2.0084 | | DistillGPT2 | 1 | 1.0318 | 0.9563 | 1.0513 | 0.0 | 2.0698 | 1.8972 | | MobileBertForMaskedLM | 32 | 1.0229 | 0.9098 | 1.1165 | 0.0 | 1.9601 | 1.5341 | | GoogleFnet | 1 | 0.9778 | 0.8087 | 0.9934 | 0.0 | 1.9229 | 1.0891 | | GPT2ForSequenceClassification | 4 | 1.0003 | 0.9781 | 0.0 | 0.6996 | 1.7988 | 1.7841 | | T5ForConditionalGeneration | 4 | 0.995 | 0.936 | 0.7258 | 1.1028 | 1.4479 | 1.4197 | | ElectraForQuestionAnswering | 64 | 1.0002 | 0.9833 | 0.0 | 1.1947 | 1.4264 | 1.4083 | | ElectraForCausalLM | 32 | 1.0005 | 0.9325 | 0.0 | 1.0068 | 1.4131 | 1.4511 | | MobileBertForQuestionAnswering | 64 | 1.0218 | 0.9134 | 0.8532 | 0.0 | 1.4019 | 1.3218 | | M2M100ForConditionalGeneration | 8 | 1.0105 | 0.997 | 0.8473 | 0.8516 | 1.3915 | 1.5134 | | LayoutLMForSequenceClassification | 16 | 1.0002 | 0.9893 | 0.7384 | 1.1177 | 1.3013 | 1.2902 | | T5Small | 1 | 1.0126 | 0.9273 | 0.9898 | 0.9624 | 1.281 | 1.2102 | | AlbertForQuestionAnswering | 4 | 0.9995 | 1.0013 | 0.0 | 1.229 | 1.2543 | 1.2518 | | AlbertForMaskedLM | 4 | 1.0002 | 1.0002 | 0.0 | 1.2229 | 1.2515 | 1.2473 | | LayoutLMForMaskedLM | 16 | 1.0001 | 0.9711 | 0.0 | 1.0632 | 1.2091 | 1.2173 | | PLBartForConditionalGeneration | 16 | 1.0145 | 0.9693 | 0.8101 | 0.7952 | 1.1902 | 1.2036 | | OPTForCausalLM | 32 | 1.0064 | 0.9316 | 0.7139 | 0.4658 | 1.1862 | 1.2015 | | XGLMForCausalLM | 8 | 
1.0123 | 0.8792 | 0.7347 | 0.3136 | 1.1793 | 1.181 | | DistilBertForQuestionAnswering | 64 | 1.0002 | 0.9872 | 0.7126 | 0.5065 | 1.1714 | 1.1507 | | RobertaForCausalLM | 64 | 1.0004 | 0.953 | 0.7457 | 0.9722 | 1.143 | 1.1524 | | MegatronBertForQuestionAnswering | 16 | 1.0385 | 1.0153 | 0.7584 | 0.8378 | 1.1391 | 1.1187 | | MegatronBertForCausalLM | 16 | 1.0341 | 0.9912 | 0.7439 | 0.9065 | 1.1373 | 1.1224 | | Speech2Text2ForCausalLM | 128 | 0.9974 | 0.9259 | 0.6602 | 0.9294 | 1.1183 | 1.1542 | | RobertaForQuestionAnswering | 128 | 1.0003 | 0.9847 | 0.0 | 1.0263 | 1.1149 | 1.109 | | BertForQuestionAnswering | 128 | 0.9995 | 0.9928 | 0.0 | 1.0249 | 1.1119 | 1.1057 | | BartForCausalLM | 4 | 1.0005 | 0.9669 | 0.7548 | 1.0002 | 1.1005 | 1.1104 | | BartForConditionalGeneration | 2 | 1.0006 | 0.9883 | 0.0 | 0.4452 | 1.0986 | 1.0905 | | MBartForConditionalGeneration | 16 | 1.0107 | 0.9863 | 0.7599 | 0.8839 | 1.0888 | 1.1583 | | PegasusForConditionalGeneration | 16 | 1.0099 | 0.9777 | 0.755 | 0.7713 | 1.0874 | 1.0782 | | DebertaForMaskedLM | 4 | 0.9098 | 0.7929 | 0.7296 | 0.6353 | 1.0868 | 1.0447 | | BlenderbotSmallForConditionalGeneration | 64 | 1.003 | 0.9386 | 0.0 | 0.9541 | 1.0659 | 1.0796 | | BertForMaskedLM | 64 | 1.0002 | 0.9605 | 0.7305 | 0.9702 | 1.0591 | 1.0634 | | DistilBertForMaskedLM | 64 | 0.9999 | 0.9516 | 0.7125 | 0.6211 | 1.0508 | 1.0668 | | DebertaForQuestionAnswering | 8 | 0.9972 | 0.9727 | 0.6835 | 0.8558 | 1.0498 | 1.2287 | | PLBartForCausalLM | 32 | 1.005 | 0.921 | 0.7176 | 0.8973 | 1.0283 | 1.0557 | | TrOCRForCausalLM | 32 | 1.0012 | 0.9537 | 0.7344 | 0.9551 | 1.0052 | 1.0147 | | BlenderbotSmallForCausalLM | 64 | 1.0008 | 0.9097 | 0.6839 | 0.9194 | 1.0033 | 1.0426 | | MBartForCausalLM | 32 | 1.0019 | 0.9512 | 0.7336 | 0.9541 | 1.0011 | 1.0099 | | PegasusForCausalLM | 32 | 0.9998 | 0.9536 | 0.7303 | 0.9507 | 0.994 | 1.005 | | BigBird | 1 | 0.9766 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 4.7498 | 10.0114 | 33.6842 | 75.8902 | 93.3984 | 37.1203 |
| DebertaForMaskedLM | 4 | 4.7526 | 10.0048 | 33.5388 | 77.4193 | 92.1416 | 36.4102 |
| XGLMForCausalLM | 8 | 2.6328 | 10.0896 | 21.5946 | 200.1793 | 69.9739 | 66.9029 |
| M2M100ForConditionalGeneration | 8 | 3.4905 | 12.4291 | 21.7433 | 243.5563 | 58.8678 | 54.6402 |
| MobileBertForMaskedLM | 32 | 8.8832 | 23.8534 | 40.6914 | nan | 52.182 | 51.796 |
| MobileBertForQuestionAnswering | 64 | 8.7427 | 23.5805 | 40.9141 | nan | 51.6119 | 51.6484 |
| BartForConditionalGeneration | 2 | 3.3084 | 12.5708 | nan | 283.8415 | 45.4687 | 43.2121 |
| PegasusForConditionalGeneration | 16 | 3.0657 | 12.1594 | 20.3971 | 322.9576 | 44.3806 | 40.7387 |
| MBartForConditionalGeneration | 16 | 3.2325 | 12.6816 | 21.4294 | 324.6583 | 44.2035 | 44.3241 |
| YituTechConvBert | 1 | 2.4208 | 8.0654 | 12.4506 | nan | 41.1962 | 37.2696 |
| MegatronBertForCausalLM | 16 | 3.4364 | 11.2485 | 16.7143 | 210.1677 | 35.1425 | 34.2819 |
| MegatronBertForQuestionAnswering | 16 | 3.3426 | 10.779 | 16.6759 | 204.5266 | 34.238 | 33.0072 |
| MT5ForConditionalGeneration | 8 | 3.7957 | 11.1262 | 17.6717 | 128.4433 | 33.3835 | 32.3533 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.1853 | 8.3136 | nan | 165.5247 | 30.55 | 30.0823 |
| T5ForConditionalGeneration | 4 | 2.4795 | 7.713 | 11.4957 | 79.5264 | 29.5398 | 28.8984 |
| T5Small | 1 | 2.5503 | 7.5549 | 11.3902 | 79.2349 | 28.6848 | 27.927 |
| LayoutLMForSequenceClassification | 16 | 1.9713 | 5.8728 | 8.7613 | 88.6541 | 27.9 | 27.2805 |
| PLBartForConditionalGeneration | 16 | 1.7295 | 6.6235 | 9.8856 | 121.3307 | 27.0267 | 26.238 |
| GoogleFnet | 1 | 0.9295 | 2.9373 | 9.0525 | nan | 26.8773 | 14.8836 |
| ElectraForCausalLM | 32 | 1.5864 | 5.4599 | nan | 85.3096 | 26.0057 | 24.3383 |
| PegasusForCausalLM | 32 | 1.245 | 4.8927 | 8.2641 | 84.8352 | 21.3799 | 19.96 |
| MBartForCausalLM | 32 | 1.1835 | 4.8346 | 7.2349 | 89.6149 | 21.1678 | 20.8854 |
| LayoutLMForMaskedLM | 16 | 1.9476 | 5.7948 | nan | 87.5917 | 21.1508 | 21.2383 |
| BertForMaskedLM | 64 | 1.5845 | 5.5125 | 8.1201 | 88.4292 | 20.3109 | 20.0982 |
| TrOCRForCausalLM | 32 | 1.2659 | 4.8832 | 7.3209 | 76.8845 | 20.0643 | 19.2345 |
| ElectraForQuestionAnswering | 64 | 1.6246 | 5.349 | nan | 82.6525 | 19.9807 | 19.5349 |
| BertForQuestionAnswering | 128 | 1.6577 | 5.3978 | nan | 86.8752 | 19.661 | 19.4686 |
| RobertaForCausalLM | 64 | 1.6105 | 5.4107 | 8.2117 | 87.5133 | 19.5769 | 19.018 |
| BartForCausalLM | 4 | 1.2631 | 4.8035 | 7.2817 | 83.1776 | 19.4487 | 18.8719 |
| OPTForCausalLM | 32 | 1.2702 | 5.1114 | 8.8299 | 76.1224 | 18.9621 | 18.7761 |
| RobertaForQuestionAnswering | 128 | 1.7332 | 5.4389 | nan | 83.2797 | 18.8002 | 19.0292 |
| CamemBert | 1 | 1.6642 | 5.3622 | 7.6165 | nan | 18.3876 | 18.3447 |
| GPT2ForSequenceClassification | 4 | 1.509 | 5.2655 | nan | 63.8143 | 16.7833 | 16.0864 |
| AlbertForMaskedLM | 4 | 1.3833 | 4.7577 | nan | 113.1694 | 16.2636 | 15.5695 |
| AlbertForQuestionAnswering | 4 | 1.4279 | 4.7583 | nan | 110.7975 | 16.1949 | 15.1135 |
| Speech2Text2ForCausalLM | 128 | 0.7544 | 2.6407 | 4.2724 | 36.8125 | 14.938 | 13.3645 |
| BlenderbotSmallForCausalLM | 64 | 0.8779 | 3.1873 | 4.9707 | 55.4187 | 14.6735 | 14.0243 |
| PLBartForCausalLM | 32 | 0.6927 | 2.6042 | 3.8783 | 43.1568 | 13.5054 | 13.0329 |
| DistillGPT2 | 1 | 0.8413 | 2.683 | 3.8984 | nan | 12.1372 | 12.3481 |
| DistilBertForMaskedLM | 64 | 0.7043 | 2.5572 | 4.4707 | 42.698 | 11.5966 | 11.0089 |
| DistilBertForQuestionAnswering | 64 | 0.7401 | 2.7404 | 4.5525 | 41.1086 | 11.0583 | 10.432 |
| BigBird | 1 | 3.4394 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 1.0 | 0.9092 | nan | 1.1724 | 1.0595 | 1.1588 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9155 | 0.3432 | 0.934 | 0.8564 | 1.0758 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | 1.0877 | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | 0.842 | 1.3737 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8215 | 1.1049 |
| DistillGPT2 | 1 | 0.9986 | 0.8218 | 0.3792 | nan | 0.8169 | 0.9378 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | 0.9742 | 0.8157 | 0.9642 |
| YituTechConvBert | 1 | 0.9858 | 0.8608 | 0.3687 | nan | 0.7967 | 0.8799 |
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.8429 | 0.7929 | 0.9036 |
| PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | 1.0402 | 0.7774 | 0.9692 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.7734 | 0.9515 |
| MT5ForConditionalGeneration | 8 | 1.0034 | 0.8867 | 0.4149 | 0.9322 | 0.7628 | 0.9397 |
| M2M100ForConditionalGeneration | 8 | 0.9799 | 0.9718 | 0.3674 | 1.0255 | 0.7616 | 1.0075 |
| GoogleFnet | 1 | 0.9629 | 0.9629 | 0.3851 | nan | 0.7589 | 0.969 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | 0.9933 | 0.7528 | 0.9645 |
| CamemBert | 1 | 0.998 | 0.8248 | 0.3614 | nan | 0.7485 | 0.9186 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.9443 | 0.7381 | 0.9055 |
| PLBartForConditionalGeneration | 16 | 1.0 | 0.8954 | 0.3582 | 1.0147 | 0.724 | 0.9375 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | 0.3438 | 0.9566 | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.7189 | 1.0294 |
| MegatronBertForCausalLM | 16 | 1.0 | 0.8826 | 0.352 | 1.0007 | 0.7161 | 0.9247 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 1.1607 | 0.7054 | 1.0297 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3177 | 1.1317 | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 1.0067 | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3743 | 1.0014 | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | 0.3743 | 0.9997 | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.871 | 0.6775 | 0.9179 |
| OPTForCausalLM | 32 | 0.9999 | 0.8656 | 0.3605 | 0.9158 | 0.6761 | 0.8845 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.888 | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.9904 | 0.6385 | 0.8992 |
| RobertaForCausalLM | 64 | 1.0 | 0.9206 | 0.3642 | 0.989 | 0.6375 | 0.8974 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | 0.6329 | 0.8939 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | 0.3242 | nan | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | 0.2587 | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | 0.9719 | 0.3862 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9637 | 1.042 | 0.3072 | 1.1342 | 0.2902 | 1.1339 |
| BigBird | 1 | 0.9548 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)

~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForMaskedLM | 4 | 385.9631 | 386.0535 | nan | 315.6099 | 309.593 | 309.8842 |
| AlbertForQuestionAnswering | 4 | 383.8558 | 383.443 | nan | 312.0714 | 306.4321 | 306.9933 |
| BartForConditionalGeneration | 2 | 149.609 | 151.6116 | nan | 337.1213 | 136.6005 | 137.3327 |
| RobertaForQuestionAnswering | 128 | 148.4362 | 150.8139 | nan | 144.6478 | 133.4095 | 134.9019 |
| BertForQuestionAnswering | 128 | 147.9437 | 148.8398 | nan | 144.2513 | 133.3403 | 134.1412 |
| LayoutLMForMaskedLM | 16 | 136.6374 | 140.8118 | nan | 128.5482 | 113.2306 | 112.9685 |
| BartForCausalLM | 4 | 123.2317 | 127.3521 | 163.4513 | 123.2182 | 112.0568 | 110.8998 |
| BlenderbotSmallForConditionalGeneration | 64 | 119.6959 | 126.8909 | nan | 124.8833 | 111.9753 | 111.2172 |
| MBartForConditionalGeneration | 16 | 104.3604 | 106.0527 | 138.4772 | 117.7401 | 95.9744 | 96.914 |
| PegasusForConditionalGeneration | 16 | 102.5138 | 106.4827 | 137.5477 | 133.4437 | 95.9582 | 96.5637 |
| MobileBertForQuestionAnswering | 64 | 133.0374 | 178.9004 | 153.7876 | nan | 95.6426 | 108.0456 |
| BertForMaskedLM | 64 | 100.726 | 104.9907 | 138.3451 | 103.8457 | 95.5908 | 94.8313 |
| RobertaForCausalLM | 64 | 108.962 | 114.4917 | 146.6265 | 112.0951 | 95.5831 | 94.5708 |
| ElectraForQuestionAnswering | 64 | 124.8392 | 127.6287 | nan | 104.4846 | 87.4753 | 88.6287 |
| LayoutLMForSequenceClassification | 16 | 113.1101 | 114.393 | 153.3641 | 101.2147 | 87.049 | 87.7052 |
| PegasusForCausalLM | 32 | 85.4005 | 89.4922 | 117.6231 | 89.7746 | 86.2867 | 84.9673 |
| MBartForCausalLM | 32 | 85.7864 | 90.1548 | 117.2631 | 89.9408 | 86.0977 | 84.9623 |
| TrOCRForCausalLM | 32 | 86.1899 | 90.0138 | 117.2156 | 89.8502 | 85.7655 | 84.6265 |
| DebertaForQuestionAnswering | 8 | 81.8612 | 83.7858 | 119.653 | 95.5904 | 77.821 | 67.1733 |
| ElectraForCausalLM | 32 | 105.7143 | 113.264 | nan | 105.201 | 74.8852 | 72.8859 |
| T5ForConditionalGeneration | 4 | 104.1962 | 111.4074 | 143.5999 | 94.3935 | 71.9121 | 72.8565 |
| MegatronBertForCausalLM | 16 | 78.7853 | 86.5794 | 108.1397 | 88.6837 | 71.0352 | 71.918 |
| XGLMForCausalLM | 8 | 78.5515 | 93.0205 | 108.5739 | 254.5738 | 68.2901 | 67.9558 |
| MobileBertForMaskedLM | 32 | 134.9397 | 179.644 | 116.4847 | nan | 67.5944 | 90.816 |
| MegatronBertForQuestionAnswering | 16 | 72.1906 | 73.2255 | 98.5145 | 88.6731 | 65.3611 | 66.5916 |
| M2M100ForConditionalGeneration | 8 | 85.9906 | 94.4832 | 101.3928 | 99.217 | 64.3557 | 68.5516 |
| BlenderbotSmallForCausalLM | 64 | 64.9138 | 70.9761 | 94.4415 | 70.3597 | 64.2318 | 61.9627 |
| DistilBertForMaskedLM | 64 | 63.2926 | 66.4624 | 88.8557 | 101.8573 | 60.3434 | 59.315 |
| GPT2ForSequenceClassification | 4 | 102.1298 | 104.4638 | nan | 146.0922 | 56.8172 | 57.1998 |
| DebertaForMaskedLM | 4 | 65.9367 | 75.8612 | 82.425 | 93.0798 | 54.9303 | 57.4511 |
| OPTForCausalLM | 32 | 61.7387 | 66.4546 | 86.8474 | 132.8071 | 52.3876 | 51.4832 |
| T5Small | 1 | 55.6445 | 68.3743 | 56.7179 | 65.6033 | 44.9713 | 46.9905 |
| PLBartForConditionalGeneration | 16 | 52.3248 | 50.7672 | 60.9178 | 61.3415 | 40.7657 | 40.9629 |
| PLBartForCausalLM | 32 | 41.2485 | 44.7987 | 57.7016 | 45.8768 | 40.2663 | 39.2516 |
| MT5ForConditionalGeneration | 8 | 81.8928 | 97.3919 | 63.5614 | 83.6111 | 34.3991 | 43.1826 |
| DistilBertForQuestionAnswering | 64 | 39.6988 | 40.2279 | 55.7975 | 78.4131 | 34.0035 | 34.464 |
| Speech2Text2ForCausalLM | 128 | 35.4002 | 37.9488 | 53.1044 | 37.7763 | 31.5379 | 30.6998 |
| YituTechConvBert | 1 | 51.3442 | 64.76 | 28.5564 | nan | 16.1182 | 37.6965 |
| CamemBert | 1 | 30.0259 | 32.4806 | 23.1599 | nan | 13.2432 | 21.3188 |
| GoogleFnet | 1 | 19.6343 | 23.0327 | 19.0203 | nan | 11.1079 | 17.8829 |
| DistillGPT2 | 1 | 17.3653 | 18.1379 | 16.4722 | nan | 8.6414 | 10.5932 |
| BigBird | 1 | 187.3101 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
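
For anyone post-processing these dumps, here is a minimal Python sketch of how the `|`-delimited rows could be parsed and a latency ratio against eager derived. The helper names `parse_table` and `ratio_vs_eager` are hypothetical, and the ratio definition (eager latency divided by backend latency) is an assumption for illustration; the dashboard's own speedup numbers come from separate measurements and may not match exactly. The sample rows are copied from the huggingface "Absolute latency (ms)" table.

```python
import math

# Three rows copied from the huggingface "Absolute latency (ms)" table.
TABLE = """\
+------+----+-------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+------+----+-------+
| AlbertForMaskedLM | 4 | 385.9631 | 386.0535 | nan | 315.6099 | 309.593 | 309.8842 |
| BartForCausalLM | 4 | 123.2317 | 127.3521 | 163.4513 | 123.2182 | 112.0568 | 110.8998 |
| BigBird | 1 | 187.3101 | nan | nan | nan | nan | nan |
+------+----+-------+
"""


def parse_table(text):
    """Parse '|'-delimited rows into dicts keyed by the header row.

    Border lines (starting with '+') and anything else that does not
    start with '|' are skipped.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def ratio_vs_eager(row, backend):
    """Return eager latency / backend latency (assumed definition).

    Returns None when either value is nan (the backend did not run)
    or the backend latency is zero.
    """
    eager = float(row["eager"])
    other = float(row[backend])
    if math.isnan(eager) or math.isnan(other) or other == 0.0:
        return None
    return eager / other


rows = parse_table(TABLE)
for row in rows:
    print(row["name"], ratio_vs_eager(row, "inductor"))
```

Returning `None` for `nan` cells keeps failed runs out of any later aggregate instead of silently propagating `nan` through it.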

timm_models suite with float32 precision

Performance speedup

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9991 | 0.9735 | 0.8271 | 1.287 | 1.8711 | 1.8282 |
| lcnet_050 | 128 | 0.9563 | 0.9501 | 0.7483 | 1.3417 | 1.6519 | 1.6174 |
| regnety_002 | 128 | 0.979 | 0.9999 | 0.8653 | 0.9682 | 1.5063 | 1.3129 |
| dm_nfnet_f0 | 128 | 0.9996 | 1.001 | 0.0 | 1.1372 | 1.472 | 1.4218 |
| xcit_large_24_p8_224 | 5 | 1.003 | 0.9946 | 0.797 | 0.0 | 1.4253 | 1.4149 |
| hrnet_w18 | 128 | 0.9997 | 0.9985 | 0.0 | 1.2822 | 1.4177 | 1.3777 |
| dla102 | 128 | 1.0 | 1.0008 | 0.0 | 1.2835 | 1.3834 | 1.3684 |
| volo_d1_224 | 64 | 0.9998 | 0.9953 | 0.8027 | 0.0 | 1.3814 | 1.3654 |
| nfnet_l0 | 128 | 0.9994 | 0.7885 | 0.0 | 1.1057 | 1.3716 | 1.3235 |
| res2net50_14w_8s | 128 | 0.9999 | 0.9993 | 0.0 | 1.2523 | 1.3554 | 1.3251 |
| mobilenetv2_100 | 128 | 0.9656 | 0.9624 | 0.7067 | 1.2871 | 1.3338 | 1.3544 |
| mobilenetv3_large_100 | 128 | 0.9659 | 0.9605 | 0.7633 | 1.2856 | 1.3299 | 1.3428 |
| crossvit_9_240 | 128 | 0.9995 | 0.9986 | 0.7604 | 1.0403 | 1.3297 | 1.3053 |
| inception_v3 | 128 | 1.0 | 0.9993 | 0.0 | 1.1292 | 1.3265 | 1.307 |
| gluon_inception_v3 | 128 | 1.0 | 0.9993 | 0.0 | 1.129 | 1.3262 | 1.3095 |
| adv_inception_v3 | 128 | 0.9999 | 0.9986 | 0.0 | 1.1289 | 1.3259 | 1.3091 |
| resnest101e | 64 | 1.0001 | 1.003 | 0.0 | 1.1693 | 1.3149 | 1.2683 |
| res2next50 | 128 | 0.9997 | 1.0007 | 0.0 | 1.1803 | 1.3124 | 1.2741 |
| coat_lite_mini | 128 | 1.0002 | 1.0003 | 0.8461 | 1.0933 | 1.3035 | 1.3121 |
| fbnetv3_b | 128 | 0.9652 | 0.9611 | 0.7601 | 1.2413 | 1.2818 | 1.2971 |
| jx_nest_base | 32 | 0.9997 | 0.9924 | 0.7353 | 0.0 | 1.2811 | 1.2513 |
| botnet26t_256 | 128 | 0.9855 | 0.9853 | 0.7895 | 0.0 | 1.2758 | 1.2797 |
| gmixer_24_224 | 128 | 1.0 | 0.8345 | 0.0 | 1.086 | 1.2734 | 1.2485 |
| selecsls42b | 128 | 0.9998 | 0.9987 | 0.8147 | 1.2111 | 1.2663 | 1.2529 |
| mnasnet_100 | 128 | 0.9666 | 0.9637 | 0.7853 | 1.2529 | 1.2636 | 1.2795 |
| tf_efficientnet_b0 | 128 | 0.9767 | 0.7831 | 0.0 | 1.1628 | 1.2593 | 1.2679 |
| convit_base | 64 | 0.9997 | 0.9989 | 0.0 | 0.0 | 1.2532 | 1.225 |
| sebotnet33ts_256 | 64 | 0.9761 | 0.8067 | 0.0 | 0.0 | 1.2476 | 1.2611 |
| cait_m36_384 | 4 | 0.9997 | 0.9994 | 0.0 | 0.0 | 1.2459 | 1.2197 |
| fbnetc_100 | 128 | 0.9664 | 0.9609 | 0.7892 | 1.2446 | 1.2456 | 1.2636 |
| eca_botnext26ts_256 | 128 | 0.9864 | 0.7725 | 0.0 | 0.0 | 1.2456 | 1.2289 |
| eca_halonext26ts | 128 | 0.9868 | 0.7788 | 0.0 | 0.0 | 1.2416 | 1.2211 |
| ese_vovnet19b_dw | 128 | 0.979 | 0.9777 | 0.7436 | 1.1487 | 1.2373 | 1.2467 |
| spnasnet_100 | 128 | 0.9614 | 0.9575 | 0.7725 | 1.2222 | 1.2351 | 1.253 |
| res2net101_26w_4s | 64 | 1.0 | 0.9962 | 0.7724 | 1.0977 | 1.227 | 1.1885 |
| cspdarknet53 | 64 | 0.9577 | 0.954 | 0.7352 | 1.1741 | 1.2195 | 1.2266 |
| gmlp_s16_224 | 128 | 0.9999 | 0.9994 | 0.0 | 1.0916 | 1.2143 | 1.203 |
| rexnet_100 | 128 | 0.9727 | 0.8163 | 0.0 | 1.1604 | 1.2125 | 1.2201 |
| pnasnet5large | 16 | 0.9998 | 0.9982 | 0.0 | 1.09 | 1.208 | 1.1916 |
| pit_b_224 | 64 | 1.0004 | 0.9996 | 0.0 | 1.033 | 1.2031 | 1.1933 |
| tinynet_a | 128 | 0.9667 | 0.7752 | 0.6199 | 1.1474 | 1.1891 | 1.2003 |
| dpn107 | 32 | 0.9579 | 0.9505 | 0.7795 | 1.0259 | 1.1853 | 1.2025 |
| mobilevit_s | 64 | 0.979 | 0.7618 | 0.0 | 0.0 | 1.1718 | 1.1709 |
| poolformer_m36 | 64 | 0.9999 | 0.9993 | 0.0 | 0.0 | 1.1672 | 1.1474 |
| tf_mixnet_l | 128 | 0.9854 | 0.8894 | 0.0 | 1.094 | 1.1665 | 1.1634 |
| convnext_base | 64 | 0.9999 | 0.9985 | 0.0 | 0.0 | 1.1519 | 1.1031 |
| mixnet_l | 128 | 0.985 | 0.8857 | 0.0 | 1.0989 | 1.1457 | 1.1447 |
| repvgg_a2 | 128 | 0.9638 | 0.9629 | 0.8282 | 1.137 | 1.1447 | 1.1445 |
| swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9776 | 0.0 | 0.0 | 1.1414 | 1.1358 |
| twins_pcpvt_base | 64 | 1.0002 | 0.9989 | 0.7497 | 0.0 | 1.1403 | 1.1032 |
| beit_base_patch16_224 | 64 | 0.9998 | 0.9813 | 0.0 | 0.0 | 1.1164 | 1.1032 |
| swsl_resnext101_32x16d | 32 | 0.9999 | 0.999 | 0.0 | 1.1081 | 1.1102 | 1.0709 |
| deit_base_distilled_patch16_224 | 64 | 0.9998 | 1.0001 | 0.7697 | 0.981 | 1.0946 | 1.084 |
| gluon_xception65 | 32 | 0.9999 | 0.9971 | 0.0 | 1.0812 | 1.0872 | 1.0749 |
| vit_base_patch16_224 | 64 | 0.9998 | 0.999 | 0.7681 | 0.9542 | 1.0863 | 1.0743 |
| mixer_b16_224 | 128 | 1.0 | 0.9986 | 0.0 | 0.8959 | 1.0831 | 1.0748 |
| convmixer_768_32 | 32 | 0.9999 | 1.0 | 0.0 | 0.0 | 1.0764 | 1.0748 |
| gernet_l | 128 | 0.9742 | 0.9734 | 0.8223 | 1.0979 | 1.0747 | 1.0716 |
| visformer_small | 128 | 1.0001 | 1.0024 | 0.7949 | 0.0 | 1.0377 | 1.0076 |
| resmlp_12_224 | 128 | 0.9999 | 1.001 | 0.6941 | 1.2096 | 0.9892 | 0.986 |
| tnt_s_patch16_224 | 128 | 0.9998 | 0.9994 | 0.0 | 0.0 | 0.0 | 1.5448 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| twins_pcpvt_base | 64 | 2.2662 | 10.5379 | 18.4938 | nan | 117.1788 | 115.6606 |
| hrnet_w18 | 128 | 6.2333 | 25.0274 | nan | 649.1476 | 115.9547 | 109.3871 |
| mobilevit_s | 64 | 1.7054 | 6.0053 | nan | nan | 77.419 | 75.6012 |
| pnasnet5large | 16 | 4.7195 | 18.1927 | nan | 346.9498 | 77.0455 | 72.1517 |
| xcit_large_24_p8_224 | 5 | 2.9091 | 14.0692 | 26.6831 | nan | 76.8489 | 74.1788 |
| swin_base_patch4_window7_224 | 64 | 2.757 | 10.3892 | nan | nan | 76.5992 | 75.1545 |
| cait_m36_384 | 4 | 3.0168 | 14.6241 | nan | nan | 65.2443 | 60.6939 |
| convnext_base | 64 | 1.4476 | 5.5304 | nan | nan | 64.3703 | 62.9312 |
| resnest101e | 64 | 3.2759 | 12.8374 | nan | 285.3705 | 61.5896 | 58.6567 |
| jx_nest_base | 32 | 1.8915 | 7.4474 | 13.0559 | nan | 54.7317 | 53.6902 |
| res2net101_26w_4s | 64 | 3.1538 | 13.5541 | 23.3207 | 245.7878 | 53.4548 | 49.9894 |
| eca_halonext26ts | 128 | 1.4725 | 4.4696 | nan | nan | 51.4422 | 49.5807 |
| coat_lite_mini | 128 | 1.0194 | 4.0905 | 6.6678 | 97.0521 | 49.8241 | 47.5565 |
| res2net50_14w_8s | 128 | 2.8196 | 12.2653 | nan | 260.1793 | 48.6062 | 46.1032 |
| poolformer_m36 | 64 | 1.7699 | 6.3245 | nan | nan | 48.0624 | 45.5522 |
| sebotnet33ts_256 | 64 | 1.6731 | 5.2279 | nan | nan | 43.9028 | 42.3897 |
| dpn107 | 32 | 3.9223 | 11.7534 | 36.1636 | 177.6673 | 40.7693 | 37.429 |
| gmlp_s16_224 | 128 | 1.066 | 5.5056 | nan | 160.1046 | 40.0627 | 38.5161 |
| volo_d1_224 | 64 | 1.3066 | 6.1542 | 9.958 | nan | 38.4627 | 35.4659 |
| gluon_xception65 | 32 | 1.9908 | 8.9549 | nan | 158.7971 | 37.8585 | 34.6523 |
| fbnetv3_b | 128 | 3.1698 | 9.486 | 25.1844 | 238.9329 | 37.806 | 35.5407 |
| crossvit_9_240 | 128 | 1.4676 | 6.817 | 10.5583 | 176.1551 | 36.4914 | 34.6969 |
| eca_botnext26ts_256 | 128 | 1.4241 | 4.3242 | nan | nan | 36.326 | 35.4043 |
| gluon_inception_v3 | 128 | 1.6685 | 6.9413 | nan | 151.7194 | 33.4678 | 31.9209 |
| adv_inception_v3 | 128 | 1.5819 | 6.904 | nan | 149.5679 | 33.2898 | 31.5152 |
| ghostnet_100 | 128 | 3.0485 | 8.1957 | 12.1432 | 160.6422 | 32.9818 | 31.1739 |
| inception_v3 | 128 | 1.6487 | 7.1357 | nan | 146.1908 | 32.9462 | 31.339 |
| botnet26t_256 | 128 | 1.2923 | 3.6693 | 8.2489 | nan | 32.9312 | 32.4912 |
| tf_mixnet_l | 128 | 5.7277 | 11.4587 | nan | 158.9035 | 32.84 | 32.6332 |
| dla102 | 128 | 1.8574 | 7.7022 | nan | 180.8735 | 31.7837 | 29.9326 |
| mixnet_l | 128 | 5.4917 | 11.1742 | nan | 155.5895 | 31.7229 | 29.614 |
| swsl_resnext101_32x16d | 32 | 1.8085 | 7.5782 | nan | 131.1815 | 31.2384 | 29.9707 |
| dm_nfnet_f0 | 128 | 2.0497 | 6.1138 | nan | 153.2069 | 31.0927 | 29.0205 |
| gmixer_24_224 | 128 | 1.1529 | 5.9609 | nan | 138.6416 | 30.8946 | 30.1959 |
| convit_base | 64 | 0.9797 | 4.5641 | nan | nan | 28.6165 | 26.2538 |
| res2next50 | 128 | 1.6273 | 6.8538 | nan | 155.9063 | 28.274 | 26.2134 |
| rexnet_100 | 128 | 1.9291 | 6.2522 | nan | 148.439 | 26.7564 | 25.2485 |
| tinynet_a | 128 | 2.0822 | 6.7759 | 17.5976 | 144.2778 | 26.3711 | 25.0549 |
| visformer_small | 128 | 0.8832 | 3.4107 | 5.3627 | nan | 23.1768 | 21.5757 |
| tf_efficientnet_b0 | 128 | 1.8301 | 5.8355 | nan | 129.9398 | 23.0551 | 21.8491 |
| cspdarknet53 | 64 | 2.2746 | 6.4019 | 17.0733 | 122.2459 | 22.6779 | 21.5023 |
| mixer_b16_224 | 128 | 0.6542 | 2.6471 | nan | 69.6444 | 22.5674 | 21.2135 |
| nfnet_l0 | 128 | 1.761 | 6.3546 | nan | 133.8996 | 22.2467 | 20.4495 |
| resmlp_12_224 | 128 | 0.5654 | 2.3067 | 3.9204 | 32.6713 | 22.1405 | 21.1308 |
| fbnetc_100 | 128 | 2.0353 | 5.7171 | 15.9073 | 113.2999 | 21.9797 | 20.5043 |
| spnasnet_100 | 128 | 2.0285 | 5.4653 | 15.4472 | 107.7142 | 21.5125 | 21.0685 |
| convmixer_768_32 | 32 | 1.2475 | 5.0269 | nan | nan | 21.458 | 20.7408 |
| mobilenetv3_large_100 | 128 | 1.5667 | 4.7944 | 11.8293 | 121.1485 | 20.8217 | 19.5016 |
| repvgg_a2 | 128 | 1.9831 | 5.3062 | 14.2171 | 161.4788 | 20.0077 | 18.8187 |
| beit_base_patch16_224 | 64 | 1.1023 | 4.3887 | nan | nan | 19.8758 | 18.6612 |
| pit_b_224 | 64 | 0.9264 | 3.8181 | nan | 98.6394 | 19.757 | 18.8757 |
| deit_base_distilled_patch16_224 | 64 | 0.8117 | 3.4463 | 5.4908 | 75.5551 | 19.4952 | 18.5132 |
| mobilenetv2_100 | 128 | 1.61 | 4.8353 | 11.9812 | 102.1246 | 19.1315 | 18.2086 |
| vit_base_patch16_224 | 64 | 0.8362 | 3.3856 | 5.5024 | 73.459 | 19.0401 | 18.31 |
| gernet_l | 128 | 1.955 | 5.1558 | 13.9513 | 88.0253 | 18.3595 | 17.001 |
| mnasnet_100 | 128 | 1.6059 | 4.6626 | 11.8794 | 91.4275 | 18.3286 | 17.2686 |
| regnety_002 | 128 | 1.6044 | 4.6072 | 11.5341 | 93.7894 | 17.7731 | 17.4907 |
| selecsls42b | 128 | 0.7384 | 3.1339 | 4.8568 | 76.2275 | 15.8947 | 15.0424 |
| lcnet_050 | 128 | 0.9166 | 2.8132 | 6.9165 | 68.1013 | 13.6077 | 12.3519 |
| ese_vovnet19b_dw | 128 | 0.9474 | 2.4593 | 5.8873 | 55.5635 | 13.1872 | 11.9029 |
| tnt_s_patch16_224 | 128 | 1.6823 | 8.2162 | nan | nan | nan | 32.0803 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 1.6177 | 1.5612 | 1.6333 |
| tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.9898 | 1.351 | 1.5843 |
| nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.7757 | 1.2908 | 1.4944 |
| rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2619 | 1.4738 |
| tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.2059 | 1.3819 |
| mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 |
| pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1917 | 1.177 | 1.3423 |
| mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1752 | 1.2828 |
| cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1133 | 1.1803 |
| eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | nan | 1.1107 | 1.3608 |
| eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1106 | 1.3328 |
| poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0528 | 1.0689 |
| dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.7593 | 1.0219 | 1.0956 |
| beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 |
| resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1036 |
| vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 1.2304 | 0.997 | 1.0835 |
| fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.9926 | 1.051 |
| deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 1.2337 | 0.9925 | 1.0805 |
| ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 0.9853 | 1.1265 |
| mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 1.4726 | 0.985 | 1.0538 |
| convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 |
| volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9799 | 1.0636 |
| tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9769 | 1.1451 |
| gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 1.0153 | 0.9766 | 0.9827 |
| twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | nan | 0.9764 | 1.0866 |
| dla102 | 128 | 0.9831 | 0.917 | nan | 0.953 | 0.9632 | 1.0419 |
| xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | 0.3296 | nan | 0.9616 | 1.054 |
| convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.9576 | 0.9919 |
| ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 0.9519 | 1.0925 |
| gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.942 | 0.9938 |
| mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9408 | 1.0412 |
| spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9382 | 0.993 |
| hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8646 | 0.9379 | 1.0122 |
| jx_nest_base | 32 | 1.0002 | 0.8966 | 0.2864 | nan | 0.9348 | 1.0603 |
| mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9325 | 0.9919 |
| res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9285 | 1.0154 |
| lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.9152 | 0.9655 |
| adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 |
| gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 |
| inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0636 |
| res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9078 | 1.0156 |
| mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9069 | 1.0618 |
| swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9068 | 1.0515 |
| dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.894 | 0.9058 | 0.9905 |
| cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9052 | 1.0666 |
| fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9049 | 0.9968 |
| visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9034 | 0.9939 |
| selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.899 | 1.0046 |
| swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | 0.9946 |
| res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8821 | 1.0206 |
| regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8617 | 1.0396 |
| botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 |
| pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8525 | 1.0754 |
| coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.9856 | 0.8441 | 1.0596 |
| sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.841 | 0.9709 |
| resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | 1.3763 | 0.8169 | 0.8253 |
| gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9926 |
| convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.7463 | 0.9008 |
| crossvit_9_240 | 128 | 0.9884 | 0.8657 | 0.282 | 1.1222 | 0.6493 | 0.869 |
| repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.657 | 0.5319 | 0.8172 |
| tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8623 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)

~~~
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| convmixer_768_32 | 32 | 364.7164 | 364.576 | nan | nan | 338.7634 | 339.1822 |
| hrnet_w18 | 128 | 415.7948 | 416.0197 | nan | 324.4762 | 293.5078 | 301.4232 |
| pnasnet5large | 16 | 288.9634 | 288.8862 | nan | 264.5906 | 239.3012 | 242.1155 |
| convnext_base | 64 | 262.8676 | 263.0706 | nan | nan | 229.1369 | 239.352 |
| tf_mixnet_l | 128 | 256.4151 | 283.9661 | nan | 231.0609 | 216.6083 | 217.1789 |
| mixnet_l | 128 | 247.0442 | 274.7074 | nan | 221.4583 | 212.457 | 212.5653 |
| swin_base_patch4_window7_224 | 64 | 236.6839 | 241.976 | nan | nan | 207.4447 | 208.4217 |
| swsl_resnext101_32x16d | 32 | 219.7108 | 219.7656 | nan | 197.8154 | 198.0054 | 204.5629 |
| dla102 | 128 | 269.068 | 268.9229 | nan | 209.58 | 194.6572 | 196.6669 |
| resnest101e | 64 | 230.5828 | 229.5787 | nan | 196.583 | 175.4395 | 180.9957 |
| cait_m36_384 | 4 | 216.2465 | 216.2886 | nan | nan | 173.6305 | 177.4594 |
| adv_inception_v3 | 128 | 226.1568 | 226.3618 | nan | 200.3687 | 170.5893 | 172.7345 |
| inception_v3 | 128 | 226.1126 | 226.1946 | nan | 200.2971 | 170.5055 | 173.1621 |
| gluon_inception_v3 | 128 | 226.2116 | 226.5072 | nan | 200.4811 | 170.4984 | 172.8465 |
| res2net50_14w_8s | 128 | 228.9901 | 229.3325 | nan | 182.7602 | 168.9222 | 172.7704 |
| gluon_xception65 | 32 | 182.5665 | 182.6737 | nan | 168.4408 | 167.7709 | 169.5174 |
| res2next50 | 128 | 206.7398 | 206.5022 | nan | 175.179 | 157.6009 | 162.349 |
| convit_base | 64 | 196.3662 | 196.4801 | nan | nan | 156.8487 | 160.1714 |
| dpn107 | 32 | 191.0031 | 192.3276 | 234.4257 | 177.9332 | 154.144 | 152.0983 |
| gernet_l | 128 | 164.9369 | 165.0954 | 193.56 | 146.4964 | 149.6372 | 149.8842 |
| poolformer_m36 | 64 | 174.1432 | 174.5289 | nan | nan | 149.4695 | 151.9038 |
| mixer_b16_224 | 128 | 159.4998 | 159.6939 | nan | 177.2573 | 147.7952 | 148.5498 |
| coat_lite_mini | 128 | 191.1736 | 191.314 | 226.3225 | 174.9499 | 146.9095 | 146.0035 |
| dm_nfnet_f0 | 128 | 206.3559 | 206.0075 | nan | 180.6911 | 139.5365 | 144.6337 |
| eca_halonext26ts | 128 | 169.3788 | 214.5123 | nan | nan | 134.4727 | 136.7348 |
| pit_b_224 | 64 | 158.2687 | 158.3701 | nan | 153.1735 | 131.5401 | 132.5351 |
| eca_botnext26ts_256 | 128 | 163.429 | 208.628 | nan | nan | 129.4111 | 131.0966 |
| nfnet_l0 | 128 | 175.8015 | 223.0235 | nan | 159.0239 | 128.4791 | 133.1477 |
| gmlp_s16_224 | 128 | 151.8464 | 152.0645 | nan | 139.0152 | 125.0592 | 126.2259 |
| res2net101_26w_4s | 64 | 151.3676 | 151.7924 | 196.1314 | 137.9116 | 123.4674 | 127.317 |
| visformer_small | 128 | 128.2997 | 128.0302 | 161.5158 | nan | 123.3793 | 127.0886 |
| fbnetv3_b | 128 | 162.1036 | 162.8321 | 206.0775 | 126.0538 | 122.2982 | 120.6191 |
| twins_pcpvt_base | 64 | 136.8728 | 137.305 | 182.4758 | nan | 120.453 | 124.0717 |
| botnet26t_256 | 128 | 152.0042 | 152.1171 | 189.9622 | nan | 117.4753 | 117.1338 |
| beit_base_patch16_224 | 64 | 128.3477 | 130.8387 | nan | nan | 115.4926 | 116.3888 |
| gmixer_24_224 | 128 | 146.1997 | 175.2705 | nan | 134.7032 | 115.0664 | 117.2085 |
| volo_d1_224 | 64 | 153.163 | 153.7338 | 191.0416 | nan | 111.0237 | 112.1878 |
| vit_base_patch16_224 | 64 | 119.6743 | 119.7557 | 155.6698 | 125.1008 | 110.3185 | 111.4237 |
| deit_base_distilled_patch16_224 | 64 | 120.4873 | 120.4018 | 156.5731 | 122.6324 | 110.2366 | 
111.1239 | | repvgg_a2 | 128 | 127.2434 | 127.1987 | 146.1325 | 107.6755 | 107.0431 | 107.0365 | | tf_efficientnet_b0 | 128 | 133.821 | 166.9694 | nan | 112.3867 | 103.7357 | 103.0019 | | cspdarknet53 | 64 | 130.1397 | 130.6222 | 169.7018 | 106.2961 | 102.2441 | 101.553 | | mobilevit_s | 64 | 117.0073 | 150.2937 | nan | nan | 97.6156 | 97.7899 | | xcit_large_24_p8_224 | 5 | 136.6401 | 138.7298 | 173.6634 | nan | 95.8585 | 98.7903 | | fbnetc_100 | 128 | 123.0773 | 123.9101 | 150.9197 | 95.6147 | 95.5325 | 94.1965 | | rexnet_100 | 128 | 119.0708 | 141.9074 | nan | 99.8112 | 95.5268 | 94.9581 | | jx_nest_base | 32 | 121.2089 | 122.1915 | 164.9378 | nan | 94.6573 | 96.7501 | | sebotnet33ts_256 | 64 | 114.2967 | 138.3142 | nan | nan | 89.3991 | 88.4513 | | tinynet_a | 128 | 109.7138 | 136.8456 | 171.3042 | 92.3455 | 89.2493 | 88.3438 | | spnasnet_100 | 128 | 105.7773 | 106.1651 | 131.883 | 83.1317 | 82.4138 | 81.1178 | | ese_vovnet19b_dw | 128 | 99.4155 | 99.627 | 131.0117 | 84.8229 | 78.6392 | 78.074 | | mnasnet_100 | 128 | 98.5184 | 98.8164 | 121.3929 | 75.9443 | 75.4326 | 74.4289 | | crossvit_9_240 | 128 | 98.4139 | 98.5114 | 129.5418 | 94.4011 | 74.0945 | 75.2598 | | resmlp_12_224 | 128 | 71.0752 | 71.0736 | 102.5055 | 58.7646 | 72.0705 | 72.1462 | | mobilenetv2_100 | 128 | 97.6118 | 97.9058 | 133.4812 | 73.2275 | 70.6758 | 69.5615 | | selecsls42b | 128 | 89.4777 | 89.6136 | 109.7714 | 73.9275 | 70.6353 | 71.4362 | | mobilenetv3_large_100 | 128 | 85.3065 | 85.8947 | 108.2134 | 64.176 | 62.009 | 61.3691 | | ghostnet_100 | 128 | 114.6433 | 117.4849 | 138.5621 | 88.878 | 61.2645 | 62.5794 | | regnety_002 | 128 | 53.3624 | 51.7898 | 60.5616 | 54.0045 | 34.9635 | 42.0852 | | lcnet_050 | 128 | 38.1917 | 38.4966 | 48.8487 | 27.2549 | 22.1532 | 22.5955 | | tnt_s_patch16_224 | 128 | 470.1132 | 469.9544 | nan | nan | nan | 304.3002 | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/PkAiBBu.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/549uI7Y.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/Rxgrc89.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs, and each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats:

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
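As a rough illustration of the filtering step described above, the sketch below computes a geometric mean speedup after dropping models that did not pass the accuracy check. The function name `geomean_speedup`, the dictionary layout, and the rule that any status beginning with `pass` counts as passing are assumptions made for this example, not the dashboard's actual implementation, and the sample values are invented.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups over models that passed accuracy.

    speedups: dict of model name -> speedup relative to eager (hypothetical layout).
    accuracy: dict of model name -> status string such as "pass" or "fail_to_run".
    """
    kept = [
        s
        for name, s in speedups.items()
        # Assumption: statuses starting with "pass" are kept; non-positive
        # speedups (e.g. 0.0 placeholders for failed runs) are skipped.
        if accuracy.get(name, "").startswith("pass") and s > 0
    ]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


# Invented values for illustration only.
speedups = {"model_a": 2.0, "model_b": 0.5, "model_c": 0.0}
accuracy = {"model_a": "pass", "model_b": "pass", "model_c": "fail_to_run"}
result = geomean_speedup(speedups, accuracy)
print(f"{result:.2f}x")
```

A geometric mean is the usual choice for averaging ratios because a 2x speedup and a 0.5x slowdown cancel out to 1x, which an arithmetic mean would not reflect.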

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 98%, 41/42  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 95%, 40/42  | 97%, 59/61  |
|     aot_cudagraphs     | 80%, 43/54 | 86%, 36/42  | 90%, 55/61  |
|    nvprims_nvfuser     | 56%, 30/54 |  10%, 4/42  | 52%, 32/61  |
|        inductor        | 83%, 45/54 | 90%, 38/42  | 92%, 56/61  |
| inductor_no_cudagraphs | 87%, 47/54 | 90%, 38/42  | 92%, 56/61  |
+------------------------+------------+-------------+-------------+
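The cells in the passrate table pair a rounded percentage with a raw count. A minimal sketch of that formatting follows, assuming a list of per-model status strings and assuming that any status beginning with `pass` counts as a pass; the helper name `passrate` is hypothetical.

```python
def passrate(statuses):
    """Format a pass rate as 'NN%, passed/total' from a list of status strings."""
    # Assumption: "pass" and "pass_due_to_skip" both count as passing.
    passed = sum(1 for s in statuses if s.startswith("pass"))
    total = len(statuses)
    return f"{passed / total:.0%}, {passed}/{total}"


# 52 passing models out of 54, matching the shape of the eager/torchbench cell.
statuses = ["pass"] * 52 + ["fail_to_run", "fail_to_run"]
cell = passrate(statuses)
print(cell)  # 96%, 52/54
```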

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.24x    |    1.10x    |    1.00x    |
|    nvprims_nvfuser     |   1.01x    |    1.04x    |    1.08x    |
|        inductor        |   1.89x    |    1.81x    |    1.43x    |
| inductor_no_cudagraphs |   1.38x    |    1.57x    |    1.37x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.02    |    2.81     |    2.27     |
|       aot_eager        |    6.63    |    9.97     |    8.72     |
|     aot_cudagraphs     |    9.77    |    17.33    |    16.24    |
|    nvprims_nvfuser     |   63.87    |   113.26    |   148.67    |
|        inductor        |   32.66    |    35.79    |    43.33    |
| inductor_no_cudagraphs |   32.43    |    31.42    |    41.22    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.83x    |    0.89x    |    0.88x    |
|     aot_cudagraphs     |   0.41x    |    0.38x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.01x    |    0.86x    |
|        inductor        |   0.83x    |    0.88x    |    0.95x    |
| inductor_no_cudagraphs |   0.95x    |    1.05x    |    1.06x    |
+------------------------+------------+-------------+-------------+

Warnings

We flag models where:

- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings

~~~
+-------------+-----------------------+----------+------------------------+
| suite       | name                  | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  | hf_GPT2_large         | 0.0      | 1.8638                 |
| torchbench  | tacotron2             | 0.0      | 0.8858                 |
| torchbench  | dlrm                  | 0.0      | 0.0                    |
| torchbench  | hf_BigBird            | 0.0      | 0.0                    |
| torchbench  | hf_Longformer         | 0.0      | 0.0                    |
| torchbench  | moco                  | 0.0      | 0.0                    |
| huggingface | BigBird               | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase | 0.0      | 0.0                    |
| timm_models | convnext_base         | 0.6647   | 0.6597                 |
| timm_models | eca_halonext26ts      | 0.0      | 0.0                    |
+-------------+-----------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+-------------+-------------------+----------+------------------------+
| suite       | name              | inductor | inductor_no_cudagraphs |
+-------------+-------------------+----------+------------------------+
| torchbench  | yolov3            | 410.1974 | 404.9732               |
| torchbench  | timm_efficientdet | 144.1048 | 140.1487               |
| torchbench  | hf_T5_large       | 127.766  | 123.869                |
| timm_models | hrnet_w18         | 147.2362 | 132.9441               |
| timm_models | twins_pcpvt_base  | 127.3752 | 125.1997               |
+-------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+----------------------------------+----------+------------------------+
| suite       | name                             | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench  | timm_vision_transformer_large    | 0.879    | 0.9541                 |
| torchbench  | timm_resnest                     | 0.8759   | 0.9953                 |
| torchbench  | densenet121                      | 0.8753   | 1.0051                 |
| torchbench  | squeezenet1_1                    | 0.8735   | 1.0608                 |
| torchbench  | hf_Bert                          | 0.8735   | 0.942                  |
| torchbench  | shufflenet_v2_x1_0               | 0.8692   | 0.9802                 |
| torchbench  | resnet50                         | 0.8659   | 0.885                  |
| torchbench  | fastNLP_Bert                     | 0.8657   | 1.0681                 |
| torchbench  | Background_Matting               | 0.8561   | 1.0426                 |
| torchbench  | hf_T5_large                      | 0.8541   | 0.8541                 |
| torchbench  | hf_DistilBert                    | 0.8384   | 0.9048                 |
| torchbench  | hf_Bart                          | 0.8231   | 1.0097                 |
| torchbench  | alexnet                          | 0.7973   | 1.0079                 |
| torchbench  | mobilenet_v3_large               | 0.791    | 0.8143                 |
| torchbench  | timm_vovnet                      | 0.7799   | 0.8875                 |
| torchbench  | pytorch_stargan                  | 0.7783   | 0.8847                 |
| torchbench  | resnext50_32x4d                  | 0.7644   | 0.7753                 |
| torchbench  | vgg16                            | 0.7633   | 1.0588                 |
| torchbench  | mnasnet1_0                       | 0.7541   | 0.7741                 |
| torchbench  | drq                              | 0.752    | 0.9256                 |
| torchbench  | LearningToPaint                  | 0.7299   | 0.925                  |
| torchbench  | soft_actor_critic                | 0.7295   | 1.0368                 |
| torchbench  | timm_vision_transformer          | 0.7151   | 0.7249                 |
| torchbench  | resnet18                         | 0.6102   | 0.6257                 |
| torchbench  | hf_Reformer                      | 0.5851   | 1.0017                 |
| torchbench  | lennard_jones                    | 0.564    | 0.9991                 |
| torchbench  | nvidia_deeprecommender           | 0.5596   | 0.5596                 |
| torchbench  | functorch_dp_cifar10             | 0.4478   | 0.4688                 |
| torchbench  | pytorch_struct                   | 0.4235   | 0.4353                 |
| torchbench  | dcgan                            | 0.2123   | 0.2137                 |
| torchbench  | tacotron2                        | nan      | 0.4112                 |
| huggingface | MegatronBertForQuestionAnswering | 0.893    | 1.0179                 |
| huggingface | MegatronBertForCausalLM          | 0.8919   | 1.0276                 |
| huggingface | PLBartForConditionalGeneration   | 0.8843   | 1.0284                 |
| huggingface | DistilBertForMaskedLM            | 0.8802   | 0.948                  |
| huggingface | MT5ForConditionalGeneration      | 0.8751   | 0.919                  |
| huggingface | Speech2Text2ForCausalLM          | 0.869    | 0.949                  |
| huggingface | ElectraForCausalLM               | 0.856    | 0.9328                 |
| huggingface | PLBartForCausalLM                | 0.8549   | 0.9361                 |
| huggingface | BlenderbotSmallForCausalLM       | 0.846    | 0.9426                 |
| huggingface | CamemBert                        | 0.8061   | 0.9309                 |
| huggingface | XGLMForCausalLM                  | 0.8055   | 0.9902                 |
| huggingface | DistillGPT2                      | 0.8046   | 1.024                  |
| huggingface | YituTechConvBert                 | 0.791    | 0.9314                 |
| huggingface | M2M100ForConditionalGeneration   | 0.7585   | 1.0035                 |
| huggingface | MobileBertForMaskedLM            | 0.6698   | 0.9454                 |
| huggingface | MobileBertForQuestionAnswering   | 0.6085   | 0.8221                 |
| huggingface | DebertaForMaskedLM               | 0.409    | 1.0674                 |
| huggingface | DebertaForQuestionAnswering      | 0.3071   | 1.1616                 |
| timm_models | res2net101_26w_4s                | 0.8977   | 0.973                  |
| timm_models | gluon_xception65                 | 0.8975   | 0.9763                 |
| timm_models | inception_v3                     | 0.8975   | 1.0248                 |
| timm_models | gluon_inception_v3               | 0.8975   | 1.0248                 |
| timm_models | adv_inception_v3                 | 0.8975   | 1.0248                 |
| timm_models | fbnetc_100                       | 0.8973   | 0.9876                 |
| timm_models | hrnet_w18                        | 0.8969   | 1.0032                 |
| timm_models | selecsls42b                      | 0.8926   | 0.9897                 |
| timm_models | vit_base_patch16_224             | 0.8916   | 0.8968                 |
| timm_models | deit_base_distilled_patch16_224  | 0.8911   | 0.8962                 |
| timm_models | spnasnet_100                     | 0.8795   | 0.9819                 |
| timm_models | res2net50_14w_8s                 | 0.877    | 0.9738                 |
| timm_models | res2next50                       | 0.8719   | 0.9671                 |
| timm_models | mnasnet_100                      | 0.871    | 0.9804                 |
| timm_models | mixnet_l                         | 0.8701   | 1.0089                 |
| timm_models | gernet_l                         | 0.8619   | 0.9858                 |
| timm_models | cspdarknet53                     | 0.8607   | 1.0102                 |
| timm_models | botnet26t_256                    | 0.8503   | 0.9434                 |
| timm_models | lcnet_050                        | 0.8449   | 0.9432                 |
| timm_models | regnety_002                      | 0.8371   | 1.0078                 |
| timm_models | crossvit_9_240                   | 0.8174   | 1.0976                 |
| timm_models | convnext_base                    | 0.8095   | 0.9865                 |
| timm_models | resmlp_12_224                    | 0.8092   | 0.8236                 |
| timm_models | coat_lite_mini                   | 0.8033   | 1.0359                 |
| timm_models | swin_base_patch4_window7_224     | 0.7566   | 0.9257                 |
| timm_models | sebotnet33ts_256                 | 0.7449   | 0.8293                 |
| timm_models | jx_nest_base                     | 0.6707   | 0.8617                 |
| timm_models | repvgg_a2                        | 0.5534   | 0.8298                 |
+-------------+----------------------------------+----------+------------------------+
~~~
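The three warning thresholds listed in this section (speedup < 0.95x, compilation latency > 120 sec., compression ratio < 0.9) can be expressed as a small check. This is only a sketch: the function name `warnings_for`, the constant names, and the flag labels are hypothetical, and the dashboard's real script may treat missing (`nan`) values differently. The sample inputs are the inductor values reported for yolov3 and resnet18 in the torchbench amp tables of this comment.

```python
# Thresholds as stated in the Warnings section.
SPEEDUP_MIN = 0.95
COMPILE_LATENCY_MAX_SEC = 120.0
COMPRESSION_RATIO_MIN = 0.9


def warnings_for(speedup, compile_sec, compression_ratio):
    """Return the list of warning labels a model triggers.

    Note: comparisons against NaN are always False in Python, so a NaN
    metric never triggers a flag in this sketch.
    """
    flags = []
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_sec > COMPILE_LATENCY_MAX_SEC:
        flags.append("compilation_latency")
    if compression_ratio < COMPRESSION_RATIO_MIN:
        flags.append("compression_ratio")
    return flags


# (speedup, compilation latency in sec, peak memory compression ratio)
yolov3_flags = warnings_for(1.0901, 410.1974, 0.9063)
resnet18_flags = warnings_for(2.7974, 11.508, 0.6102)
print(yolov3_flags)    # ['compilation_latency']
print(resnet18_flags)  # ['compression_ratio']
```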

Metrics over time

bench_logs/passrate_over_time.png : ![](https://i.imgur.com/semQTbZ.png)

bench_logs/geomean_over_time.png : ![](https://i.imgur.com/B25sdeY.png)

Accuracy Regressions

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 0.9984 | 0.929 | 2.4798 | 0.7319 | 5.6889 | 1.3233 | | functorch_dp_cifar10 | 64 | 1.0032 | 0.956 | 2.3929 | 0.0 | 4.948 | 1.3429 | | timm_efficientdet | 1 | 0.9864 | 0.8095 | 2.1142 | 0.0 | 4.6557 | 1.5392 | | resnext50_32x4d | 8 | 1.0011 | 0.9622 | 1.9105 | 0.7517 | 3.5511 | 1.2641 | | BERT_pytorch | 16 | 1.007 | 0.8457 | 1.5557 | 0.7802 | 3.5306 | 2.3902 | | timm_vision_transformer | 8 | 1.0072 | 0.8573 | 1.8197 | 0.6021 | 3.4705 | 1.5487 | | mobilenet_v3_large | 32 | 1.0031 | 1.0118 | 1.5074 | 0.7604 | 3.0928 | 1.3924 | | drq | 1 | 1.0066 | 0.8243 | 2.0029 | 0.6225 | 2.9841 | 1.2544 | | dcgan | 32 | 0.9895 | 0.9195 | 1.6298 | 0.7228 | 2.8381 | 1.051 | | resnet18 | 16 | 0.9998 | 0.9998 | 1.5566 | 0.7953 | 2.7974 | 1.2471 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9977 | 0.9824 | 1.7765 | 0.0 | 2.7542 | 1.5773 | | mnasnet1_0 | 32 | 1.0006 | 1.0198 | 1.2717 | 0.7653 | 2.6067 | 1.3512 | | hf_T5_large | 2 | 1.0221 | 0.861 | 0.0 | 0.0 | 2.581 | 2.1474 | | squeezenet1_1 | 32 | 0.9949 | 0.9744 | 1.488 | 0.718 | 2.396 | 1.2974 | | hf_Albert | 8 | 1.0016 | 0.955 | 0.7742 | 0.0 | 2.3752 | 2.322 | | resnet152 | 32 | 1.0006 | 1.0007 | 1.2658 | 0.0 | 2.1659 | 1.3085 | | timm_efficientnet | 32 | 0.9587 | 0.8091 | 1.0846 | 0.68 | 2.1293 | 1.2861 | | pytorch_struct | 200 | 0.9898 | 0.8069 | 1.0128 | 0.5947 | 2.1147 | 1.2797 | | hf_Bert | 4 | 1.0362 | 0.8687 | 0.9559 | 0.0 | 2.1034 | 1.8076 | | lennard_jones | 1000 | 0.9613 | 0.7729 | 1.2851 | 0.4638 | 2.0958 | 1.054 | | hf_GPT2 | 4 | 1.0235 | 0.9874 | 0.8856 | 0.2914 | 1.9696 | 1.901 | | resnet50 | 32 | 1.0069 | 
1.0198 | 1.0485 | 0.8105 | 1.9102 | 1.3505 | | timm_resnest | 32 | 1.0069 | 1.0239 | 0.8411 | 0.9559 | 1.9077 | 1.6872 | | hf_T5 | 8 | 1.0003 | 0.9274 | 0.0 | 1.3524 | 1.8674 | 1.8742 | | LearningToPaint | 96 | 1.0 | 1.0085 | 1.182 | 0.8143 | 1.8366 | 1.2818 | | hf_Bart | 4 | 1.0088 | 0.8435 | 0.8723 | 0.0 | 1.8318 | 1.7224 | | soft_actor_critic | 256 | 1.0084 | 0.7501 | 1.2898 | 0.5727 | 1.7535 | 1.0556 | | shufflenet_v2_x1_0 | 128 | 1.0004 | 1.018 | 0.9726 | 0.8548 | 1.7024 | 1.4186 | | attention_is_all_you_need_pytorch | 256 | 1.0101 | 0.9248 | 0.8313 | 0.0 | 1.5799 | 1.5557 | | mobilenet_v2 | 96 | 0.9998 | 0.9893 | 0.7598 | 1.0436 | 1.5615 | 1.5186 | | speech_transformer | 32 | 1.0029 | 0.8284 | 1.7343 | 0.6359 | 1.5556 | 1.5362 | | fastNLP_Bert | 6 | 1.0007 | 0.8899 | 0.7707 | 0.0 | 1.5334 | 1.4715 | | hf_DistilBert | 8 | 0.9994 | 0.9723 | 0.7425 | 0.3623 | 1.5141 | 1.4855 | | timm_nfnet | 128 | 1.0003 | 0.9989 | 0.8723 | 0.9173 | 1.5085 | 1.4307 | | pytorch_stargan | 16 | 0.9955 | 1.0975 | 1.0518 | 0.0 | 1.4817 | 1.4092 | | timm_regnet | 32 | 0.9788 | 0.9387 | 0.902 | 0.7992 | 1.4274 | 1.2341 | | pytorch_unet | 1 | 0.9994 | 0.211 | 0.0 | 0.0 | 1.398 | 1.3647 | | timm_vovnet | 32 | 0.921 | 0.8917 | 0.8566 | 0.8015 | 1.3123 | 1.1448 | | vgg16 | 64 | 0.9996 | 0.9975 | 0.8574 | 0.9696 | 1.2731 | 1.2656 | | Super_SloMo | 6 | 0.9996 | 0.1756 | 0.0 | 0.0 | 1.2354 | 1.2009 | | alexnet | 128 | 0.9995 | 0.9974 | 0.8149 | 0.9304 | 1.2127 | 1.2089 | | hf_Reformer | 4 | 0.9983 | 1.0001 | 0.9929 | 0.6467 | 1.1771 | 1.1798 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9909 | 0.0 | 0.0 | 1.1572 | 1.1379 | | Background_Matting | 4 | 0.9996 | 0.1445 | 0.0 | 0.0 | 1.1432 | 1.126 | | yolov3 | 16 | 0.9995 | 0.9902 | 0.8025 | 0.0 | 1.0901 | 1.0664 | | tts_angular | 64 | 0.9856 | 0.9327 | 0.9887 | 0.9713 | 1.0178 | 1.0277 | | demucs | 4 | 1.0008 | 1.0014 | 0.9986 | 1.0013 | 1.0007 | 1.0004 | | nvidia_deeprecommender | 256 | 0.9989 | 0.9966 | 0.6968 | 1.0073 | 0.9893 | 1.0318 | 
| hf_GPT2_large | 4 | 1.0004 | 0.9906 | 0.0 | 0.0 | 0.0 | 1.8638 | | tacotron2 | 64 | 0.9837 | 0.7671 | 1.0043 | 0.6138 | 0.0 | 0.8858 | | dlrm | 2048 | 1.0398 | 1.241 | 0.0 | 1.0132 | 0.0 | 0.0 | | hf_BigBird | 2 | 0.9787 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | 
hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | 0.0000 | fail_to_run | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | 
pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | fail_to_run | fail_to_run | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 3.1052 | 8.3704 | 12.0173 | nan | 410.1974 | 404.9732 | | timm_efficientdet | 1 | 20.3235 | 37.9004 | 77.6229 | nan | 144.1048 | 140.1487 | | hf_T5_large | 2 | 14.3488 | 39.2614 | nan | nan | 127.766 | 123.869 | | timm_vision_transformer_large | 8 | 3.1371 | 15.381 | nan | nan | 73.287 | 69.9113 | | resnet152 | 32 | 2.7861 | 13.4418 | 
21.9886 | nan | 53.1665 | 51.4231 | | densenet121 | 4 | 2.514 | 12.1167 | 19.3687 | 237.3081 | 50.8713 | 50.4114 | | timm_resnest | 32 | 0.655 | 2.5872 | 3.869 | 67.7998 | 38.601 | 38.3417 | | timm_vision_transformer | 8 | 1.042 | 4.6354 | 6.7284 | 86.4047 | 35.3265 | 34.0328 | | hf_Bart | 4 | 2.1095 | 9.2445 | 13.9295 | nan | 33.4661 | 32.3341 | | BERT_pytorch | 16 | 1.8987 | 7.6899 | 11.4702 | 104.7828 | 32.6041 | 32.4238 | | timm_nfnet | 128 | 2.2086 | 7.2325 | 11.0375 | 162.9671 | 32.3426 | 31.1816 | | attention_is_all_you_need_pytorch | 256 | 1.4709 | 7.0425 | 11.1836 | nan | 32.1097 | 31.2503 | | hf_T5 | 8 | 2.7062 | 9.1489 | nan | 94.8588 | 30.0138 | 28.9176 | | fastNLP_Bert | 6 | 1.8527 | 7.3776 | 11.9634 | nan | 29.4152 | 27.2443 | | timm_regnet | 32 | 2.5002 | 8.2936 | 19.7926 | 147.3212 | 28.6761 | 27.3342 | | pytorch_stargan | 16 | 0.4672 | 2.0194 | 2.9525 | nan | 27.5278 | 26.1851 | | speech_transformer | 32 | 1.9852 | 8.8325 | 33.4276 | 187.9001 | 27.0422 | 25.9075 | | timm_efficientnet | 32 | 1.9247 | 6.8598 | 15.6129 | 148.5563 | 26.8783 | 26.3009 | | mobilenet_v3_large | 32 | 1.0461 | 4.8898 | 7.385 | 118.1306 | 26.1886 | 25.8204 | | pytorch_struct | 200 | 0.2794 | 0.8629 | 1.4933 | 7.9045 | 23.5033 | 21.4009 | | functorch_dp_cifar10 | 64 | 0.3331 | 1.4212 | 2.1122 | nan | 22.9293 | 22.0858 | | mnasnet1_0 | 32 | 0.9643 | 4.4323 | 6.6477 | 89.2708 | 21.5553 | 20.4725 | | hf_Bert | 4 | 1.9087 | 7.1292 | 10.4157 | nan | 21.4173 | 20.8345 | | Super_SloMo | 6 | 1.2017 | 7.3371 | nan | nan | 21.3492 | 21.1803 | | shufflenet_v2_x1_0 | 128 | 1.0964 | 5.0954 | 7.6381 | 106.9734 | 20.6302 | 20.9081 | | resnet50 | 32 | 1.0286 | 4.5371 | 6.9953 | 98.288 | 20.5563 | 19.8446 | | hf_GPT2 | 4 | 1.7483 | 6.4224 | 9.8483 | 86.9399 | 20.482 | 19.2882 | | resnext50_32x4d | 8 | 1.0352 | 4.6663 | 6.8671 | 83.0128 | 20.4413 | 19.8569 | | hf_Albert | 8 | 1.6393 | 6.5756 | 10.3808 | nan | 20.3446 | 19.6312 | | timm_vovnet | 32 | 1.5852 | 4.5019 | 9.8763 | 70.2608 | 20.1982 
| 19.3321 | | Background_Matting | 4 | 1.0537 | 8.9189 | nan | nan | 19.8374 | 19.4347 | | mobilenet_v2 | 96 | 0.9474 | 4.6363 | 7.0787 | 116.9942 | 19.5969 | 18.8624 | | hf_Reformer | 4 | 1.7119 | 2.9586 | 5.4457 | 17.1233 | 18.2696 | 15.8399 | | hf_DistilBert | 8 | 0.8299 | 3.4588 | 5.983 | 57.064 | 13.9215 | 13.2452 | | resnet18 | 16 | 0.4753 | 1.8325 | 2.6318 | 40.2048 | 11.508 | 11.182 | | dcgan | 32 | 0.1849 | 0.432 | 0.6801 | 5.1007 | 10.4021 | 10.1871 | | pytorch_unet | 1 | 0.5203 | 3.0404 | nan | nan | 9.8418 | 9.5292 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4727 | 2.0745 | 2.9088 | nan | 9.1528 | 8.887 | | LearningToPaint | 96 | 0.4979 | 1.9157 | 2.8968 | 47.3778 | 8.0722 | 8.247 | | squeezenet1_1 | 32 | 0.2732 | 0.9435 | 1.4106 | 6.9494 | 4.6896 | 4.5055 | | drq | 1 | 0.3366 | 0.6504 | 1.0479 | 6.3413 | 4.3637 | 3.6185 | | vgg16 | 64 | 0.2044 | 0.6667 | 1.1069 | 5.4917 | 4.3143 | 4.0741 | | nvidia_deeprecommender | 256 | 0.2258 | 0.5189 | 0.8371 | 5.1664 | 3.5683 | 3.362 | | soft_actor_critic | 256 | 0.2238 | 0.3729 | 0.5466 | 2.8581 | 3.4488 | 2.9327 | | alexnet | 128 | 0.1737 | 0.4517 | 0.7708 | 4.8236 | 3.4121 | 3.4387 | | lennard_jones | 1000 | 0.1594 | 0.352 | 0.5331 | 2.9985 | 2.1356 | 1.9516 | | tts_angular | 64 | 0.1941 | 0.2453 | 0.3712 | 1.5011 | 1.8735 | 1.7029 | | demucs | 4 | 0.3572 | 0.3692 | 0.3617 | 0.3723 | 0.2662 | 0.2697 | | hf_GPT2_large | 4 | 5.8619 | 20.0874 | nan | nan | nan | 54.8531 | | tacotron2 | 64 | 5.0773 | 17.9151 | 32.8942 | 90.2039 | nan | 42.9043 | | dlrm | 2048 | 0.4764 | 0.8724 | nan | 4.5343 | nan | nan | | hf_BigBird | 2 | 3.9804 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2716 | 0.4638 | 1.2042 | 1.2318 |
| hf_Albert | 8 | 1.0001 | 0.936 | 0.3267 | nan | 1.1576 | 1.4693 |
| speech_transformer | 32 | 0.9942 | 0.9812 | 0.3345 | 1.1938 | 1.093 | 1.0972 |
| mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 1.0606 | 1.1512 |
| timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3557 | 0.4815 | 1.0334 | 1.1302 |
| attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3513 | nan | 1.025 | 1.176 |
| timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3085 | nan | 0.9991 | 1.0312 |
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 |
| BERT_pytorch | 16 | 1.0003 | 0.8822 | 0.3999 | 1.1126 | 0.9743 | 1.1226 |
| hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.3799 | 1.1204 | 0.9649 | 1.1241 |
| Super_SloMo | 6 | 1.0024 | 0.8284 | nan | nan | 0.9647 | 1.2945 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.999 | 0.8754 | 0.4232 | nan | 0.9506 | 1.0224 |
| timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.8027 | 0.9344 | 1.0307 |
| hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.0304 | 0.9309 | 1.252 |
| resnet152 | 32 | 0.9937 | 0.8956 | 0.3631 | nan | 0.9123 | 0.9398 |
| pytorch_unet | 1 | 0.9968 | 0.7229 | nan | nan | 0.9113 | 1.0853 |
| yolov3 | 16 | 0.9908 | 0.8381 | 0.3537 | nan | 0.9063 | 1.0466 |
| timm_vision_transformer_large | 8 | 0.9973 | 0.8357 | nan | nan | 0.879 | 0.9541 |
| timm_resnest | 32 | 0.9868 | 0.8711 | 0.3482 | 0.8451 | 0.8759 | 0.9953 |
| densenet121 | 4 | 0.9857 | 0.8678 | 0.3673 | 0.8452 | 0.8753 | 1.0051 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.8735 | 1.0608 |
| hf_Bert | 4 | 1.0 | 0.8759 | 0.3902 | nan | 0.8735 | 0.942 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8692 | 0.9802 |
| resnet50 | 32 | 0.9907 | 0.8629 | 0.3559 | 0.7806 | 0.8659 | 0.885 |
| fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8657 | 1.0681 |
| Background_Matting | 4 | 1.0138 | 0.6522 | nan | nan | 0.8561 | 1.0426 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 |
| hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3413 | 1.0708 | 0.8384 | 0.9048 |
| hf_Bart | 4 | 1.0002 | 0.8307 | 0.3635 | nan | 0.8231 | 1.0097 |
| alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.775 | 0.7973 | 1.0079 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3447 | 0.7921 | 0.791 | 0.8143 |
| timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3409 | 0.7755 | 0.7799 | 0.8875 |
| pytorch_stargan | 16 | 0.9929 | 0.9728 | 0.4253 | nan | 0.7783 | 0.8847 |
| resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.389 | 0.81 | 0.7644 | 0.7753 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.3776 | 0.734 | 0.7633 | 1.0588 |
| mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3407 | 0.8226 | 0.7541 | 0.7741 |
| drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3826 | 0.6701 | 0.7299 | 0.925 |
| soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 |
| timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.392 | 1.0881 | 0.7151 | 0.7249 |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.3943 | 0.7314 | 0.6102 | 0.6257 |
| hf_Reformer | 4 | 0.9996 | 0.9996 | 0.6037 | 0.9999 | 0.5851 | 1.0017 |
| lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5124 | 0.5596 | 0.5596 | 0.5596 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | 0.4478 | 0.4688 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.4994 | 0.7838 | 0.2123 | 0.2137 |
| hf_GPT2_large | 4 | 0.9956 | 0.8732 | nan | nan | nan | 1.1499 |
| tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3993 | nan | 0.4112 |
| dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | nan | nan |
| hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| timm_vision_transformer_large | 8 | 184.5443 | 186.3302 | nan | nan | 159.223 | 161.8698 |
| Background_Matting | 4 | 134.8497 | 924.794 | nan | nan | 117.085 | 118.6397 |
| hf_T5 | 8 | 174.7034 | 188.0951 | nan | 129.1714 | 93.2561 | 93.1601 |
| hf_T5_large | 2 | 213.7842 | 256.7219 | nan | nan | 88.5949 | 109.3668 |
| timm_nfnet | 128 | 131.4535 | 131.9692 | 150.0397 | 143.1163 | 87.3883 | 91.6715 |
| hf_Reformer | 4 | 82.2837 | 82.3273 | 82.9056 | 127.5164 | 69.9374 | 69.6825 |
| Super_SloMo | 6 | 79.5798 | 451.9522 | nan | nan | 64.4157 | 66.1253 |
| yolov3 | 16 | 68.5586 | 69.5662 | 85.5946 | nan | 63.0349 | 64.5218 |
| demucs | 4 | 58.455 | 57.0518 | 57.168 | 57.1129 | 56.9642 | 57.3555 |
| timm_regnet | 32 | 74.4033 | 83.0492 | 81.166 | 102.0063 | 55.0966 | 59.0269 |
| vgg16 | 64 | 66.4376 | 66.2959 | 77.4475 | 68.2366 | 52.0935 | 52.5242 |
| resnet152 | 32 | 94.9055 | 93.1869 | 74.0984 | nan | 46.3811 | 73.1892 |
| speech_transformer | 32 | 59.7439 | 73.0754 | 34.2366 | 96.0289 | 41.4274 | 39.9127 |
| fastNLP_Bert | 6 | 55.8705 | 62.8321 | 73.6432 | nan | 36.6969 | 37.9577 |
| timm_efficientdet | 1 | 165.6189 | 199.2648 | 78.6469 | nan | 36.2198 | 108.9617 |
| attention_is_all_you_need_pytorch | 256 | 53.066 | 56.6785 | 63.2033 | nan | 33.4294 | 34.6538 |
| hf_Bart | 4 | 67.7128 | 69.3567 | 65.4843 | nan | 33.2851 | 34.8298 |
| mobilenet_v2 | 96 | 48.9138 | 49.4806 | 64.6099 | 46.9538 | 31.4528 | 32.2349 |
| hf_Albert | 8 | 68.408 | 71.5163 | 88.4786 | nan | 28.7539 | 29.5093 |
| pytorch_unet | 1 | 40.0462 | 189.7521 | nan | nan | 28.5928 | 29.3082 |
| hf_GPT2 | 4 | 48.1619 | 50.1525 | 60.5955 | 169.4284 | 25.6744 | 25.8772 |
| timm_vovnet | 32 | 34.5721 | 35.9825 | 37.3822 | 39.6921 | 24.9173 | 28.2681 |
| shufflenet_v2_x1_0 | 128 | 40.8439 | 39.9565 | 42.0468 | 48.9605 | 24.3844 | 31.4407 |
| timm_efficientnet | 32 | 48.0249 | 57.7823 | 43.4258 | 69.485 | 22.6902 | 37.4491 |
| hf_Bert | 4 | 40.3979 | 48.7983 | 44.1996 | nan | 21.0746 | 23.957 |
| hf_DistilBert | 8 | 31.094 | 32.0647 | 41.8717 | 85.693 | 20.5391 | 20.969 |
| resnet50 | 32 | 35.2558 | 32.4236 | 32.6283 | 42.6581 | 19.4118 | 25.7889 |
| BERT_pytorch | 16 | 67.3953 | 66.0046 | 35.225 | 71.4504 | 16.4701 | 24.2547 |
| densenet121 | 4 | 78.2573 | 81.3114 | 30.7293 | 102.0962 | 13.3644 | 58.5489 |
| timm_resnest | 32 | 24.7072 | 24.1824 | 29.3934 | 26.1366 | 12.875 | 14.7184 |
| mobilenet_v3_large | 32 | 35.2822 | 36.0569 | 24.1346 | 46.6719 | 12.0437 | 28.3887 |
| mnasnet1_0 | 32 | 29.3609 | 28.9724 | 23.4895 | 39.4494 | 11.56 | 22.093 |
| pytorch_stargan | 16 | 16.2005 | 14.8421 | 15.5879 | nan | 10.9592 | 11.4609 |
| nvidia_deeprecommender | 256 | 10.4003 | 10.4398 | 14.9067 | 10.3311 | 10.4852 | 10.0698 |
| timm_vision_transformer | 8 | 29.6187 | 35.2927 | 16.712 | 50.5174 | 9.7945 | 19.8855 |
| resnext50_32x4d | 8 | 28.3059 | 30.4585 | 15.6588 | 39.9759 | 8.5396 | 23.8969 |
| LearningToPaint | 96 | 14.9131 | 14.7165 | 12.8123 | 19.6985 | 8.2067 | 12.5766 |
| alexnet | 128 | 9.8499 | 9.8774 | 12.0883 | 10.5935 | 8.1117 | 8.1483 |
| tts_angular | 64 | 6.4304 | 6.8481 | 6.4013 | 6.7183 | 6.7367 | 6.5598 |
| pytorch_CycleGAN_and_pix2pix | 1 | 17.4012 | 18.4897 | 10.2992 | nan | 6.6974 | 11.7395 |
| squeezenet1_1 | 32 | 15.026 | 15.2987 | 10.1964 | 20.8274 | 6.2337 | 11.8277 |
| resnet18 | 16 | 12.9573 | 12.9436 | 8.1289 | 16.6912 | 4.7535 | 10.624 |
| functorch_dp_cifar10 | 64 | 13.7894 | 14.9443 | 5.9598 | nan | 2.9759 | 10.8928 |
| pytorch_struct | 200 | 4.5649 | 6.7529 | 4.583 | 7.6035 | 2.2624 | 3.7671 |
| drq | 1 | 3.8306 | 4.9238 | 1.9456 | 7.167 | 1.3208 | 3.5583 |
| dcgan | 32 | 3.4934 | 3.4131 | 1.9089 | 4.577 | 1.1255 | 3.063 |
| soft_actor_critic | 256 | 1.3786 | 1.8817 | 1.1665 | 2.9966 | 0.8535 | 1.4134 |
| lennard_jones | 1000 | 1.4907 | 1.919 | 1.1787 | 3.3316 | 0.753 | 1.4949 |
| tacotron2 | 64 | 3327.3997 | 3955.4932 | 3501.0478 | 5318.0273 | nan | 3529.1918 |
| hf_GPT2_large | 4 | 209.7612 | 211.9608 | nan | nan | nan | 112.9008 |
| dlrm | 2048 | 494.701 | 461.9965 | nan | 504.8439 | nan | nan |
| hf_BigBird | 2 | 192.3486 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
~~~
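
When reading the absolute latency table, it can help to pick the lowest-latency backend per model while skipping `nan` entries (which presumably mark runs that failed or were not measured). The following is only a minimal sketch of that lookup: the three latency rows are copied from the table above, while the all-`nan` `moco` row, the `fastest_backend` helper, and the dictionary layout are illustrative assumptions, not part of the benchmark tooling.

```python
import math

NAN = float("nan")

# Absolute latency (ms) for a few rows of the torchbench table above.
# Dictionary layout is an assumption made for this sketch.
latency_ms = {
    "timm_nfnet": {
        "eager": 131.4535, "aot_eager": 131.9692, "aot_cudagraphs": 150.0397,
        "nvprims_nvfuser": 143.1163, "inductor": 87.3883,
        "inductor_no_cudagraphs": 91.6715,
    },
    "speech_transformer": {
        "eager": 59.7439, "aot_eager": 73.0754, "aot_cudagraphs": 34.2366,
        "nvprims_nvfuser": 96.0289, "inductor": 41.4274,
        "inductor_no_cudagraphs": 39.9127,
    },
    "tacotron2": {
        "eager": 3327.3997, "aot_eager": 3955.4932, "aot_cudagraphs": 3501.0478,
        "nvprims_nvfuser": 5318.0273, "inductor": NAN,
        "inductor_no_cudagraphs": 3529.1918,
    },
    "moco": {"eager": NAN, "inductor": NAN},
}


def fastest_backend(row):
    """Return the backend with the lowest latency, ignoring nan entries.

    Hypothetical helper: returns None when every backend is nan.
    """
    valid = {backend: ms for backend, ms in row.items() if not math.isnan(ms)}
    if not valid:
        return None
    return min(valid, key=valid.get)


fastest = {model: fastest_backend(row) for model, row in latency_ms.items()}
for model, backend in fastest.items():
    print(f"{model}: {backend}")
```

Filtering `nan` before calling `min` matters because comparisons against `nan` are always false, so a naive `min` over the raw values can return a misleading result depending on ordering.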

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | YituTechConvBert | 1 | 1.0219 | 0.8515 | 2.3046 | 0.0 | 4.8043 | 1.6603 | | MobileBertForMaskedLM | 32 | 1.0176 | 0.838 | 1.7401 | 0.0 | 4.5374 | 1.7092 | | MobileBertForQuestionAnswering | 64 | 1.0179 | 0.8384 | 1.487 | 0.0 | 4.4294 | 1.6996 | | CamemBert | 1 | 1.0388 | 0.8591 | 1.7357 | 0.0 | 3.7037 | 1.8027 | | MT5ForConditionalGeneration | 8 | 1.0173 | 0.8638 | 1.4969 | 0.8571 | 3.6268 | 2.493 | | DistillGPT2 | 1 | 1.0245 | 0.8757 | 1.2387 | 0.0 | 2.6873 | 2.0199 | | GPT2ForSequenceClassification | 4 | 1.0009 | 0.97 | 0.0 | 0.5011 | 2.3485 | 2.2924 | | M2M100ForConditionalGeneration | 8 | 1.0075 | 0.8492 | 1.2232 | 0.6499 | 2.3261 | 1.7679 | | ElectraForQuestionAnswering | 64 | 1.0007 | 0.9793 | 0.7692 | 0.0 | 2.1233 | 2.0657 | | MegatronBertForQuestionAnswering | 16 | 1.0326 | 0.8717 | 1.0267 | 0.0 | 2.0405 | 1.802 | | PLBartForConditionalGeneration | 16 | 1.0156 | 0.8455 | 1.034 | 0.0 | 1.9868 | 1.7464 | | PegasusForConditionalGeneration | 16 | 1.0081 | 0.8326 | 0.9209 | 0.6124 | 1.9629 | 1.5811 | | XGLMForCausalLM | 8 | 1.0121 | 0.8305 | 0.9369 | 0.0 | 1.9346 | 1.725 | | MegatronBertForCausalLM | 16 | 1.0333 | 0.8736 | 0.9614 | 0.0 | 1.8948 | 1.7878 | | MBartForConditionalGeneration | 16 | 1.0144 | 0.8428 | 0.9004 | 0.6301 | 1.8743 | 1.607 | | LayoutLMForSequenceClassification | 16 | 1.0005 | 0.9802 | 0.7763 | 0.0 | 1.8655 | 1.8155 | | ElectraForCausalLM | 32 | 1.0005 | 0.9431 | 0.7171 | 0.0 | 1.8356 | 1.8305 | | AlbertForQuestionAnswering | 4 | 0.9997 | 0.8858 | 0.0 | 0.0 | 1.6703 | 1.6597 | | T5Small | 1 | 1.0267 | 0.8986 | 1.1453 | 0.8438 | 1.6696 
| 1.4693 | | AlbertForMaskedLM | 4 | 1.0002 | 0.8853 | 0.0 | 0.0 | 1.6624 | 1.6469 | | LayoutLMForMaskedLM | 16 | 1.0007 | 0.9714 | 0.7564 | 0.0 | 1.6531 | 1.6301 | | T5ForConditionalGeneration | 4 | 1.0069 | 0.9202 | 0.7578 | 1.1617 | 1.616 | 1.6827 | | Speech2Text2ForCausalLM | 128 | 1.0033 | 0.9294 | 0.7193 | 0.794 | 1.5908 | 1.647 | | OPTForCausalLM | 32 | 1.0149 | 0.9294 | 0.7754 | 0.3293 | 1.5704 | 1.592 | | DistilBertForQuestionAnswering | 64 | 0.9995 | 0.969 | 0.7437 | 0.34 | 1.4979 | 1.4483 | | BertForQuestionAnswering | 128 | 0.9999 | 0.9848 | 0.7788 | 0.0 | 1.4973 | 1.4721 | | RobertaForQuestionAnswering | 128 | 0.9999 | 0.981 | 0.7793 | 0.0 | 1.4904 | 1.4651 | | BartForConditionalGeneration | 2 | 1.0053 | 0.968 | 0.0 | 0.0 | 1.4649 | 1.4254 | | BartForCausalLM | 4 | 1.0011 | 0.9686 | 0.7583 | 0.0 | 1.4441 | 1.4438 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0074 | 0.9258 | 0.7314 | 0.0 | 1.4422 | 1.4426 | | RobertaForCausalLM | 64 | 1.0001 | 0.959 | 0.7472 | 0.0 | 1.4395 | 1.4302 | | BertForMaskedLM | 64 | 1.0001 | 0.9583 | 0.7402 | 0.0 | 1.3501 | 1.336 | | DebertaForMaskedLM | 4 | 0.9145 | 0.7254 | 0.7862 | 0.0 | 1.2985 | 1.1353 | | PLBartForCausalLM | 32 | 1.0057 | 0.9427 | 0.793 | 0.8407 | 1.2933 | 1.2895 | | BlenderbotSmallForCausalLM | 64 | 1.002 | 0.9297 | 0.7162 | 0.0 | 1.269 | 1.2847 | | DistilBertForMaskedLM | 64 | 1.0001 | 0.9526 | 0.7096 | 0.467 | 1.2677 | 1.2675 | | MBartForCausalLM | 32 | 1.0027 | 0.949 | 0.755 | 0.8571 | 1.2022 | 1.1958 | | TrOCRForCausalLM | 32 | 1.0035 | 0.9494 | 0.7572 | 0.0 | 1.1985 | 1.1893 | | PegasusForCausalLM | 32 | 1.0004 | 0.954 | 0.7504 | 0.8595 | 1.1856 | 1.1912 | | DebertaForQuestionAnswering | 8 | 0.9734 | 0.8864 | 0.722 | 0.0 | 1.142 | 1.2134 | | BigBird | 1 | 0.9822 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass | | YituTechConvBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | 
fail_to_run | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass | | OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | T5Small | 1 | pass | pass | pass | pass | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 
| fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+ | DebertaForQuestionAnswering | 8 | 5.1094 | 10.9849 | 35.4218 | nan | 100.6498 | 39.3602 | | DebertaForMaskedLM | 4 | 5.0069 | 10.971 | 35.4743 | nan | 98.2334 | 38.4765 | | XGLMForCausalLM | 8 | 3.166 | 13.6443 | 28.3232 | nan | 79.5643 | 75.9818 | | MobileBertForMaskedLM | 32 | 10.2549 | 32.5129 | 57.9007 | nan | 73.0495 | 70.3473 | | MobileBertForQuestionAnswering | 64 | 10.0451 | 32.4485 | 58.3949 | nan | 71.3702 | 69.4718 | | M2M100ForConditionalGeneration | 8 | 4.3457 | 16.8933 | 28.6187 | 367.1444 | 68.3941 | 66.05 | | MBartForConditionalGeneration | 16 | 3.947 | 17.4759 | 28.8296 | 403.1541 | 56.7113 | 53.9014 | | PegasusForConditionalGeneration | 16 | 3.8184 | 16.9218 | 27.5504 | 370.129 | 56.0866 | 51.369 | | BartForConditionalGeneration | 2 | 3.9359 | 17.3592 | nan | nan | 54.5033 | 53.2429 | | YituTechConvBert | 1 | 2.831 | 10.7535 | 16.2706 | nan | 51.4981 | 48.7352 | | MegatronBertForCausalLM | 16 | 4.0051 | 14.1851 | 22.3966 | nan | 44.0048 | 43.6264 | | MegatronBertForQuestionAnswering | 16 | 4.0426 | 14.3897 | 22.0127 | nan | 43.1248 | 43.4233 | | MT5ForConditionalGeneration | 8 | 3.8871 | 12.7234 | 21.0245 | 157.9007 | 39.5629 | 38.6649 | | BlenderbotSmallForConditionalGeneration | 64 | 2.5458 | 11.6501 | 18.2531 | nan | 36.9742 | 35.6609 | | T5Small | 1 | 2.7257 | 8.6484 | 12.8905 | 96.7919 | 
32.042 | 31.2433 | | PLBartForConditionalGeneration | 16 | 2.0969 | 8.6645 | 13.3667 | nan | 31.0124 | 30.6241 | | T5ForConditionalGeneration | 4 | 2.7739 | 8.7351 | 13.5988 | 92.9152 | 30.3169 | 29.9496 | | LayoutLMForSequenceClassification | 16 | 2.2776 | 7.5236 | 11.4095 | nan | 29.105 | 28.0431 | | ElectraForCausalLM | 32 | 1.9615 | 7.1288 | 11.0344 | nan | 28.1203 | 25.7585 | | PegasusForCausalLM | 32 | 1.5503 | 6.5904 | 10.4812 | 107.1259 | 25.9792 | 23.7474 | | MBartForCausalLM | 32 | 1.5305 | 6.4168 | 9.6023 | 117.0933 | 24.4754 | 23.5407 | | LayoutLMForMaskedLM | 16 | 2.2949 | 7.5312 | 11.6996 | nan | 23.6255 | 22.6189 | | TrOCRForCausalLM | 32 | 1.4726 | 6.741 | 9.8864 | nan | 23.4104 | 22.4791 | | BartForCausalLM | 4 | 1.5476 | 6.6415 | 9.8115 | nan | 23.2848 | 22.7779 | | RobertaForCausalLM | 64 | 1.8851 | 7.1386 | 11.0732 | nan | 23.2263 | 21.7646 | | OPTForCausalLM | 32 | 1.6163 | 6.5259 | 11.3955 | 105.4353 | 23.1004 | 22.2308 | | BertForMaskedLM | 64 | 1.8608 | 6.9881 | 10.7779 | nan | 22.6446 | 21.8753 | | ElectraForQuestionAnswering | 64 | 1.9874 | 7.0971 | 11.0107 | nan | 22.3043 | 21.6035 | | BertForQuestionAnswering | 128 | 1.942 | 7.0929 | 10.4832 | nan | 21.6413 | 20.8908 | | RobertaForQuestionAnswering | 128 | 1.9086 | 7.4145 | 10.7447 | nan | 20.8637 | 20.0173 | | CamemBert | 1 | 1.9635 | 7.097 | 10.3388 | nan | 20.8201 | 20.1427 | | GPT2ForSequenceClassification | 4 | 1.8769 | 6.84 | nan | 88.3981 | 20.1701 | 19.2186 | | AlbertForMaskedLM | 4 | 1.7912 | 6.8553 | nan | nan | 19.635 | 18.6826 | | AlbertForQuestionAnswering | 4 | 1.7313 | 6.765 | nan | nan | 19.3243 | 18.0536 | | BlenderbotSmallForCausalLM | 64 | 1.0092 | 4.287 | 6.4266 | nan | 17.2191 | 16.8708 | | Speech2Text2ForCausalLM | 128 | 0.899 | 3.4309 | 5.9088 | 52.4149 | 16.351 | 14.7639 | | PLBartForCausalLM | 32 | 0.8533 | 3.3757 | 4.9986 | 67.2627 | 15.1941 | 14.9019 | | DistillGPT2 | 1 | 0.9802 | 3.228 | 4.7286 | nan | 14.1498 | 13.7131 | | DistilBertForMaskedLM | 64 | 0.7982 
| 3.471 | 5.7162 | 56.3105 | 13.32 | 12.6939 | | DistilBertForQuestionAnswering | 64 | 0.8486 | 3.535 | 6.1513 | 56.5856 | 12.5811 | 12.0996 | | BigBird | 1 | 4.0718 | nan | nan | nan | nan | nan | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 | | AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0992 | 1.5169 | | GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | nan | 1.2229 | 1.0775 | 1.1712 | | BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0568 | 1.1144 | | ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 | | BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 | | RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 1.0109 | 1.0722 | | LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 | | BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9837 | 1.1976 | | PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | 1.1321 | 0.9708 | 1.0342 | | T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9662 | 1.1856 | | T5Small | 1 | 1.0 | 0.8935 | 0.3618 | 0.9978 | 0.965 | 1.1391 | | BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9593 | 1.1105 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9481 | 0.9848 | | 
MBartForCausalLM | 32 | 1.0 | 0.8924 | 0.3996 | 1.1057 | 0.9417 | 1.0114 | | BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3786 | nan | 0.9293 | 0.9793 | | RobertaForCausalLM | 64 | 1.0 | 0.8993 | 0.3788 | nan | 0.9289 | 0.9789 | | DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3468 | 1.1114 | 0.9267 | 1.0655 | | OPTForCausalLM | 32 | 1.0003 | 0.8678 | 0.3725 | 1.0333 | 0.9251 | 1.0063 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | 0.4002 | 0.9984 | 0.9218 | 1.0986 | | TrOCRForCausalLM | 32 | 1.0 | 0.8921 | 0.3997 | nan | 0.921 | 0.9877 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9628 | 0.4377 | 1.1462 | 0.9159 | 1.0993 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0179 | | MegatronBertForCausalLM | 16 | 1.0001 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0276 | | PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | 0.4145 | nan | 0.8843 | 1.0284 | | DistilBertForMaskedLM | 64 | 0.9999 | 0.8599 | 0.3635 | 1.0791 | 0.8802 | 0.948 | | MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | 0.4067 | 0.919 | 0.8751 | 0.919 | | Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.3532 | 1.0437 | 0.869 | 0.949 | | ElectraForCausalLM | 32 | 0.9977 | 0.848 | 0.3928 | nan | 0.856 | 0.9328 | | PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3978 | 0.9947 | 0.8549 | 0.9361 | | BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.846 | 0.9426 | | CamemBert | 1 | 0.999 | 0.8143 | 0.416 | nan | 0.8061 | 0.9309 | | XGLMForCausalLM | 8 | 0.9918 | 0.9164 | 0.4336 | nan | 0.8055 | 0.9902 | | DistillGPT2 | 1 | 0.9975 | 0.8033 | 0.4019 | nan | 0.8046 | 1.024 | | YituTechConvBert | 1 | 0.9718 | 0.868 | 0.4314 | nan | 0.791 | 0.9314 | | M2M100ForConditionalGeneration | 8 | 0.9818 | 0.9468 | 0.4275 | 1.0397 | 0.7585 | 1.0035 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | 0.3466 | nan | 0.6698 | 0.9454 | | MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | 0.3107 | nan | 0.6085 | 0.8221 | | DebertaForMaskedLM | 4 | 
0.9982 | 0.9825 | 0.3623 | nan | 0.409 | 1.0674 | | DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3252 | nan | 0.3071 | 1.1616 | | BigBird | 1 | 0.974 | nan | nan | nan | nan | nan | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForMaskedLM | 4 | 266.9808 | 301.4476 | nan | nan | 160.6437 | 162.3453 | | AlbertForQuestionAnswering | 4 | 264.7338 | 298.9101 | nan | nan | 158.656 | 159.8294 | | BartForConditionalGeneration | 2 | 135.5512 | 140.5666 | nan | nan | 93.5199 | 95.6671 | | BartForCausalLM | 4 | 112.5075 | 114.2754 | 148.2553 | nan | 77.8686 | 77.8044 | | BlenderbotSmallForConditionalGeneration | 64 | 109.8198 | 121.1354 | 150.3638 | nan | 76.9849 | 76.8882 | | RobertaForQuestionAnswering | 128 | 111.4452 | 114.5657 | 142.6016 | nan | 74.7644 | 75.9262 | | BertForQuestionAnswering | 128 | 110.6958 | 112.2417 | 142.1188 | nan | 73.8936 | 75.0982 | | LayoutLMForMaskedLM | 16 | 111.9472 | 115.1884 | 148.1229 | nan | 67.9049 | 68.7544 | | DebertaForQuestionAnswering | 8 | 76.9812 | 84.723 | 104.2131 | nan | 65.7909 | 61.8723 | | MBartForConditionalGeneration | 16 | 102.2446 | 123.5719 | 114.2392 | 162.9772 | 64.7489 | 68.7878 | | PegasusForConditionalGeneration | 16 | 121.0961 | 123.9568 | 112.8016 | 167.5328 | 64.7003 | 70.7149 | | T5ForConditionalGeneration | 4 | 101.9715 | 110.3414 | 134.4927 | 87.407 | 62.8499 | 60.3223 | | PegasusForCausalLM | 32 | 69.0955 | 72.6173 | 
91.8954 | 81.3662 | 58.6372 | 58.8657 | | TrOCRForCausalLM | 32 | 69.7905 | 73.6147 | 92.1479 | nan | 58.477 | 58.4341 | | MBartForCausalLM | 32 | 69.4398 | 73.3261 | 92.3397 | 81.3989 | 58.3927 | 58.3606 | | RobertaForCausalLM | 64 | 80.5361 | 84.0085 | 108.1013 | nan | 56.1877 | 56.2222 | | BertForMaskedLM | 64 | 75.5677 | 78.8615 | 102.0918 | nan | 56.1156 | 56.4891 | | ElectraForQuestionAnswering | 64 | 115.5134 | 116.944 | 150.2581 | nan | 53.9129 | 55.5586 | | LayoutLMForSequenceClassification | 16 | 97.2564 | 99.0974 | 125.3579 | nan | 52.272 | 53.4826 | | XGLMForCausalLM | 8 | 101.3239 | 105.4167 | 93.9126 | nan | 51.7708 | 60.7491 | | MobileBertForQuestionAnswering | 64 | 178.0668 | 212.3448 | 118.8405 | nan | 50.6934 | 135.9008 | | DebertaForMaskedLM | 4 | 68.668 | 84.9158 | 78.6144 | nan | 49.8328 | 55.835 | | M2M100ForConditionalGeneration | 8 | 116.8429 | 125.5647 | 88.5412 | 163.7432 | 48.4954 | 64.519 | | ElectraForCausalLM | 32 | 87.2679 | 92.5559 | 121.8527 | nan | 47.6653 | 47.8105 | | BlenderbotSmallForCausalLM | 64 | 58.8255 | 63.0493 | 81.7305 | nan | 46.4031 | 45.9116 | | MegatronBertForCausalLM | 16 | 78.8458 | 94.4259 | 84.1221 | nan | 45.8633 | 49.0138 | | MegatronBertForQuestionAnswering | 16 | 77.5054 | 93.3606 | 76.8818 | nan | 42.0835 | 51.9932 | | MobileBertForMaskedLM | 32 | 215.8254 | 207.9886 | 99.2902 | nan | 41.6477 | 131.5014 | | GPT2ForSequenceClassification | 4 | 91.3362 | 93.8195 | nan | 183.8734 | 39.3454 | 39.8813 | | T5Small | 1 | 61.7367 | 68.8572 | 53.267 | 74.8019 | 39.1405 | 42.9713 | | DistilBertForMaskedLM | 64 | 45.3527 | 47.4609 | 63.793 | 96.9768 | 35.8 | 35.7405 | | OPTForCausalLM | 32 | 53.9712 | 57.9892 | 70.0301 | 163.8954 | 34.7516 | 35.5696 | | PLBartForCausalLM | 32 | 39.4137 | 41.4402 | 49.5845 | 46.1101 | 30.6265 | 30.6572 | | PLBartForConditionalGeneration | 16 | 54.9883 | 64.9975 | 53.5031 | nan | 29.6329 | 38.6104 | | MT5ForConditionalGeneration | 8 | 85.9762 | 100.8147 | 58.2551 | 101.0261 | 25.8218 | 
37.2866 | | DistilBertForQuestionAnswering | 64 | 30.5802 | 31.5177 | 41.1785 | 90.1738 | 20.4367 | 21.0237 | | Speech2Text2ForCausalLM | 128 | 30.2609 | 32.8901 | 42.4344 | 37.3993 | 19.4509 | 20.074 | | YituTechConvBert | 1 | 63.0403 | 74.5824 | 27.3631 | nan | 13.8039 | 40.4474 | | CamemBert | 1 | 38.1446 | 44.9047 | 21.8965 | nan | 11.1242 | 22.6147 | | DistillGPT2 | 1 | 21.8047 | 23.1541 | 16.079 | nan | 7.9339 | 11.4962 | | BigBird | 1 | 193.7751 | nan | nan | nan | nan | nan | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
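
The speedup columns appear to be roughly the eager latency divided by the backend latency, although the values do not match the absolute latency table exactly (for example, AlbertForMaskedLM gives about 1.662 from the latencies versus 1.6624 in the speedup table), so the two tables may come from separate measurements. The sketch below works under that assumption and also shows a geometric mean as one common way to summarize ratios. The `speedup` and `geomean` helpers are hypothetical and are not taken from the benchmark scripts.

```python
import math

# Absolute latencies (ms) copied from the huggingface amp table above.
latency_ms = {
    "AlbertForMaskedLM": {"eager": 266.9808, "inductor": 160.6437},
    "DistillGPT2": {"eager": 21.8047, "inductor": 7.9339},
    "BigBird": {"eager": 193.7751, "inductor": float("nan")},
}


def speedup(row, backend, baseline="eager"):
    """Assumed definition: baseline latency / backend latency.

    Returns None when either measurement is missing or nan.
    """
    base = row.get(baseline)
    other = row.get(backend)
    if base is None or other is None:
        return None
    if math.isnan(base) or math.isnan(other) or other <= 0:
        return None
    return base / other


def geomean(values):
    """Geometric mean over the positive, non-missing values (None if empty)."""
    vals = [v for v in values if v is not None and v > 0]
    if not vals:
        return None
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


speedups = {name: speedup(row, "inductor") for name, row in latency_ms.items()}
summary = geomean(speedups.values())

for name, value in speedups.items():
    print(name, "n/a" if value is None else f"{value:.4f}")
print("geomean over measured models:", f"{summary:.4f}")
```

Models whose backend failed to run are skipped here rather than counted as 0.0, which is a design choice for this sketch: a 0.0 entry would drive a geometric mean to zero.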

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | xcit_large_24_p8_224 | 5 | 1.0014 | 0.0 | 0.0 | 0.0 | 2.5664 | 1.8562 | | tnt_s_patch16_224 | 128 | 1.0 | 0.9985 | 0.0 | 0.0 | 2.1314 | 2.0926 | | ghostnet_100 | 128 | 1.0041 | 0.977 | 0.8831 | 1.0086 | 2.0983 | 1.7902 | | regnety_002 | 128 | 0.9802 | 0.9424 | 1.1124 | 0.8553 | 2.0911 | 1.4394 | | lcnet_050 | 128 | 0.9679 | 0.9481 | 0.8462 | 1.0077 | 2.0149 | 1.6139 | | twins_pcpvt_base | 64 | 1.0061 | 0.9394 | 0.9237 | 0.0 | 1.7411 | 1.6753 | | hrnet_w18 | 128 | 1.0032 | 1.0375 | 0.8634 | 0.0 | 1.7011 | 1.4732 | | coat_lite_mini | 128 | 1.0005 | 0.9954 | 0.8475 | 1.1326 | 1.641 | 1.6321 | | res2net101_26w_4s | 64 | 1.0039 | 1.0023 | 1.0252 | 0.0 | 1.608 | 1.3248 | | volo_d1_224 | 64 | 0.9999 | 0.9934 | 0.8455 | 0.0 | 1.6032 | 1.5655 | | resnest101e | 64 | 0.9996 | 0.99 | 0.8086 | 0.0 | 1.5947 | 1.4267 | | dla102 | 128 | 1.0 | 0.9959 | 0.8369 | 1.3129 | 1.58 | 1.5506 | | cait_m36_384 | 4 | 1.0003 | 0.9873 | 0.0 | 0.0 | 1.5716 | 1.5138 | | gmlp_s16_224 | 128 | 0.9999 | 0.9959 | 0.7868 | 1.0 | 1.5537 | 1.5484 | | gmixer_24_224 | 128 | 0.9997 | 0.8803 | 0.722 | 0.923 | 1.5444 | 1.4919 | | nfnet_l0 | 128 | 0.9999 | 0.8101 | 0.713 | 0.8485 | 1.5401 | 1.4704 | | swin_base_patch4_window7_224 | 64 | 0.9997 | 0.9581 | 0.0 | 0.0 | 1.5166 | 1.5276 | | adv_inception_v3 | 128 | 1.0 | 0.9966 | 0.845 | 1.1423 | 1.505 | 1.4729 | | gluon_inception_v3 | 128 | 1.0 | 0.9958 | 0.8528 | 1.141 | 1.5032 | 1.4695 | | dm_nfnet_f0 | 128 | 0.9993 | 0.9992 | 0.875 | 0.923 | 1.503 | 1.4288 | | inception_v3 | 128 | 0.9999 | 0.996 | 0.8424 | 1.1429 | 1.5021 | 1.4674 | | res2net50_14w_8s | 128 | 1.0005 | 
0.9925 | 0.809 | 0.9859 | 1.4766 | 1.4033 | | mobilenetv3_large_100 | 128 | 0.955 | 0.9453 | 0.7814 | 0.9872 | 1.4491 | 1.4362 | | crossvit_9_240 | 128 | 1.0002 | 0.9945 | 0.8362 | 0.9166 | 1.4455 | 1.4163 | | selecsls42b | 128 | 0.9996 | 0.9955 | 0.8398 | 1.2825 | 1.4432 | 1.4124 | | mnasnet_100 | 128 | 0.9524 | 0.9447 | 0.7867 | 1.1844 | 1.4258 | 1.4586 | | resmlp_12_224 | 128 | 1.0004 | 0.9989 | 0.782 | 1.4864 | 1.4256 | 1.4119 | | res2next50 | 128 | 0.9998 | 0.9954 | 0.832 | 1.1458 | 1.4181 | 1.3459 | | fbnetv3_b | 128 | 0.9523 | 0.9417 | 0.7757 | 0.0 | 1.41 | 1.3987 | | mobilenetv2_100 | 128 | 0.9527 | 0.9418 | 0.7206 | 1.1255 | 1.4039 | 1.4317 | | jx_nest_base | 32 | 0.9997 | 0.9925 | 0.8002 | 0.0 | 1.4037 | 1.3672 | | mobilevit_s | 64 | 0.9735 | 0.8139 | 0.6555 | 0.0 | 1.389 | 1.3759 | | ese_vovnet19b_dw | 128 | 0.9703 | 0.9653 | 0.7669 | 1.1284 | 1.3711 | 1.3779 | | spnasnet_100 | 128 | 0.9464 | 0.937 | 0.778 | 1.0938 | 1.366 | 1.3964 | | convit_base | 64 | 0.9998 | 0.9965 | 0.8299 | 1.2221 | 1.366 | 1.3203 | | tf_efficientnet_b0 | 128 | 0.9655 | 0.808 | 0.6658 | 0.9373 | 1.3509 | 1.3555 | | fbnetc_100 | 128 | 0.9533 | 0.9429 | 0.7906 | 1.1656 | 1.3504 | 1.3768 | | pit_b_224 | 64 | 0.9997 | 0.9953 | 0.8215 | 0.972 | 1.3498 | 1.3453 | | botnet26t_256 | 128 | 0.9803 | 0.9758 | 0.8122 | 1.2787 | 1.3282 | 1.3329 | | poolformer_m36 | 64 | 0.9997 | 0.9984 | 0.8073 | 0.0 | 1.3276 | 1.2964 | | cspdarknet53 | 64 | 0.9431 | 0.9338 | 0.7543 | 1.0846 | 1.3079 | 1.3235 | | pnasnet5large | 16 | 1.0055 | 1.0333 | 0.8513 | 0.0 | 1.3064 | 1.2562 | | mixer_b16_224 | 128 | 1.0003 | 0.9976 | 0.802 | 0.901 | 1.2887 | 1.2766 | | deit_base_distilled_patch16_224 | 64 | 0.9997 | 0.992 | 0.7979 | 0.9741 | 1.2833 | 1.2628 | | beit_base_patch16_224 | 64 | 0.9999 | 0.978 | 0.0 | 0.0 | 1.278 | 1.2729 | | eca_botnext26ts_256 | 128 | 0.9808 | 0.8119 | 0.6709 | 1.067 | 1.2759 | 1.2714 | | rexnet_100 | 128 | 0.9653 | 0.8475 | 0.6895 | 0.0 | 1.2756 | 1.2781 | | tinynet_a | 128 | 0.9657 | 
0.8049 | 0.6629 | 0.7912 | 1.2581 | 1.2676 | | visformer_small | 128 | 0.9999 | 1.0014 | 0.8379 | 0.0 | 1.2299 | 1.1687 | | sebotnet33ts_256 | 64 | 0.9671 | 0.8369 | 0.6791 | 0.971 | 1.1979 | 1.2027 | | vit_base_patch16_224 | 64 | 1.0001 | 0.9937 | 0.8352 | 0.9109 | 1.1967 | 1.1799 | | tf_mixnet_l | 128 | 0.9811 | 0.9091 | 0.7949 | 0.0 | 1.1791 | 1.1739 | | gluon_xception65 | 32 | 0.9996 | 0.9897 | 0.7403 | 0.0 | 1.1612 | 1.1259 | | mixnet_l | 128 | 0.9792 | 0.9052 | 0.7943 | 0.0 | 1.161 | 1.157 | | dpn107 | 32 | 0.9402 | 0.9247 | 0.7519 | 0.0 | 1.1539 | 1.1799 | | swsl_resnext101_32x16d | 32 | 0.9999 | 0.9811 | 0.8039 | 0.0 | 1.1326 | 1.0573 | | repvgg_a2 | 128 | 0.9431 | 0.9348 | 0.7959 | 1.0717 | 1.1047 | 1.12 | | gernet_l | 128 | 0.9476 | 0.9381 | 0.7678 | 1.0632 | 1.067 | 1.0778 | | convmixer_768_32 | 32 | 1.0 | 0.9982 | 0.9227 | 0.0 | 1.0557 | 1.0508 | | convnext_base | 64 | 0.999 | 0.9947 | 0.7939 | 0.0 | 0.6647 | 0.6597 | | eca_halonext26ts | 128 | 0.9813 | 0.8156 | 0.6782 | 0.0 | 0.0 | 0.0 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass 
| pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | cait_m36_384 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass | | resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass 
| pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 6.9459 | 31.1997 | 58.4777 | nan | 147.2362 | 132.9441 | | twins_pcpvt_base | 64 | 2.9719 | 15.1107 | 25.8914 | nan | 127.3752 | 125.1997 | | pnasnet5large | 16 | 5.2797 | 22.8864 | 41.6824 | nan | 90.6025 | 87.3532 | | xcit_large_24_p8_224 | 5 | 3.5867 | nan | nan | nan | 87.3497 | 82.1953 | | swin_base_patch4_window7_224 | 64 | 3.3683 | 13.2007 | nan | nan | 80.7758 | 79.1714 | | cait_m36_384 | 4 | 3.7647 | 19.4481 | nan | nan | 77.1027 | 77.1788 | | resnest101e | 64 | 3.7056 | 16.3301 | 27.8869 | nan | 75.967 | 72.8207 | | convnext_base | 64 | 1.6099 | 7.5203 | 12.0559 | nan | 74.0742 | 71.7899 | | mobilevit_s | 64 | 2.0254 | 7.6682 | 15.547 | nan | 69.0203 | 66.6449 | | jx_nest_base | 32 | 1.9678 | 9.047 | 15.2097 | nan | 65.7428 | 60.3427 | | res2net101_26w_4s | 64 | 3.8382 | 16.3009 | 28.6909 | nan | 62.687 | 59.32 | | coat_lite_mini | 128 | 1.164 | 5.2086 | 8.2313 | 115.599 | 59.481 | 63.473 | | res2net50_14w_8s | 128 | 3.199 | 14.5724 | 24.7141 | 343.0074 | 56.6981 | 54.3689 | | poolformer_m36 | 64 | 2.046 | 7.3563 | 12.3096 | nan | 52.8279 | 50.5782 | | sebotnet33ts_256 | 64 | 1.8108 | 6.3406 | 13.2313 | 149.9576 | 51.2354 | 46.1216 | | dpn107 | 32 | 4.1949 | 13.4387 | 39.2621 | nan | 46.5527 | 44.843 | | gmlp_s16_224 | 128 | 1.4232 | 7.4205 | 12.3781 | 192.1842 | 45.7715 | 46.6529 | | gluon_xception65 | 32 | 2.2728 | 11.1113 | 19.0372 | nan | 44.3725 | 42.6108 | | crossvit_9_240 | 128 | 1.8442 | 8.578 | 13.4832 | 209.6271 | 44.1671 | 39.715 | | fbnetv3_b | 128 | 3.4356 | 11.7237 | 28.3483 | nan | 44.1339 | 41.7885 | | volo_d1_224 | 64 | 1.4752 | 7.4875 | 12.5368 | nan | 
42.5194 | 40.8009 | | tnt_s_patch16_224 | 128 | 2.0371 | 10.567 | nan | nan | 42.2081 | 38.7794 | | eca_botnext26ts_256 | 128 | 1.4137 | 4.8096 | 10.5781 | 122.1153 | 40.0027 | 39.4995 | | gluon_inception_v3 | 128 | 1.8184 | 8.3336 | 13.5165 | 183.7948 | 39.4389 | 36.5126 | | inception_v3 | 128 | 1.8355 | 8.4829 | 14.1661 | 180.5159 | 39.0393 | 36.3575 | | adv_inception_v3 | 128 | 1.8028 | 8.3913 | 14.2708 | 192.088 | 38.7681 | 37.0544 | | ghostnet_100 | 128 | 3.2758 | 9.8658 | 14.7149 | 198.1274 | 38.1693 | 36.3664 | | tf_mixnet_l | 128 | 5.9829 | 13.0113 | 27.2447 | nan | 37.98 | 35.3296 | | dla102 | 128 | 2.0976 | 9.5457 | 15.4588 | 243.5536 | 37.8889 | 35.407 | | swsl_resnext101_32x16d | 32 | 2.0872 | 9.5978 | 14.95 | nan | 37.2681 | 34.4429 | | mixnet_l | 128 | 5.8368 | 12.886 | 26.7032 | nan | 37.0197 | 34.5647 | | gmixer_24_224 | 128 | 1.5559 | 8.3214 | 13.833 | 186.6115 | 36.1304 | 34.253 | | botnet26t_256 | 128 | 1.4192 | 4.2514 | 9.1907 | 93.2166 | 34.3983 | 34.301 | | dm_nfnet_f0 | 128 | 2.2 | 7.2186 | 11.1495 | 165.4836 | 33.8793 | 31.6644 | | res2next50 | 128 | 1.7841 | 8.3589 | 12.979 | 199.8263 | 33.0534 | 30.4344 | | rexnet_100 | 128 | 2.1191 | 7.6466 | 17.1665 | nan | 31.2198 | 29.4947 | | tinynet_a | 128 | 2.2595 | 8.4071 | 20.1379 | 196.3846 | 31.0709 | 28.8351 | | convit_base | 64 | 1.2455 | 5.9242 | 10.1991 | 153.7663 | 30.6487 | 29.8927 | | spnasnet_100 | 128 | 2.142 | 6.6714 | 16.844 | 130.8637 | 27.2582 | 23.6541 | | cspdarknet53 | 64 | 2.428 | 7.5775 | 18.9454 | 152.2484 | 26.7424 | 25.302 | | tf_efficientnet_b0 | 128 | 2.07 | 7.0987 | 16.4148 | 181.1555 | 26.7206 | 25.0728 | | convmixer_768_32 | 32 | 1.3394 | 6.3698 | 10.1924 | nan | 26.0249 | 24.7827 | | mixer_b16_224 | 128 | 0.8074 | 3.7087 | 6.1783 | 85.842 | 25.7441 | 25.1425 | | fbnetc_100 | 128 | 2.1653 | 6.8418 | 17.6985 | 132.7798 | 25.6858 | 24.2034 | | visformer_small | 128 | 0.9501 | 3.9413 | 6.4373 | nan | 25.407 | 24.0502 | | pit_b_224 | 64 | 1.2474 | 5.2573 | 8.4407 | 
112.5252 | 24.6971 | 23.4871 | | deit_base_distilled_patch16_224 | 64 | 1.0083 | 4.6871 | 7.2714 | 90.8766 | 24.6586 | 23.1078 | | nfnet_l0 | 128 | 1.9424 | 7.1988 | 10.8146 | 147.1702 | 24.2442 | 22.4526 | | resmlp_12_224 | 128 | 0.6663 | 3.062 | 4.8252 | 50.8617 | 23.6824 | 23.2417 | | vit_base_patch16_224 | 64 | 0.9784 | 4.5034 | 7.2422 | 90.0979 | 23.6698 | 22.9412 | | mobilenetv3_large_100 | 128 | 1.7096 | 5.7653 | 13.4328 | 142.3314 | 23.631 | 22.049 | | beit_base_patch16_224 | 64 | 1.2594 | 5.4088 | nan | nan | 23.1436 | 21.2986 | | mobilenetv2_100 | 128 | 1.7348 | 5.5579 | 12.8836 | 115.6263 | 21.912 | 20.8556 | | repvgg_a2 | 128 | 2.0993 | 6.2295 | 15.6915 | 190.227 | 21.9087 | 21.1098 | | regnety_002 | 128 | 1.7366 | 5.8363 | 13.1767 | 116.1875 | 21.1965 | 19.7573 | | mnasnet_100 | 128 | 1.8189 | 5.2798 | 13.436 | 110.8348 | 20.8608 | 19.6753 | | gernet_l | 128 | 2.0725 | 6.3321 | 15.476 | 111.6961 | 20.7129 | 19.6767 | | selecsls42b | 128 | 0.8294 | 3.6995 | 5.8344 | 88.2922 | 20.2905 | 17.304 | | lcnet_050 | 128 | 1.0365 | 3.3959 | 7.9625 | 81.0304 | 16.5345 | 14.3874 | | ese_vovnet19b_dw | 128 | 1.0119 | 3.1816 | 6.9082 | 66.8317 | 14.3342 | 13.3635 | | eca_halonext26ts | 128 | 1.4798 | 4.9573 | 12.0437 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | tinynet_a | 128 | 0.9889 | 0.7884 | 0.2764 | 0.4726 | 1.3706 | 1.5063 | | gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 | | gmlp_s16_224 | 128 | 0.9938 | 0.9715 | 
0.3561 | 1.3557 | 1.284 | 1.2997 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2666 | 0.548 | 1.1886 | 1.3558 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.1831 | 1.3111 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3632 | nan | 1.1604 | 1.2933 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.1474 | 1.3179 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2669 | 0.476 | 1.1067 | 1.2643 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1021 | 1.1167 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0828 | 1.1492 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 1.0592 | 1.1461 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 1.0587 | 1.152 | | convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0528 | 1.1534 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.0378 | 1.1389 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3555 | 0.4814 | 1.0332 | 1.1293 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.268 | 0.3766 | 1.0332 | 1.1822 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9907 | 1.2248 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.9862 | 1.0421 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9746 | 0.9788 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9699 | 1.0818 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9622 | 1.0521 | | dla102 | 128 | 0.9694 | 0.912 | 0.3363 | 0.9309 | 0.9555 | 1.031 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.9489 | 1.0707 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2875 | nan | 0.9363 | 1.0878 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9318 | 0.9931 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.9307 | 1.0268 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.929 | 0.9804 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.9181 | 1.0684 | | 
swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | nan | 0.9111 | 0.981 | | mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.9088 | 0.9818 | | dpn107 | 32 | 0.997 | 0.9097 | 0.3531 | nan | 0.9069 | 0.9966 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8977 | 0.973 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8975 | 0.9763 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8975 | 1.0248 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8975 | 1.0248 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8975 | 1.0248 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.8973 | 0.9876 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.8969 | 1.0032 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3529 | 0.8765 | 0.8926 | 0.9897 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 1.222 | 0.8916 | 0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8911 | 0.8962 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8795 | 0.9819 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.877 | 0.9738 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.8719 | 0.9671 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.871 | 0.9804 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2718 | nan | 0.8701 | 1.0089 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8161 | 0.8619 | 0.9858 | | cspdarknet53 | 64 | 0.9915 | 0.8405 | 0.3241 | 0.8382 | 0.8607 | 1.0102 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | 0.8188 | 0.8449 | 0.9432 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.8371 | 1.0078 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.8174 | 1.0976 | | convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.8095 | 0.9865 | | resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.8092 | 0.8236 | | 
coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3515 | 1.1591 | 0.8033 | 1.0359 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5513 | 0.7449 | 0.8293 | | jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6707 | 0.8617 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5534 | 0.8298 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.267 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 297.2012 | 297.6552 | 322.2704 | nan | 281.5692 | 282.9016 | | hrnet_w18 | 128 | 295.7637 | 292.6829 | 350.2857 | nan | 188.6129 | 201.926 | | convnext_base | 64 | 121.7902 | 122.1265 | 153.0377 | nan | 182.7533 | 184.3682 | | tnt_s_patch16_224 | 128 | 364.4692 | 365.1483 | nan | nan | 171.0111 | 173.9549 | | pnasnet5large | 16 | 217.0867 | 213.6155 | 259.8056 | nan | 169.2449 | 173.5484 | | tf_mixnet_l | 128 | 195.0623 | 211.0251 | 241.4356 | nan | 162.6211 | 163.3519 | | mixnet_l | 128 | 186.991 | 202.6174 | 230.6038 | nan | 157.7774 | 158.533 | | convit_base | 64 | 181.5497 | 182.0696 | 218.626 | 148.4995 | 132.9989 | 137.548 | | pit_b_224 | 64 | 155.2003 | 155.724 | 188.7323 | 159.4835 | 114.7917 | 115.2509 | | dla102 | 128 | 178.806 | 179.3349 | 213.6339 | 136.064 | 112.9703 | 115.3019 | | poolformer_m36 | 64 | 149.0007 | 149.3938 | 184.9647 | nan | 112.2877 | 114.86 | | cait_m36_384 | 4 | 167.1017 | 168.4447 | nan | nan | 110.3795 | 115.1996 | | 
resnest101e | 64 | 164.5008 | 165.2553 | 200.6749 | nan | 107.9529 | 114.455 | | gluon_inception_v3 | 128 | 161.4164 | 162.2002 | 189.4627 | 141.4838 | 107.3705 | 109.8805 | | adv_inception_v3 | 128 | 161.1299 | 161.7173 | 191.1698 | 141.236 | 107.33 | 109.5753 | | inception_v3 | 128 | 161.1853 | 161.4862 | 191.2728 | 141.1463 | 107.2584 | 109.5566 | | beit_base_patch16_224 | 64 | 135.211 | 138.2616 | nan | nan | 105.875 | 106.3443 | | swsl_resnext101_32x16d | 32 | 118.0356 | 120.0446 | 146.5326 | nan | 104.2256 | 111.2939 | | vit_base_patch16_224 | 64 | 120.7271 | 121.44 | 144.7644 | 132.2696 | 100.8705 | 102.3045 | | res2net50_14w_8s | 128 | 146.3134 | 147.2252 | 180.9862 | 148.2273 | 99.958 | 104.1165 | | res2next50 | 128 | 138.4534 | 139.1172 | 166.9756 | 121.0783 | 97.9671 | 102.5907 | | swin_base_patch4_window7_224 | 64 | 148.0334 | 153.8232 | nan | nan | 97.3525 | 96.7401 | | dpn107 | 32 | 113.9935 | 115.7366 | 142.9863 | nan | 93.5454 | 91.8147 | | mixer_b16_224 | 128 | 118.6705 | 118.9799 | 148.1175 | 131.7066 | 92.2599 | 92.9467 | | gmlp_s16_224 | 128 | 136.3028 | 137.0485 | 173.8843 | 136.4323 | 87.8767 | 87.9858 | | dm_nfnet_f0 | 128 | 131.9656 | 131.3687 | 149.6431 | 142.8271 | 87.4064 | 91.773 | | eca_botnext26ts_256 | 128 | 112.1528 | 135.4549 | 163.9329 | 103.2135 | 86.2292 | 86.4199 | | jx_nest_base | 32 | 119.1945 | 119.9259 | 148.8671 | nan | 85.0492 | 86.954 | | gluon_xception65 | 32 | 97.9304 | 99.1017 | 132.3362 | nan | 84.5906 | 86.8671 | | volo_d1_224 | 64 | 134.5055 | 135.3838 | 159.3808 | nan | 84.0423 | 85.79 | | fbnetv3_b | 128 | 121.239 | 122.7734 | 150.0354 | nan | 83.4748 | 82.5727 | | visformer_small | 128 | 98.423 | 98.0585 | 116.9899 | nan | 79.8724 | 83.9585 | | botnet26t_256 | 128 | 106.0169 | 106.5251 | 128.1581 | 81.2427 | 78.2773 | 77.9489 | | res2net101_26w_4s | 64 | 129.4056 | 120.7372 | 127.7173 | nan | 77.8875 | 93.5689 | | gmixer_24_224 | 128 | 120.143 | 136.2688 | 166.6362 | 130.1531 | 77.8693 | 80.6259 | | 
crossvit_9_240 | 128 | 109.4211 | 109.9573 | 130.7572 | 119.2477 | 75.834 | 77.1816 | | twins_pcpvt_base | 64 | 125.8281 | 140.1379 | 138.619 | nan | 74.0034 | 82.8287 | | deit_base_distilled_patch16_224 | 64 | 94.4701 | 95.0941 | 118.3525 | 96.7031 | 73.5168 | 74.7578 | | gernet_l | 128 | 79.7826 | 80.63 | 98.8496 | 71.2254 | 71.1511 | 70.2217 | | coat_lite_mini | 128 | 116.0688 | 116.6898 | 137.2579 | 102.6788 | 70.8925 | 71.2567 | | cspdarknet53 | 64 | 95.9988 | 97.1337 | 120.3978 | 83.4959 | 69.4271 | 68.4458 | | rexnet_100 | 128 | 91.2005 | 103.7693 | 127.8593 | nan | 69.0663 | 68.9124 | | repvgg_a2 | 128 | 79.8934 | 80.4943 | 94.7705 | 70.2568 | 68.3359 | 67.2041 | | nfnet_l0 | 128 | 106.0545 | 131.1336 | 149.0558 | 124.6301 | 68.2231 | 72.1704 | | sebotnet33ts_256 | 64 | 83.2947 | 96.3097 | 118.7311 | 82.8617 | 67.3085 | 66.9981 | | tf_efficientnet_b0 | 128 | 90.8434 | 108.3816 | 131.7529 | 93.5193 | 64.9063 | 64.6038 | | mobilevit_s | 64 | 89.9953 | 107.7772 | 133.809 | nan | 63.202 | 63.7595 | | fbnetc_100 | 128 | 88.124 | 89.0205 | 106.1362 | 71.972 | 62.2578 | 60.9353 | | xcit_large_24_p8_224 | 5 | 152.1427 | nan | nan | nan | 60.5541 | 70.0063 | | tinynet_a | 128 | 75.6275 | 93.6766 | 111.4048 | 92.1521 | 58.3689 | 57.7592 | | spnasnet_100 | 128 | 76.8671 | 77.5886 | 93.4156 | 66.3198 | 53.3088 | 52.0371 | | ese_vovnet19b_dw | 128 | 67.9256 | 68.3473 | 86.1823 | 58.473 | 48.1658 | 47.9167 | | resmlp_12_224 | 128 | 68.2422 | 68.3942 | 87.3654 | 45.9677 | 48.0258 | 48.3864 | | mnasnet_100 | 128 | 70.313 | 70.8712 | 85.2137 | 56.5084 | 46.9784 | 45.9185 | | ghostnet_100 | 128 | 95.2988 | 98.0453 | 108.4538 | 101.9774 | 46.2471 | 53.8296 | | mobilenetv2_100 | 128 | 67.7307 | 68.5022 | 89.5881 | 57.2101 | 46.0348 | 45.0417 | | selecsls42b | 128 | 62.9197 | 63.2483 | 74.965 | 49.0077 | 43.6689 | 44.4996 | | mobilenetv3_large_100 | 128 | 66.0048 | 66.8591 | 80.9122 | 63.9576 | 43.6135 | 43.9001 | | regnety_002 | 128 | 54.1356 | 56.4044 | 47.2578 | 61.9577 | 
25.6821 | 36.8173 | | lcnet_050 | 128 | 34.1434 | 35.0589 | 39.0166 | 34.1058 | 16.48 | 20.5777 | | eca_halonext26ts | 128 | 115.9964 | 139.8324 | 168.2269 | nan | nan | nan | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/c0L2izs.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/Wus32jD.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/nPuG8St.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
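As a rough illustration of the aggregation described above, the sketch below drops models that fail the accuracy check and then takes the geometric mean of the remaining per-model speedups. The record layout, the name `geomean_speedup`, and the choice to also skip non-positive speedups (a 0.0 speedup has no logarithm) are assumptions made for this example, not the dashboard's actual code.

```python
import math


def geomean_speedup(results):
    """Geometric mean speedup over models that pass the accuracy check.

    `results` is a hypothetical list of dicts with two keys:
      accuracy -- status string such as "pass" or "fail_accuracy"
      speedup  -- eager latency divided by compiled latency
    """
    logs = []
    for r in results:
        if r["accuracy"] != "pass":
            continue  # models that fail accuracy are removed first
        if r["speedup"] <= 0:
            continue  # assumption: 0.0 marks a failed perf run, so skip it
        logs.append(math.log(r["speedup"]))
    if not logs:
        return float("nan")
    return math.exp(sum(logs) / len(logs))


# Illustrative numbers only, not taken from the dashboard.
example = [
    {"accuracy": "pass", "speedup": 2.0},
    {"accuracy": "pass", "speedup": 0.5},
    {"accuracy": "fail_accuracy", "speedup": 10.0},  # excluded
]
print(f"{geomean_speedup(example):.2f}x")
```

A geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 0.5x slowdown cancel out to 1x instead of averaging to 1.25x.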

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 95%, 53/56 | 91%, 43/47  | 98%, 60/61  |
|       aot_eager        | 91%, 51/56 | 91%, 43/47  | 98%, 60/61  |
|     aot_cudagraphs     | 73%, 41/56 | 34%, 16/47  | 46%, 28/61  |
|    nvprims_nvfuser     | 75%, 42/56 | 57%, 27/47  | 67%, 41/61  |
|        inductor        | 84%, 47/56 | 83%, 39/47  | 93%, 57/61  |
| inductor_no_cudagraphs | 89%, 50/56 | 87%, 41/47  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
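Each cell in the table above pairs a rounded percentage with the raw pass count. A minimal sketch of that formatting follows; the helper name `format_passrate` is hypothetical, and rounding to the nearest integer is an assumption that happens to match the entries shown.

```python
def format_passrate(passed, total):
    """Format a pass rate as 'PP%, passed/total'."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent = round(100 * passed / total)  # nearest whole percent (assumed)
    return f"{percent}%, {passed}/{total}"


print(format_passrate(53, 56))  # eager on torchbench
print(format_passrate(16, 47))  # aot_cudagraphs on huggingface
```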

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.02x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.12x    |    1.00x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.04x    |    1.14x    |
|        inductor        |   1.47x    |    1.25x    |    1.23x    |
| inductor_no_cudagraphs |   1.23x    |    1.22x    |    1.23x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.84    |    2.48     |    2.00     |
|       aot_eager        |    5.51    |    7.97     |    7.17     |
|     aot_cudagraphs     |    7.42    |    15.41    |    13.11    |
|    nvprims_nvfuser     |   64.72    |   100.64    |   148.57    |
|        inductor        |   30.09    |    33.07    |    35.96    |
| inductor_no_cudagraphs |   29.79    |    29.18    |    34.75    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.93x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.91x    |    1.02x    |    0.95x    |
|        inductor        |   0.83x    |    0.75x    |    0.98x    |
| inductor_no_cudagraphs |   0.98x    |    1.00x    |    1.09x    |
+------------------------+------------+-------------+-------------+
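The report does not spell out how the compression ratio is computed. One reading consistent with "higher is better" is eager peak memory divided by the backend's peak memory, so values below 1.0x mean the backend uses more memory than eager. The sketch below assumes that definition; the function name and the byte counts are made up for illustration.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Assumed definition: eager peak memory / backend peak memory."""
    if backend_peak_bytes <= 0:
        raise ValueError("backend_peak_bytes must be positive")
    return eager_peak_bytes / backend_peak_bytes


# Illustrative values: the backend peaks at 12.5 GB where eager peaks at 10 GB.
ratio = compression_ratio(10.0e9, 12.5e9)
print(f"{ratio:.2f}x")  # below 1.0x, so the backend uses more memory
```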

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the 2 most recent reports that actually run the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_float32_102

Previous report name: /data/home/anijain/cluster/cron_logs/day_320_16_11_22_performance_float32_377

Passrate diff

~~~
+------------------------+-------------+------------+------------+
| compiler               | suite       | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
| inductor               | torchbench  | 84%, 47/56 | 84%, 47/56 |
| inductor               | huggingface | 91%, 39/43 | 91%, 39/43 |
| inductor               | timm_models | 93%, 57/61 | 93%, 57/61 |
| inductor_no_cudagraphs | torchbench  | 91%, 51/56 | 91%, 51/56 |
| inductor_no_cudagraphs | huggingface | 91%, 39/43 | 91%, 39/43 |
| inductor_no_cudagraphs | timm_models | 93%, 57/61 | 93%, 57/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff

~~~
+------------------------+-------------+------------+-----------+
| compiler               | suite       | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
| inductor               | torchbench  | 1.53x      | 1.52x     |
| inductor               | huggingface | 1.30x      | 1.30x     |
| inductor               | timm_models | 1.25x      | 1.24x     |
| inductor_no_cudagraphs | torchbench  | 1.23x      | 1.23x     |
| inductor_no_cudagraphs | huggingface | 1.23x      | 1.23x     |
| inductor_no_cudagraphs | timm_models | 1.24x      | 1.24x     |
+------------------------+-------------+------------+-----------+
~~~
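A diff like this one can be produced by joining two reports on (compiler, suite). The sketch below is one possible way to do it; the dict-based report layout and the name `summary_diff` are assumptions for illustration, and the sample values are the inductor speedups from the geometric mean speedup diff.

```python
def summary_diff(prev, cur):
    """Pair up previous and current values for keys present in both reports.

    `prev` and `cur` are hypothetical dicts mapping (compiler, suite) to a
    numeric summary statistic. Returns rows of
    (compiler, suite, prev_value, cur_value, delta), sorted by key.
    """
    rows = []
    for key in sorted(prev.keys() & cur.keys()):
        compiler, suite = key
        delta = round(cur[key] - prev[key], 4)
        rows.append((compiler, suite, prev[key], cur[key], delta))
    return rows


prev_report = {("inductor", "torchbench"): 1.53, ("inductor", "timm_models"): 1.25}
cur_report = {("inductor", "torchbench"): 1.52, ("inductor", "timm_models"): 1.24}

for compiler, suite, prev_value, cur_value, delta in summary_diff(prev_report, cur_report):
    print(f"{compiler:<10} {suite:<12} {prev_value:.2f}x -> {cur_value:.2f}x ({delta:+.2f})")
```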

Warnings

We flag models where:

- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings

~~~
+-------------+---------------------------------+---------------+------------------------+
| suite       | name                            | inductor      | inductor_no_cudagraphs |
+-------------+---------------------------------+---------------+------------------------+
| torchbench  | tacotron2                       | fail_to_run   | pass                   |
| torchbench  | functorch_dp_cifar10            | fail_to_run   | fail_to_run            |
| torchbench  | hf_BigBird                      | fail_to_run   | fail_to_run            |
| torchbench  | hf_Longformer                   | fail_to_run   | fail_to_run            |
| torchbench  | moco                            | fail_to_run   | fail_to_run            |
| torchbench  | resnet50_quantized_qat          | fail_accuracy | fail_accuracy          |
| torchbench  | mobilenet_v2_quantized_qat      | fail_accuracy | fail_accuracy          |
| torchbench  | vision_maskrcnn                 | 0.0000        | 0.0000                 |
| huggingface | DebertaV2ForQuestionAnswering   | fail_to_run   | pass                   |
| huggingface | PLBartForConditionalGeneration  | fail_to_run   | fail_to_run            |
| huggingface | MBartForConditionalGeneration   | fail_to_run   | fail_to_run            |
| huggingface | AllenaiLongformerBase           | fail_to_run   | fail_to_run            |
| huggingface | DebertaV2ForMaskedLM            | fail_to_run   | 0.0000                 |
| huggingface | BlenderbotForCausalLM           | 0.0000        | 0.0000                 |
| timm_models | deit_base_distilled_patch16_224 | pass          | fail_accuracy          |
| timm_models | convit_base                     | fail_to_run   | fail_to_run            |
| timm_models | fbnetv3_b                       | fail_accuracy | fail_accuracy          |
| timm_models | resnest101e                     | fail_accuracy | fail_accuracy          |
+-------------+---------------------------------+---------------+------------------------+
~~~

Performance speedup warnings

~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | lennard_jones                 | 1.8383   | 0.921                  |
| torchbench  | nvidia_deeprecommender        | 0.9043   | 0.9644                 |
| torchbench  | hf_GPT2_large                 | 0.0      | 1.4723                 |
| torchbench  | tacotron2                     | 0.0      | 0.9261                 |
| torchbench  | hf_T5                         | 0.0      | 1.562                  |
| torchbench  | functorch_dp_cifar10          | 0.0      | 0.0                    |
| torchbench  | hf_BigBird                    | 0.0      | 0.0                    |
| torchbench  | hf_Longformer                 | 0.0      | 0.0                    |
| torchbench  | moco                          | 0.0      | 0.0                    |
| huggingface | DebertaV2ForMaskedLM          | 1.0025   | 0.8488                 |
| huggingface | DebertaV2ForQuestionAnswering | 0.9244   | 0.9075                 |
| huggingface | TrOCRForCausalLM              | 0.0      | 1.0282                 |
| huggingface | BlenderbotForCausalLM         | 0.0      | 1.0151                 |
| huggingface | AllenaiLongformerBase         | 0.0      | 0.0                    |
| timm_models | tnt_s_patch16_224             | 0.0      | 1.5102                 |
+-------------+-------------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | yolov3                        | 365.7827 | 364.9662               |
| torchbench  | timm_efficientdet             | 124.4288 | 119.7085               |
| huggingface | DebertaV2ForMaskedLM          | 162.0953 | 48.897                 |
| huggingface | DebertaV2ForQuestionAnswering | 161.8923 | 48.3798                |
| huggingface | XLNetLMHeadModel              | 120.5635 | 122.4315               |
+-------------+-------------------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0023                 |
| torchbench  | hf_Albert                               | 0.8836   | 1.2215                 |
| torchbench  | mobilenet_v3_large                      | 0.8829   | 0.896                  |
| torchbench  | hf_T5_large                             | 0.8738   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8621   | 1.031                  |
| torchbench  | densenet121                             | 0.857    | 1.0006                 |
| torchbench  | resnet50                                | 0.8566   | 0.9343                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | pytorch_unet                            | 0.8484   | 1.0138                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.1229                 |
| torchbench  | hf_Bart                                 | 0.8325   | 1.1284                 |
| torchbench  | resnext50_32x4d                         | 0.8303   | 0.8352                 |
| torchbench  | BERT_pytorch                            | 0.826    | 1.0815                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7609   | 0.9526                 |
| torchbench  | timm_vision_transformer                 | 0.7507   | 0.8214                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8335                 |
| torchbench  | hf_Bert                                 | 0.7061   | 1.0275                 |
| torchbench  | dlrm                                    | 0.7035   | 0.7307                 |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | LearningToPaint                         | 0.6882   | 0.913                  |
| torchbench  | vgg16                                   | 0.6637   | 0.9553                 |
| torchbench  | hf_DistilBert                           | 0.6595   | 0.9466                 |
| torchbench  | hf_Reformer                             | 0.577    | 1.0026                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4867   | 0.6781                 |
| torchbench  | pytorch_struct                          | 0.4213   | 0.4334                 |
| torchbench  | dcgan                                   | 0.2564   | 0.2576                 |
| huggingface | YituTechConvBert                        | 0.894    | 0.9822                 |
| huggingface | DistillGPT2                             | 0.8939   | 1.0108                 |
| huggingface | M2M100ForConditionalGeneration          | 0.8869   | 1.0205                 |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4039                 |
| huggingface | PegasusForConditionalGeneration         | 0.8637   | 1.0262                 |
| huggingface | AlbertForMaskedLM                       | 0.842    | 1.3737                 |
| huggingface | PLBartForCausalLM                       | 0.8367   | 1.0581                 |
| huggingface | T5Small                                 | 0.8215   | 1.1049                 |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.1049                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | ElectraForCausalLM                      | 0.7929   | 0.9036                 |
| huggingface | MBartForConditionalGeneration           | 0.7896   | 0.9837                 |
| huggingface | MT5ForConditionalGeneration             | 0.7785   | 0.9242                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.9692                 |
| huggingface | BartForConditionalGeneration            | 0.7734   | 0.958                  |
| huggingface | MegatronBertForQuestionAnswering        | 0.7709   | 1.0379                 |
| huggingface | MegatronBertForCausalLM                 | 0.7673   | 1.0153                 |
| huggingface | MBartForCausalLM                        | 0.7326   | 0.9478                 |
| huggingface | BertForQuestionAnswering                | 0.7273   | 1.0273                 |
| huggingface | RobertaForQuestionAnswering             | 0.7273   | 1.0273                 |
| huggingface | LayoutLMForSequenceClassification       | 0.7189   | 1.0294                 |
| huggingface | BartForCausalLM                         | 0.7149   | 0.9466                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.7147   | 0.8647                 |
| huggingface | ElectraForQuestionAnswering             | 0.7054   | 1.0297                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977   | 0.946                  |
| huggingface | LayoutLMForMaskedLM                     | 0.695    | 0.9772                 |
| huggingface | BertForMaskedLM                         | 0.6945   | 0.9772                 |
| huggingface | RobertaForCausalLM                      | 0.6942   | 0.9771                 |
| huggingface | CamemBert                               | 0.6942   | 0.9746                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.675    | 0.9168                 |
| huggingface | DistilBertForQuestionAnswering          | 0.6589   | 0.9118                 |
| huggingface | DistilBertForMaskedLM                   | 0.6509   | 0.9194                 |
| huggingface | Reformer                                | 0.573    | 1.0028                 |
| huggingface | DebertaV2ForMaskedLM                    | 0.5682   | 0.9491                 |
| huggingface | MobileBertForMaskedLM                   | 0.4951   | 0.6649                 |
| huggingface | DebertaV2ForQuestionAnswering           | 0.4735   | 0.984                  |
| huggingface | MobileBertForQuestionAnswering          | 0.4145   | 0.535                  |
| huggingface | DebertaForMaskedLM                      | 0.3862   | 1.0347                 |
| huggingface | DebertaForQuestionAnswering             | 0.2902   | 1.1339                 |
| huggingface | BlenderbotForCausalLM                   | nan      | 0.8509                 |
| timm_models | selecsls42b                             | 0.899    | 1.0046                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8932   | 0.9946                 |
| timm_models | res2net50_14w_8s                        | 0.8821   | 1.0206                 |
| timm_models | regnety_002                             | 0.8617   | 1.0396                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.9622                 |
| timm_models | convnext_base                           | 0.8578   | 1.0369                 |
| timm_models | pit_b_224                               | 0.8526   | 1.0752                 |
| timm_models | coat_lite_mini                          | 0.8213   | 1.0246                 |
| timm_models | sebotnet33ts_256                        | 0.8189   | 0.9416                 |
| timm_models | resmlp_12_224                           | 0.8169   | 0.8253                 |
| timm_models | gernet_l                                | 0.7928   | 0.9926                 |
| timm_models | repvgg_a2                               | 0.7684   | 0.9902                 |
| timm_models | convit_base                             | 0.7449   | 0.9008                 |
| timm_models | crossvit_9_240                          | 0.6742   | 0.9001                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.8633                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
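The four flagging rules listed at the top of this section can be sketched as a small predicate per model. The thresholds come from the list above. The row layout and the name `warning_flags` are hypothetical, and the handling of `nan` (never flagged by a threshold) is an assumption. The sample row uses the inductor values reported for DebertaV2ForMaskedLM in the tables above.

```python
SPEEDUP_MIN = 0.95               # flag when speedup < 0.95x
COMPILE_LATENCY_MAX_SEC = 120.0  # flag when compilation latency > 120 sec
COMPRESSION_RATIO_MIN = 0.9      # flag when compression ratio < 0.9


def warning_flags(row):
    """Return the names of the warning rules that a result row triggers.

    `row` is a hypothetical dict with keys accuracy, speedup,
    compilation_latency and compression_ratio. Comparisons against NaN are
    always False in Python, so a missing (NaN) measurement is not flagged
    by a threshold rule here.
    """
    flags = []
    if row["accuracy"] != "pass":
        flags.append("accuracy")
    if row["speedup"] < SPEEDUP_MIN:
        flags.append("speedup")
    if row["compilation_latency"] > COMPILE_LATENCY_MAX_SEC:
        flags.append("compilation_latency")
    if row["compression_ratio"] < COMPRESSION_RATIO_MIN:
        flags.append("compression_ratio")
    return flags


deberta_inductor = {
    "accuracy": "fail_to_run",
    "speedup": 1.0025,
    "compilation_latency": 162.0953,
    "compression_ratio": 0.5682,
}
print(warning_flags(deberta_inductor))
```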

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_float32_102
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_float32_102

Accuracy regressions
~~~
+------------------------+----------------------+-------------+-------------+
| compiler | name | prev_status | cur_status |
+------------------------+----------------------+-------------+-------------+
| inductor | functorch_dp_cifar10 | pass | fail_to_run |
| inductor_no_cudagraphs | functorch_dp_cifar10 | pass | fail_to_run |
+------------------------+----------------------+-------------+-------------+
~~~

Performance speedup regressions
~~~
+------------------------+----------------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+------------------------+----------------------+-------------+------------+
| inductor | functorch_dp_cifar10 | 3.6424 | 0.0 |
| inductor_no_cudagraphs | lennard_jones | 0.9603 | 0.921 |
| inductor_no_cudagraphs | functorch_dp_cifar10 | 1.2047 | 0.0 |
+------------------------+----------------------+-------------+------------+
~~~

Peak Memory Compression Ratio regressions
~~~
+----------+--------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+----------+--------------+-------------+------------+
| inductor | pytorch_unet | 0.9117 | 0.8484 |
+----------+--------------+-------------+------------+
~~~

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_float32_102
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_float32_102

Performance speedup regressions
~~~
+----------+------------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+----------+------------------+-------------+------------+
| inductor | TrOCRForCausalLM | 1.0052 | 0.0 |
+----------+------------------+-------------+------------+
~~~

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_float32_102
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_float32_102

Peak Memory Compression Ratio regressions
~~~
+----------+---------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+----------+---------------+-------------+------------+
| inductor | convnext_base | 0.9576 | 0.8578 |
+----------+---------------+-------------+------------+
~~~
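
The comparison described in this section can be sketched roughly as follows. This is only an illustration, not the dashboard's actual code: the function name `find_regressions`, the dict-based report layout, and the 0.95 speedup threshold are all assumptions made for the example, since the real 'Warnings' criteria are not stated in this report. The sample values are the `inductor_no_cudagraphs` torchbench speedups listed above.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical report layout: {model_name: metric_value} for one compiler and suite.
Report = Dict[str, float]


def find_regressions(
    prev: Report,
    cur: Report,
    is_flagged: Callable[[float], bool],
) -> List[Tuple[str, float, float]]:
    """Return (name, prev_value, cur_value) for every model that was not
    flagged in the previous report but is flagged in the current one."""
    regressions = []
    for name, cur_value in cur.items():
        if name not in prev:
            continue  # no earlier result to compare against
        prev_value = prev[name]
        if not is_flagged(prev_value) and is_flagged(cur_value):
            regressions.append((name, prev_value, cur_value))
    return regressions


def speedup_is_flagged(value: float) -> bool:
    # Assumed threshold, for illustration only: flag speedups below 0.95x.
    return value < 0.95


# Speedups for inductor_no_cudagraphs on torchbench, taken from the table above.
prev_report = {"lennard_jones": 0.9603, "functorch_dp_cifar10": 1.2047}
cur_report = {"lennard_jones": 0.921, "functorch_dp_cifar10": 0.0}

regressions = find_regressions(prev_report, cur_report, speedup_is_flagged)
for name, prev_value, cur_value in regressions:
    print(f"{name}: {prev_value} -> {cur_value}")
```

Passing the flagging rule in as a callable keeps the same comparison usable for the other metrics (accuracy status, peak memory compression ratio), each of which would need its own assumed criterion.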

torchbench suite with float32 precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| densenet121 | 4 | 1.0021 | 1.0247 | 2.3706 | 0.7796 | 5.2827 | 1.2791 |
| timm_efficientdet | 1 | 0.9802 | 0.8933 | 1.8185 | 0.7642 | 4.2831 | 1.5089 |
| timm_vision_transformer | 8 | 1.0053 | 0.9414 | 1.5082 | 0.6562 | 2.8543 | 1.4127 |
| drq | 1 | 0.9896 | 0.8637 | 1.5508 | 0.7067 | 2.4604 | 1.0479 |
| BERT_pytorch | 16 | 1.0132 | 0.8909 | 1.1032 | 0.9351 | 2.1083 | 2.1039 |
| resnext50_32x4d | 8 | 1.0009 | 1.1128 | 1.2157 | 0.8046 | 2.0384 | 1.2054 |
| mobilenet_v3_large | 32 | 1.0016 | 1.1142 | 1.0351 | 0.8831 | 1.983 | 1.3489 |
| dcgan | 32 | 0.9981 | 1.0301 | 1.2728 | 0.8151 | 1.9126 | 1.0353 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9959 | 1.0206 | 1.3118 | 0.8527 | 1.8877 | 1.3656 |
| resnet18 | 16 | 1.0007 | 1.115 | 1.1898 | 0.8839 | 1.8648 | 1.2495 |
| squeezenet1_1 | 32 | 0.9973 | 1.0092 | 1.0774 | 0.8768 | 1.8462 | 1.2547 |
| lennard_jones | 1000 | 0.9515 | 0.8623 | 1.0529 | 0.6826 | 1.8383 | 0.921 |
| pytorch_struct | 200 | 0.9909 | 0.7539 | 0.8633 | 0.7808 | 1.813 | 1.1444 |
| hf_T5_large | 2 | 1.0247 | 0.904 | 0.0 | 0.0 | 1.6622 | 1.6362 |
| hf_Albert | 8 | 1.0005 | 0.9988 | 0.753 | 1.5579 | 1.6466 | 1.6392 |
| shufflenet_v2_x1_0 | 128 | 0.9997 | 1.0759 | 0.8079 | 0.9016 | 1.6084 | 1.4361 |
| timm_resnest | 32 | 0.9993 | 1.0035 | 0.8048 | 1.1671 | 1.5203 | 1.4523 |
| hf_GPT2 | 4 | 1.0102 | 0.9801 | 0.7416 | 0.4023 | 1.4931 | 1.4992 |
| mnasnet1_0 | 32 | 0.9996 | 1.0993 | 0.8572 | 0.9093 | 1.4742 | 1.2782 |
| soft_actor_critic | 256 | 0.9566 | 0.7837 | 1.0452 | 0.6917 | 1.4459 | 0.9507 |
| speech_transformer | 32 | 0.9948 | 0.8879 | 1.3918 | 0.7428 | 1.4339 | 1.4342 |
| mobilenet_v2 | 96 | 0.9996 | 0.9992 | 0.728 | 1.3357 | 1.4283 | 1.4071 |
| fastNLP_Bert | 6 | 0.9975 | 0.9775 | 0.7522 | 1.1544 | 1.4173 | 1.3889 |
| timm_efficientnet | 32 | 0.9563 | 0.8107 | 0.6961 | 0.8181 | 1.3858 | 1.189 |
| LearningToPaint | 96 | 1.0029 | 1.0721 | 0.8682 | 0.9954 | 1.3484 | 1.2065 |
| pytorch_stargan | 16 | 0.9992 | 1.0759 | 0.9356 | 0.0 | 1.2672 | 1.2283 |
| resnet152 | 32 | 1.0017 | 1.0684 | 0.8489 | 0.9125 | 1.2359 | 1.2018 |
| hf_Bart | 4 | 1.0113 | 0.9725 | 0.7341 | 0.8608 | 1.2099 | 1.1932 |
| resnet50 | 32 | 0.9987 | 0.9931 | 0.7589 | 0.9956 | 1.2062 | 1.1695 |
| hf_Bert | 4 | 1.0279 | 0.9969 | 0.7285 | 0.8705 | 1.2005 | 1.1868 |
| pytorch_unet | 1 | 0.9999 | 0.2816 | 0.0 | 0.0 | 1.2004 | 1.1866 |
| timm_nfnet | 128 | 0.9993 | 1.0005 | 0.0 | 1.134 | 1.189 | 1.1591 |
| hf_DistilBert | 8 | 1.0007 | 0.9572 | 0.6874 | 0.524 | 1.1729 | 1.1808 |
| vgg16 | 64 | 0.9998 | 0.9987 | 0.8583 | 0.9978 | 1.1726 | 1.1664 |
| alexnet | 128 | 0.9999 | 0.998 | 0.8034 | 1.0047 | 1.1654 | 1.1676 |
| Super_SloMo | 6 | 0.9999 | 0.2426 | 0.0 | 0.2482 | 1.1448 | 1.1304 |
| hf_Reformer | 4 | 0.998 | 1.002 | 0.9895 | 0.7388 | 1.132 | 1.1413 |
| timm_regnet | 32 | 0.9647 | 0.9616 | 0.7812 | 1.0925 | 1.1215 | 1.0856 |
| Background_Matting | 4 | 1.0002 | 0.1919 | 0.0 | 0.0 | 1.0839 | 1.075 |
| yolov3 | 16 | 0.9998 | 0.9945 | 0.7925 | 1.1522 | 1.0823 | 1.0732 |
| attention_is_all_you_need_pytorch | 256 | 0.9997 | 0.9703 | 0.7575 | 0.9574 | 1.0503 | 1.0386 |
| mobilenet_v2_quantized_qat | 96 | 1.0016 | 0.9787 | 0.0 | 1.4619 | 1.0492 | 1.1085 |
| timm_vision_transformer_large | 8 | 1.0012 | 0.9944 | 0.0 | 0.0 | 1.048 | 1.0292 |
| timm_vovnet | 32 | 0.9124 | 0.9044 | 0.7153 | 0.9052 | 1.014 | 1.0218 |
| tts_angular | 64 | 0.9815 | 0.9645 | 0.9835 | 0.9741 | 1.0022 | 1.0161 |
| demucs | 4 | 0.9999 | 1.0001 | 1.0 | 0.9993 | 0.9999 | 0.9999 |
| resnet50_quantized_qat | 32 | 1.0018 | 0.974 | 0.0 | 1.1521 | 0.9961 | 1.0342 |
| dlrm | 2048 | 0.0 | 0.0 | 0.0 | 0.0 | 0.96 | 1.274 |
| nvidia_deeprecommender | 256 | 0.9992 | 0.9636 | 0.5844 | 0.976 | 0.9043 | 0.9644 |
| hf_GPT2_large | 4 | 0.9998 | 0.9806 | 0.0 | 0.0 | 0.0 | 1.4723 |
| tacotron2 | 64 | 0.9625 | 0.8645 | 0.0 | 0.767 | 0.0 | 0.9261 |
| hf_T5 | 8 | 1.0013 | 0.9535 | 0.0 | 1.1674 | 0.0 | 1.562 |
| functorch_dp_cifar10 | 64 | 0.9974 | 1.0322 | 1.9385 | 0.0 | 0.0 | 0.0 |
| hf_BigBird | 2 | 0.9612 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| yolov3 | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | 0.0000 | pass | pass | pass |
| dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass |
| timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass |
| Background_Matting | 4 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| resnet152 | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass |
| hf_Bart | 2 | pass | pass | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass |
| hf_Albert | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| hf_Bert | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| resnet50_quantized_qat | 2 | pass | pass | 0.0000 | pass | fail_accuracy | fail_accuracy |
| mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | 0.0000 | fail_accuracy | fail_accuracy | fail_accuracy |
| vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3 | 16 | 2.9068 | 7.0508 | 10.0543 | 112.7418 | 365.7827 | 364.9662 |
| timm_efficientdet | 1 | 19.4673 | 33.0262 | 65.8656 | 489.6565 | 124.4288 | 119.7085 |
| hf_T5_large | 2 | 14.0978 | 34.1107 | nan | nan | 106.0265 | 103.9359 |
| mobilenet_v2_quantized_qat | 96 | 1.3429 | 7.3211 | nan | 183.0044 | 83.146 | 83.3415 |
| resnet50_quantized_qat | 32 | 1.1746 | 6.9947 | nan | 166.6848 | 69.3921 | 69.2173 |
| timm_nfnet | 128 | 2.0316 | 6.0865 | nan | 153.3715 | 57.5558 | 57.7133 |
| timm_vision_transformer_large | 8 | 2.4778 | 11.1282 | nan | nan | 52.9717 | 49.9772 |
| attention_is_all_you_need_pytorch | 256 | 1.1954 | 5.5311 | 8.8432 | 117.0489 | 42.1644 | 41.1417 |
| densenet121 | 4 | 2.2003 | 9.9041 | 15.7664 | 164.6308 | 41.8399 | 40.3834 |
| resnet152 | 32 | 2.4928 | 11.0786 | 17.9896 | 191.9334 | 40.5206 | 38.9283 |
| timm_resnest | 32 | 0.5848 | 2.0251 | 3.0357 | 59.2368 | 39.6217 | 39.6549 |
| timm_vision_transformer | 8 | 0.8698 | 3.3907 | 4.8067 | 73.2273 | 32.7642 | 30.5594 |
| hf_Bart | 4 | 1.6707 | 6.6558 | 10.045 | 124.428 | 29.1454 | 28.0159 |
| fastNLP_Bert | 6 | 1.5436 | 5.3163 | 8.5865 | 79.0488 | 26.7062 | 24.7113 |
| BERT_pytorch | 16 | 1.5199 | 5.9385 | 8.8844 | 83.9364 | 26.2045 | 26.0516 |
| pytorch_stargan | 16 | 0.4269 | 1.6869 | 2.4333 | nan | 25.8069 | 25.7623 |
| speech_transformer | 32 | 1.6038 | 6.6953 | 26.3553 | 132.0394 | 23.8676 | 23.0901 |
| timm_regnet | 32 | 2.2907 | 6.5354 | 17.1332 | 112.5755 | 23.1997 | 22.5121 |
| timm_efficientnet | 32 | 1.7348 | 5.6702 | 13.6002 | 107.2096 | 23.0774 | 22.0953 |
| pytorch_struct | 200 | 0.2544 | 0.6346 | 1.1464 | 4.8212 | 22.5135 | 19.1357 |
| mobilenet_v3_large | 32 | 0.9123 | 3.8379 | 5.7347 | 98.9288 | 22.2284 | 22.1208 |
| hf_Bert | 4 | 1.6157 | 5.2218 | 7.6901 | 81.9716 | 18.9409 | 18.3124 |
| mnasnet1_0 | 32 | 0.8318 | 3.532 | 5.2027 | 71.5152 | 18.0873 | 17.5381 |
| timm_vovnet | 32 | 1.4979 | 3.7862 | 9.0072 | 56.5464 | 18.0086 | 17.0891 |
| Super_SloMo | 6 | 1.1289 | 6.4863 | nan | 56.8753 | 17.9164 | 17.7553 |
| hf_Reformer | 4 | 1.5872 | 2.6314 | 4.7964 | 14.802 | 17.7986 | 15.2907 |
| resnet50 | 32 | 0.8855 | 3.73 | 5.7792 | 76.9769 | 17.6175 | 17.0391 |
| shufflenet_v2_x1_0 | 128 | 0.9493 | 4.2127 | 6.2406 | 85.7455 | 17.3056 | 16.6905 |
| hf_GPT2 | 4 | 1.4771 | 5.052 | 7.5016 | 59.5194 | 17.1817 | 17.1151 |
| hf_Albert | 8 | 1.2635 | 4.6425 | 7.3729 | 111.6843 | 17.0582 | 16.3167 |
| resnext50_32x4d | 8 | 0.9128 | 3.7703 | 5.5345 | 63.5511 | 16.8674 | 16.1514 |
| Background_Matting | 4 | 0.7666 | 7.8846 | nan | nan | 16.3852 | 15.7681 |
| mobilenet_v2 | 96 | 0.835 | 3.7647 | 6.0826 | 97.3752 | 16.0498 | 16.1428 |
| hf_DistilBert | 8 | 0.656 | 2.5799 | 4.4865 | 44.2064 | 11.8434 | 11.2371 |
| resnet18 | 16 | 0.4217 | 1.4646 | 2.1518 | 29.8792 | 10.5478 | 10.4107 |
| pytorch_unet | 1 | 0.4593 | 2.716 | nan | nan | 8.2886 | 8.1285 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3766 | 1.5952 | 2.2484 | 32.1908 | 7.9052 | 8.0134 |
| LearningToPaint | 96 | 0.4418 | 1.5637 | 2.3799 | 38.0753 | 6.9488 | 6.4601 |
| dcgan | 32 | 0.173 | 0.3611 | 0.5642 | 4.3273 | 5.8557 | 5.6833 |
| squeezenet1_1 | 32 | 0.2312 | 0.6518 | 1.0504 | 4.3727 | 3.8237 | 3.531 |
| drq | 1 | 0.3102 | 0.5156 | 0.8784 | 4.3617 | 3.622 | 3.2039 |
| soft_actor_critic | 256 | 0.2121 | 0.3077 | 0.5139 | 1.6228 | 3.3155 | 2.7886 |
| vgg16 | 64 | 0.1804 | 0.4844 | 0.7583 | 3.1265 | 3.2715 | 3.2148 |
| nvidia_deeprecommender | 256 | 0.2041 | 0.3781 | 0.6329 | 4.8541 | 3.2695 | 2.9961 |
| dlrm | 2048 | nan | nan | nan | nan | 3.2266 | 2.7927 |
| alexnet | 128 | 0.1561 | 0.3266 | 0.5517 | 2.9155 | 2.8255 | 2.6227 |
| lennard_jones | 1000 | 0.1409 | 0.2477 | 0.3898 | 1.3691 | 1.9306 | 1.691 |
| tts_angular | 64 | 0.1769 | 0.2219 | 0.3471 | 1.0808 | 1.8281 | 1.6147 |
| demucs | 4 | 0.3003 | 0.3159 | 0.2992 | 0.3117 | 0.2128 | 0.2109 |
| tacotron2 | 64 | 4.8728 | 14.2516 | nan | 43.5076 | nan | 43.6939 |
| hf_GPT2_large | 4 | 5.1987 | 16.0391 | nan | nan | nan | 43.0863 |
| hf_T5 | 8 | 2.5927 | 7.7199 | nan | 75.5526 | nan | 26.5811 |
| functorch_dp_cifar10 | 64 | 0.3052 | 1.1036 | 1.6813 | nan | nan | nan |
| hf_BigBird | 2 | 3.3446 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.5274 | 1.5274 |
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.226 | 1.4599 | 1.4604 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2635 | 0.988 | 1.3107 | 1.3923 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1743 | 1.2832 |
| timm_efficientdet | 1 | 1.0111 | 0.823 | 0.2891 | 1.1335 | 1.1162 | 1.1442 |
| Super_SloMo | 6 | 1.0023 | 0.9018 | nan | 0.9454 | 1.1136 | 1.341 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 1.0823 | 1.1864 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 1.0433 | 1.1066 |
| speech_transformer | 32 | 0.9982 | 0.9772 | 0.2738 | 1.1206 | 1.0396 | 1.0443 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| hf_GPT2 | 4 | 1.0 | 0.906 | 0.3702 | 1.1242 | 0.9703 | 1.1698 |
| timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.7594 | 0.9435 | 1.0968 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9405 | 1.0831 |
| Background_Matting | 4 | 0.9998 | 0.8154 | nan | nan | 0.9342 | 1.0395 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.3341 | 0.8549 | 0.923 | 1.1042 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9162 | 0.392 | 0.8945 | 0.9183 | 0.9986 |
| resnet152 | 32 | 0.9975 | 0.9153 | 0.3424 | 0.8736 | 0.9066 | 0.9672 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 |
| timm_resnest | 32 | 0.9935 | 0.88 | 0.3236 | 0.7926 | 0.8982 | 1.0023 |
| hf_Albert | 8 | 1.0 | 0.949 | 0.2846 | 1.062 | 0.8836 | 1.2215 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8098 | 0.8829 | 0.896 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8738 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.857 | 1.0006 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8566 | 0.9343 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8531 | 0.8659 |
| pytorch_unet | 1 | 0.9985 | 0.8222 | nan | nan | 0.8484 | 1.0138 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 1.2124 | 0.8354 | 1.1229 |
| hf_Bart | 4 | 1.0 | 0.8777 | 0.3388 | 1.0867 | 0.8325 | 1.1284 |
| resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8196 | 0.8303 | 0.8352 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | 0.3504 | 1.1289 | 0.826 | 1.0815 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7609 | 0.9526 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3307 | 1.0652 | 0.7507 | 0.8214 |
| soft_actor_critic | 256 | 0.9998 | 0.9638 | 0.4356 | 0.9637 | 0.75 | 0.9991 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8335 |
| hf_Bert | 4 | 1.0 | 0.9011 | 0.3525 | 1.0004 | 0.7061 | 1.0275 |
| dlrm | 2048 | nan | nan | nan | nan | 0.7035 | 0.7307 |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6902 | 0.7049 |
| LearningToPaint | 96 | 0.9442 | 0.6896 | 0.3385 | 0.6268 | 0.6882 | 0.913 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9553 |
| hf_DistilBert | 8 | 1.0 | 0.9042 | 0.3212 | 1.0228 | 0.6595 | 0.9466 |
| hf_Reformer | 4 | 0.9999 | 0.9996 | 0.5934 | 0.9995 | 0.577 | 1.0026 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9676 | 0.4867 | 0.6781 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4213 | 0.4334 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.2564 | 0.2576 |
| hf_GPT2_large | 4 | 1.0 | 0.8833 | nan | nan | nan | 1.1831 |
| tacotron2 | 64 | 0.9903 | 1.0926 | nan | 1.114 | nan | 1.1617 |
| hf_T5 | 8 | 1.0 | 0.9415 | nan | 0.9432 | nan | 1.1439 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | nan | nan |
| hf_BigBird | 2 | 0.907 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------+------+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+----------+-----------+----------------+-----------------+----------+------------------------+
| dlrm | 2048 | nan | nan | nan | nan | 529.4054 | 439.5616 |
| timm_vision_transformer_large | 8 | 196.3032 | 198.6271 | nan | nan | 187.7582 | 192.3282 |
| timm_nfnet | 128 | 205.7073 | 206.0187 | nan | 181.274 | 173.0315 | 177.8934 |
| Background_Matting | 4 | 193.3555 | 972.1469 | nan | nan | 172.1589 | 173.48 |
| mobilenet_v2_quantized_qat | 96 | 146.8341 | 150.3634 | nan | 101.0142 | 140.6043 | 138.2111 |
| hf_T5_large | 2 | 189.1722 | 213.1661 | nan | nan | 119.0845 | 123.3992 |
| Super_SloMo | 6 | 117.7546 | 485.2174 | nan | 474.2596 | 102.7856 | 104.1082 |
| yolov3 | 16 | 102.2976 | 102.5593 | 128.9733 | 88.8053 | 94.5299 | 95.1959 |
| resnet50_quantized_qat | 32 | 92.8819 | 95.3228 | nan | 80.7558 | 93.9204 | 90.9069 |
| vgg16 | 64 | 106.7598 | 106.7789 | 123.8231 | 106.5036 | 90.7125 | 91.1032 |
| timm_regnet | 32 | 101.6102 | 101.6388 | 125.3646 | 89.2014 | 87.2825 | 90.3563 |
| demucs | 4 | 77.7149 | 77.7819 | 77.5508 | 77.6993 | 77.5825 | 77.6717 |
| hf_Reformer | 4 | 83.327 | 82.8958 | 84.0203 | 112.4333 | 73.5765 | 72.9389 |
| resnet152 | 32 | 91.4133 | 85.3451 | 113.3507 | 99.2422 | 73.4808 | 75.7634 |
| attention_is_all_you_need_pytorch | 256 | 71.9612 | 74.039 | 95.3371 | 75.312 | 68.8382 | 69.4871 |
| mobilenet_v2 | 96 | 71.4698 | 71.5264 | 98.0744 | 53.4803 | 49.956 | 50.7745 |
| pytorch_unet | 1 | 58.5624 | 207.7289 | nan | nan | 48.7799 | 49.314 |
| hf_Bart | 4 | 54.283 | 56.5541 | 74.583 | 63.8358 | 45.9139 | 45.9868 |
| hf_Albert | 8 | 75.0773 | 75.1505 | 99.9228 | 48.0228 | 45.6142 | 45.6321 |
| fastNLP_Bert | 6 | 59.6136 | 60.7715 | 79.1753 | 51.5683 | 42.0829 | 42.9103 |
| timm_vovnet | 32 | 42.225 | 42.6107 | 53.995 | 42.5543 | 39.1036 | 37.6271 |
| speech_transformer | 32 | 48.4068 | 57.1615 | 35.3542 | 64.7634 | 35.408 | 35.261 |
| timm_efficientdet | 1 | 137.6143 | 151.5726 | 76.2296 | 179.0932 | 34.5179 | 98.0381 |
| hf_GPT2 | 4 | 49.9402 | 51.2861 | 68.08 | 124.7623 | 33.6815 | 33.5086 |
| timm_efficientnet | 32 | 43.8021 | 52.5105 | 61.031 | 51.8834 | 33.1914 | 36.4304 |
| hf_DistilBert | 8 | 38.8108 | 40.4992 | 56.5671 | 74.1762 | 33.1524 | 32.9279 |
| hf_Bert | 4 | 37.9955 | 38.9952 | 53.494 | 48.0827 | 32.4284 | 32.8822 |
| resnet50 | 32 | 38.7282 | 38.8871 | 51.179 | 38.8144 | 32.1339 | 33.0762 |
| shufflenet_v2_x1_0 | 128 | 37.1476 | 35.6445 | 45.8173 | 41.0737 | 23.4383 | 26.0191 |
| BERT_pytorch | 16 | 45.8683 | 52.1962 | 41.951 | 49.7492 | 22.7797 | 23.0517 |
| timm_resnest | 32 | 31.6734 | 31.5512 | 39.2459 | 27.1062 | 20.8361 | 21.7914 |
| mnasnet1_0 | 32 | 28.3603 | 26.0116 | 33.0839 | 31.5885 | 19.4332 | 22.5972 |
| pytorch_stargan | 16 | 24.251 | 22.5105 | 25.8949 | nan | 19.0903 | 19.6539 |
| mobilenet_v3_large | 32 | 31.1289 | 28.2167 | 30.3164 | 36.3317 | 16.101 | 23.7571 |
| resnext50_32x4d | 8 | 26.7258 | 24.0923 | 22.1164 | 33.4107 | 13.4273 | 22.8719 |
| densenet121 | 4 | 65.1844 | 63.123 | 28.1693 | 84.3881 | 12.9635 | 57.737 |
| LearningToPaint | 96 | 15.5603 | 14.7595 | 18.0355 | 15.9507 | 12.3487 | 13.7866 |
| alexnet | 128 | 12.4129 | 12.4275 | 15.4538 | 12.3483 | 10.6807 | 10.6902 |
| timm_vision_transformer | 8 | 23.3396 | 24.8556 | 15.7023 | 36.2359 | 9.6348 | 17.4825 |
| nvidia_deeprecommender | 256 | 8.5342 | 8.8507 | 14.5826 | 8.7427 | 9.4271 | 8.8402 |
| pytorch_CycleGAN_and_pix2pix | 1 | 16.7132 | 16.018 | 12.646 | 19.4717 | 9.1782 | 12.2611 |
| tts_angular | 64 | 9.4026 | 9.7132 | 9.3965 | 9.5484 | 9.1695 | 9.2865 |
| squeezenet1_1 | 32 | 12.7918 | 12.5206 | 12.1437 | 14.7083 | 7.4001 | 10.2589 |
| resnet18 | 16 | 12.0185 | 10.6852 | 10.2716 | 13.7102 | 6.5924 | 9.8691 |
| pytorch_struct | 200 | 3.8608 | 5.0133 | 4.2598 | 4.8653 | 2.0966 | 3.3209 |
| dcgan | 32 | 2.6815 | 2.5564 | 2.0876 | 3.2457 | 1.3536 | 2.5367 |
| drq | 1 | 2.998 | 3.4891 | 2.0652 | 4.3357 | 1.2493 | 3.9696 |
| soft_actor_critic | 256 | 1.071 | 1.3215 | 0.9807 | 1.527 | 0.7461 | 1.1172 |
| lennard_jones | 1000 | 1.1087 | 1.2585 | 1.0341 | 1.5926 | 0.7277 | 1.2205 |
| tacotron2 | 64 | 2814.305 | 3080.4445 | nan | 3464.1026 | nan | 3507.9445 |
| hf_GPT2_large | 4 | 240.6819 | 245.6851 | nan | nan | nan | 163.563 |
| hf_T5 | 8 | 183.2686 | 191.968 | nan | 156.3869 | nan | 117.0367 |
| functorch_dp_cifar10 | 64 | 11.4872 | 11.025 | 5.8705 | nan | nan | nan |
| hf_BigBird | 2 | 208.2946 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+----------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9989 | 0.9188 | 0.0 | 0.8873 | 1.8036 | 1.8129 |
| GPT2ForSequenceClassification | 4 | 1.0009 | 0.9784 | 0.0 | 0.7072 | 1.7738 | 1.7651 |
| XLNetLMHeadModel | 8 | 1.0 | 0.9656 | 0.0 | 0.0 | 1.7068 | 1.6989 |
| MT5ForConditionalGeneration | 16 | 1.0233 | 0.9236 | 0.9166 | 1.0418 | 1.686 | 1.5259 |
| GoogleFnet | 16 | 0.9998 | 0.999 | 0.0 | 1.5413 | 1.4428 | 1.5598 |
| DistillGPT2 | 16 | 0.9997 | 0.9526 | 0.0 | 0.9164 | 1.4397 | 1.4847 |
| T5Small | 4 | 1.0 | 0.9238 | 0.7266 | 1.1069 | 1.428 | 1.4069 |
| ElectraForQuestionAnswering | 64 | 1.0003 | 0.9851 | 0.0 | 1.1931 | 1.4279 | 1.4091 |
| T5ForConditionalGeneration | 4 | 0.9992 | 0.933 | 0.7242 | 1.089 | 1.4263 | 1.4289 |
| ElectraForCausalLM | 32 | 1.0003 | 0.9221 | 0.0 | 1.0187 | 1.413 | 1.4514 |
| MobileBertForMaskedLM | 64 | 1.0234 | 0.9226 | 0.793 | 0.0 | 1.4083 | 1.1794 |
| MobileBertForQuestionAnswering | 128 | 1.0259 | 0.9421 | 0.0 | 0.0 | 1.3631 | 1.1097 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9896 | 0.7377 | 1.1065 | 1.3107 | 1.2904 |
| RobertaForQuestionAnswering | 16 | 0.9999 | 0.9886 | 0.7345 | 1.1145 | 1.2941 | 1.2674 |
| BertForQuestionAnswering | 16 | 1.0004 | 0.9876 | 0.7313 | 1.1151 | 1.2861 | 1.2682 |
| RobertaForCausalLM | 16 | 1.0008 | 0.9715 | 0.0 | 1.0571 | 1.2667 | 1.2702 |
| AlbertForQuestionAnswering | 4 | 1.0004 | 1.0016 | 0.0 | 1.2372 | 1.2585 | 1.2567 |
| AlbertForMaskedLM | 4 | 1.0005 | 0.9999 | 0.0 | 1.2278 | 1.2542 | 1.2533 |
| MegatronBertForQuestionAnswering | 8 | 1.0001 | 0.9801 | 0.0 | 1.0932 | 1.22 | 1.204 |
| MegatronBertForCausalLM | 4 | 1.0002 | 0.9839 | 0.7254 | 1.0623 | 1.2134 | 1.1997 |
| LayoutLMForMaskedLM | 16 | 1.0001 | 0.9711 | 0.0 | 1.062 | 1.1932 | 1.1953 |
| CamemBert | 16 | 1.0 | 0.9699 | 0.0 | 1.0603 | 1.1727 | 1.1779 |
| BertForMaskedLM | 16 | 1.0004 | 0.9606 | 0.0 | 1.0619 | 1.1725 | 1.1804 |
| YituTechConvBert | 16 | 1.0005 | 0.9684 | 0.0 | 1.0036 | 1.1724 | 1.1723 |
| PLBartForConditionalGeneration | 4 | 1.0002 | 0.9633 | 0.0 | 0.9708 | 1.1626 | 1.1647 |
| M2M100ForConditionalGeneration | 16 | 1.0766 | 0.9738 | 0.0 | 1.0365 | 1.1613 | 1.0707 |
| DistilBertForQuestionAnswering | 256 | 0.9999 | 0.9995 | 0.0 | 0.7857 | 1.1572 | 1.1529 |
| PLBartForCausalLM | 8 | 1.0 | 0.9518 | 0.0 | 0.9629 | 1.1543 | 1.1557 |
| XGLMForCausalLM | 8 | 1.0103 | 0.939 | 0.7392 | 0.3118 | 1.1522 | 1.1397 |
| Reformer | 16 | 0.9993 | 0.9999 | 0.9778 | 0.9759 | 1.1371 | 1.1509 |
| MBartForConditionalGeneration | 2 | 1.0 | 0.9874 | 0.0 | 1.0193 | 1.0964 | 1.0877 |
| BartForConditionalGeneration | 2 | 1.0003 | 0.9884 | 0.0 | 0.4471 | 1.093 | 1.0859 |
| MBartForCausalLM | 4 | 1.0005 | 0.9659 | 0.7544 | 0.9991 | 1.0838 | 1.0946 |
| BartForCausalLM | 4 | 1.0004 | 0.965 | 0.7545 | 1.0024 | 1.083 | 1.0915 |
| DebertaForMaskedLM | 4 | 0.8915 | 0.8023 | 0.7239 | 0.636 | 1.0742 | 1.0348 |
| DebertaForQuestionAnswering | 8 | 0.9966 | 0.974 | 0.6835 | 0.8634 | 1.0491 | 1.2185 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0005 | 0.9381 | 0.0 | 0.9531 | 1.0344 | 1.042 |
| PegasusForConditionalGeneration | 32 | 0.9992 | 0.9801 | 0.0 | 0.9797 | 1.0174 | 1.0102 |
| DistilBertForMaskedLM | 128 | 0.9996 | 0.9527 | 0.0 | 0.8023 | 1.0144 | 1.0325 |
| DebertaV2ForMaskedLM | 1 | 0.87 | 0.7267 | 0.0 | 0.0 | 1.0025 | 0.8488 |
| Speech2Text2ForCausalLM | 256 | 0.9984 | 0.9251 | 0.6511 | 0.9405 | 0.9907 | 1.0267 |
| PegasusForCausalLM | 32 | 0.9993 | 0.9546 | 0.7331 | 0.9526 | 0.9726 | 0.9817 |
| BlenderbotSmallForCausalLM | 64 | 1.0013 | 0.9066 | 0.6833 | 0.9195 | 0.9567 | 0.9893 |
| DebertaV2ForQuestionAnswering | 2 | 0.9043 | 0.7851 | 0.0 | 0.5764 | 0.9244 | 0.9075 |
| TrOCRForCausalLM | 32 | 1.0003 | 0.9563 | 0.0 | 0.9623 | 0.0 | 1.0282 |
| BlenderbotForCausalLM | 4 | 1.0031 | 0.9828 | 0.0 | 0.9547 | 0.0 | 1.0151 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | pass |
| 
PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | DebertaV2ForMaskedLM | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | 0.0000 | | BlenderbotForCausalLM | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaV2ForMaskedLM | 1 | 7.3865 | 15.3451 | nan | nan | 162.0953 | 48.897 | | DebertaV2ForQuestionAnswering | 2 | 7.3132 | 15.2209 | nan | 158.225 | 161.8923 | 48.3798 | | XLNetLMHeadModel | 8 | 4.7144 | 17.0195 | nan | nan | 120.5635 | 122.4315 | | DebertaForQuestionAnswering | 8 | 4.5928 | 10.1933 | 33.8646 | 75.4601 | 101.8342 | 36.243 | | DebertaForMaskedLM | 4 | 4.6768 | 10.0759 | 33.5681 | 85.8039 | 91.7106 | 34.9605 | | XGLMForCausalLM | 8 | 2.5842 | 10.0409 | 21.7014 | 203.4112 | 70.7296 | 67.2277 | | MobileBertForQuestionAnswering | 128 | 8.5803 | 24.1419 | nan | nan | 54.3009 | 51.5768 | | MobileBertForMaskedLM | 64 | 8.7638 | 23.6379 | 40.696 | nan | 52.2288 | 50.4923 | | MT5ForConditionalGeneration | 16 | 3.7655 | 11.9401 | 17.8597 | 129.4502 | 50.0313 | 49.4657 | | M2M100ForConditionalGeneration | 16 | 3.105 | 13.3882 | nan | 249.0752 | 48.4502 | 50.5543 | | PegasusForConditionalGeneration | 32 
| 2.9972 | 12.204 | nan | 298.3121 | 43.0994 | 38.7898 | | BartForConditionalGeneration | 2 | 3.216 | 12.6612 | nan | 286.592 | 43.0539 | 41.9564 | | MBartForConditionalGeneration | 2 | 3.2562 | 12.9807 | nan | 321.7214 | 42.4579 | 40.4498 | | YituTechConvBert | 16 | 2.352 | 8.4291 | nan | 120.5042 | 39.5516 | 36.725 | | MegatronBertForCausalLM | 4 | 3.6408 | 10.6164 | 16.9435 | 206.102 | 34.7616 | 31.8673 | | MegatronBertForQuestionAnswering | 8 | 3.2107 | 10.6458 | nan | 200.8069 | 32.5956 | 31.049 | | T5ForConditionalGeneration | 4 | 2.4465 | 7.5697 | 12.2001 | 79.5214 | 29.9271 | 29.5318 | | T5Small | 4 | 2.6083 | 7.6869 | 12.1765 | 79.725 | 29.4905 | 29.2608 | | BlenderbotSmallForConditionalGeneration | 64 | 2.041 | 8.2348 | nan | 167.0393 | 29.455 | 28.3672 | | LayoutLMForSequenceClassification | 16 | 1.9384 | 5.7316 | 8.8801 | 90.3093 | 27.3105 | 26.8507 | | GoogleFnet | 16 | 0.9311 | 2.8904 | nan | 45.8955 | 26.3845 | 19.3557 | | ElectraForCausalLM | 32 | 1.5815 | 5.502 | nan | 85.923 | 26.0839 | 24.2202 | | PLBartForConditionalGeneration | 4 | 1.6766 | 6.6681 | nan | 121.4207 | 25.6845 | 24.6785 | | PegasusForCausalLM | 32 | 1.2305 | 4.8951 | 7.9196 | 84.5418 | 21.5051 | 20.1309 | | MBartForCausalLM | 4 | 1.3128 | 4.9519 | 7.6832 | 90.4777 | 20.7358 | 20.8666 | | LayoutLMForMaskedLM | 16 | 1.9736 | 5.7592 | nan | 86.9363 | 20.6509 | 20.6893 | | BertForMaskedLM | 16 | 1.6762 | 5.5946 | nan | 85.5962 | 20.0493 | 19.3311 | | BartForCausalLM | 4 | 1.3097 | 4.8645 | 7.3061 | 81.1218 | 19.8835 | 18.8534 | | BertForQuestionAnswering | 16 | 1.583 | 5.4516 | 8.2873 | 86.8999 | 19.8107 | 18.7172 | | ElectraForQuestionAnswering | 64 | 1.5594 | 5.3755 | nan | 85.1881 | 19.625 | 19.4381 | | RobertaForCausalLM | 16 | 1.5875 | 5.4647 | nan | 87.4435 | 19.5156 | 18.833 | | CamemBert | 16 | 1.5418 | 5.4709 | nan | 87.2421 | 19.2563 | 18.3182 | | RobertaForQuestionAnswering | 16 | 1.5818 | 5.2876 | 8.1304 | 85.9992 | 18.546 | 18.1158 | | OPTForCausalLM | 2 | 1.2966 | 5.1406 
| nan | 76.3168 | 17.5613 | 16.409 | | Reformer | 16 | 1.4644 | 2.7026 | 5.1206 | 16.133 | 17.3206 | 14.4002 | | GPT2ForSequenceClassification | 4 | 1.5213 | 5.3965 | nan | 61.9085 | 16.4048 | 16.1805 | | AlbertForMaskedLM | 4 | 1.3918 | 4.9118 | nan | 116.3496 | 15.358 | 15.3492 | | AlbertForQuestionAnswering | 4 | 1.4239 | 4.8061 | nan | 109.4566 | 15.0014 | 14.851 | | BlenderbotSmallForCausalLM | 64 | 0.8453 | 3.4849 | 4.9669 | 50.9602 | 13.9019 | 13.7653 | | Speech2Text2ForCausalLM | 256 | 0.7649 | 2.6141 | 4.3746 | 38.5446 | 13.7727 | 12.5531 | | DistillGPT2 | 16 | 0.812 | 2.716 | nan | 34.4056 | 13.1958 | 13.1645 | | PLBartForCausalLM | 8 | 0.718 | 2.6233 | nan | 44.4644 | 12.5128 | 12.2209 | | DistilBertForMaskedLM | 128 | 0.6602 | 2.664 | nan | 40.5922 | 10.8517 | 10.5887 | | DistilBertForQuestionAnswering | 256 | 0.7088 | 2.7659 | nan | 41.6128 | 10.1469 | 9.9705 | | BlenderbotForCausalLM | 4 | 2.3243 | 9.3711 | nan | 204.6385 | nan | 36.7878 | | TrOCRForCausalLM | 32 | 1.2935 | 4.932 | nan | 80.6792 | nan | 18.7227 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 1.0 | 0.9092 | nan | 1.1724 | 1.0595 | 1.1588 | | XLNetLMHeadModel | 8 | 1.0 | 0.9323 | nan | nan | 0.9946 | 0.9946 | | GoogleFnet | 16 | 0.9224 | 0.9224 | nan | 1.4614 | 0.9608 | 1.2768 | | PLBartForConditionalGeneration | 4 | 0.9999 | 0.9344 | nan | 1.274 | 0.9316 | 1.2234 | | 
OPTForCausalLM | 2 | 1.0001 | 0.9258 | nan | 1.0746 | 0.9068 | 1.1143 | | YituTechConvBert | 16 | 0.9966 | 0.9341 | nan | 0.9891 | 0.894 | 0.9822 | | DistillGPT2 | 16 | 1.0 | 0.8855 | nan | 1.055 | 0.8939 | 1.0108 | | M2M100ForConditionalGeneration | 16 | 0.9993 | 0.9414 | nan | 1.0265 | 0.8869 | 1.0205 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | 0.8646 | 1.4039 | | PegasusForConditionalGeneration | 32 | 0.9981 | 0.9529 | nan | 1.1152 | 0.8637 | 1.0262 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | 0.842 | 1.3737 | | PLBartForCausalLM | 8 | 1.0 | 0.8896 | nan | 1.0988 | 0.8367 | 1.0581 | | T5Small | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8215 | 1.1049 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8215 | 1.1049 | | XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | 0.9742 | 0.8157 | 0.9642 | | ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.8429 | 0.7929 | 0.9036 | | MBartForConditionalGeneration | 2 | 1.0 | 0.8931 | nan | 0.9681 | 0.7896 | 0.9837 | | MT5ForConditionalGeneration | 16 | 1.0014 | 0.8793 | 0.4388 | 0.9365 | 0.7785 | 0.9242 | | PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | 1.0402 | 0.7774 | 0.9692 | | BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.7734 | 0.958 | | MegatronBertForQuestionAnswering | 8 | 1.0 | 0.9223 | nan | 1.0616 | 0.7709 | 1.0379 | | MegatronBertForCausalLM | 4 | 1.0 | 0.9018 | 0.3475 | 0.9999 | 0.7673 | 1.0153 | | MBartForCausalLM | 4 | 1.0 | 0.9122 | 0.3642 | 1.0011 | 0.7326 | 0.9478 | | BertForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0273 | | RobertaForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0273 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.7189 | 1.0294 | | BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.7149 | 0.9466 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.7147 | 0.8647 | | ElectraForQuestionAnswering | 64 | 1.0 
| 0.9524 | nan | 1.1607 | 0.7054 | 1.0297 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 1.0067 | 0.6977 | 0.946 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | 0.695 | 0.9772 | | BertForMaskedLM | 16 | 1.0 | 0.9408 | nan | 0.9928 | 0.6945 | 0.9772 | | RobertaForCausalLM | 16 | 1.0 | 0.9405 | nan | 0.9926 | 0.6942 | 0.9771 | | CamemBert | 16 | 1.0 | 0.9388 | nan | 0.987 | 0.6942 | 0.9746 | | Speech2Text2ForCausalLM | 256 | 0.9545 | 0.8748 | 0.3515 | 0.8692 | 0.675 | 0.9168 | | DistilBertForQuestionAnswering | 256 | 1.0 | 0.9602 | nan | 1.1897 | 0.6589 | 0.9118 | | DistilBertForMaskedLM | 128 | 1.0 | 0.8847 | nan | 0.8827 | 0.6509 | 0.9194 | | Reformer | 16 | 0.9773 | 0.9773 | 0.5544 | 0.9998 | 0.573 | 1.0028 | | DebertaV2ForMaskedLM | 1 | 1.0 | 0.9651 | nan | nan | 0.5682 | 0.9491 | | MobileBertForMaskedLM | 64 | 1.0 | 0.906 | 0.3175 | nan | 0.4951 | 0.6649 | | DebertaV2ForQuestionAnswering | 2 | 0.9842 | 0.9842 | nan | 0.9842 | 0.4735 | 0.984 | | MobileBertForQuestionAnswering | 128 | 1.0 | 0.9909 | nan | nan | 0.4145 | 0.535 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3552 | 0.9719 | 0.3862 | 1.0347 | | DebertaForQuestionAnswering | 8 | 0.9637 | 1.042 | 0.3072 | 1.1342 | 0.2902 | 1.1339 | | TrOCRForCausalLM | 32 | 1.0 | 0.8787 | nan | 0.9998 | nan | 0.9239 | | BlenderbotForCausalLM | 4 | 1.0001 | 0.8057 | nan | 0.8218 | nan | 0.8509 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForMaskedLM | 4 | 385.8811 | 384.4755 | nan | 311.8321 | 306.5974 | 306.5504 | | AlbertForQuestionAnswering | 4 | 382.8592 | 381.1495 | nan | 307.6403 | 303.8742 | 303.2409 | | Reformer | 16 | 306.0888 | 305.8305 | 313.2599 | 313.0561 | 269.4674 | 265.618 | | XLNetLMHeadModel | 8 | 374.2563 | 387.0332 | nan | nan | 220.0366 | 220.0048 | | PegasusForConditionalGeneration | 32 | 175.897 | 179.443 | nan | 179.2692 | 173.7031 | 174.0306 | | MegatronBertForQuestionAnswering | 8 | 172.0279 | 175.6003 | nan | 157.3921 | 141.3274 | 142.7305 | | BartForConditionalGeneration | 2 | 149.8926 | 151.597 | nan | 335.2815 | 137.2509 | 138.142 | | MBartForConditionalGeneration | 2 | 149.7924 | 151.9781 | nan | 147.1287 | 136.8388 | 137.5198 | | YituTechConvBert | 16 | 155.3191 | 160.1742 | nan | 154.5883 | 132.5956 | 132.347 | | DistilBertForQuestionAnswering | 256 | 144.4849 | 144.4885 | nan | 182.878 | 124.8741 | 125.6902 | | MobileBertForQuestionAnswering | 128 | 140.2783 | 152.5501 | nan | nan | 124.7846 | 128.7725 | | DistilBertForMaskedLM | 128 | 122.1143 | 128.1444 | nan | 152.1835 | 120.4766 | 118.6821 | | MobileBertForMaskedLM | 64 | 147.4828 | 178.6032 | 185.9993 | nan | 119.4933 | 126.6207 | | CamemBert | 16 | 135.5893 | 139.6414 | nan | 127.8354 | 115.983 | 115.1369 | | BlenderbotSmallForConditionalGeneration | 64 | 118.8674 | 126.9196 | nan | 125.0509 | 115.2535 | 114.393 | | LayoutLMForMaskedLM | 16 | 136.7944 | 140.8641 | nan | 128.8868 | 114.786 | 115.041 | | BertForMaskedLM | 16 | 134.1222 | 139.6814 | nan | 126.503 | 114.5803 | 113.6503 | | DebertaV2ForQuestionAnswering | 2 | 116.8079 | 134.3516 | nan | 183.1921 | 114.199 | 116.3429 | | BartForCausalLM | 4 | 123.3335 | 126.5344 | 163.5753 | 123.1354 | 113.9692 | 112.7979 | | MBartForCausalLM | 4 | 123.2504 | 127.4533 | 163.7131 | 123.4729 | 113.7409 | 112.4872 | | 
RobertaForCausalLM | 16 | 143.398 | 146.6814 | nan | 134.7168 | 112.6541 | 112.177 | | PLBartForConditionalGeneration | 4 | 121.425 | 125.9985 | nan | 125.0737 | 104.7426 | 104.4286 | | PLBartForCausalLM | 8 | 118.0932 | 124.5522 | nan | 122.7387 | 102.0922 | 99.6076 | | M2M100ForConditionalGeneration | 16 | 108.1811 | 126.8765 | nan | 111.9641 | 99.9635 | 109.1323 | | OPTForCausalLM | 2 | 168.7316 | 181.6036 | nan | 191.3912 | 93.983 | 93.1055 | | PegasusForCausalLM | 32 | 85.6514 | 89.5237 | 116.794 | 89.7397 | 88.3715 | 87.1299 | | DebertaV2ForMaskedLM | 1 | 100.4475 | 133.2586 | nan | nan | 88.1054 | 103.9728 | | ElectraForQuestionAnswering | 64 | 124.8568 | 127.5962 | nan | 104.6384 | 87.4918 | 88.5424 | | LayoutLMForSequenceClassification | 16 | 112.971 | 114.3095 | 153.4051 | 103.0753 | 86.4025 | 87.554 | | BertForQuestionAnswering | 16 | 110.4132 | 112.5047 | 151.2849 | 99.1237 | 86.0784 | 87.1639 | | RobertaForQuestionAnswering | 16 | 110.8925 | 112.0811 | 150.8343 | 99.4363 | 85.8071 | 87.4627 | | MegatronBertForCausalLM | 4 | 101.9438 | 103.5792 | 141.0935 | 95.7704 | 84.6124 | 84.9342 | | DistillGPT2 | 16 | 120.665 | 126.6776 | nan | 131.7448 | 83.8727 | 81.3108 | | DebertaForQuestionAnswering | 8 | 81.9254 | 83.7751 | 119.828 | 94.4281 | 77.9092 | 67.0568 | | ElectraForCausalLM | 32 | 105.6236 | 114.7142 | nan | 103.742 | 75.0375 | 72.8734 | | T5ForConditionalGeneration | 4 | 104.3724 | 111.2166 | 144.7017 | 97.0557 | 73.3675 | 73.022 | | T5Small | 4 | 104.1619 | 112.7229 | 144.1861 | 94.1051 | 73.0951 | 73.8011 | | GoogleFnet | 16 | 101.671 | 101.6809 | nan | 65.9948 | 70.5277 | 65.1638 | | XGLMForCausalLM | 8 | 78.9468 | 85.0135 | 109.0319 | 256.93 | 70.1915 | 69.9348 | | BlenderbotSmallForCausalLM | 64 | 64.6761 | 71.9471 | 94.708 | 70.5196 | 67.6934 | 65.4853 | | Speech2Text2ForCausalLM | 256 | 63.6024 | 68.9631 | 97.6053 | 67.9677 | 64.5208 | 62.2234 | | GPT2ForSequenceClassification | 4 | 103.0274 | 104.5771 | nan | 144.3479 | 57.6241 | 58.2457 | 
| MT5ForConditionalGeneration | 16 | 87.9092 | 105.2152 | 96.8417 | 85.2268 | 57.5999 | 59.41 | | DebertaForMaskedLM | 4 | 66.3811 | 74.2774 | 83.1315 | 94.0724 | 55.7966 | 58.0268 | | TrOCRForCausalLM | 32 | 167.2324 | 174.5465 | nan | 174.8965 | nan | 163.1929 | | BlenderbotForCausalLM | 4 | 92.9117 | 95.2619 | nan | 97.2641 | nan | 92.2696 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
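
One way to summarize a speedup table like the ones above is a per-backend geometric mean. The sketch below is only illustrative: `parse_rows` and `geomean_speedup` are hypothetical helper names, not part of the benchmark tooling, and it assumes that a speedup of `0.0` (or `nan`) marks a run that produced no measurement, which is consistent with the `nan` latency entries for the same models but is not stated in the dashboard itself. The three sample rows are copied from the speedup table above.

```python
import math

# Three rows copied from the speedup table above, kept in the same pipe-delimited layout.
SAMPLE = """
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
| OPTForCausalLM | 2 | 0.9989 | 0.9188 | 0.0 | 0.8873 | 1.8036 | 1.8129 |
| TrOCRForCausalLM | 32 | 1.0003 | 0.9563 | 0.0 | 0.9623 | 0.0 | 1.0282 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"""


def parse_rows(table_text):
    """Parse a pipe-delimited table into a list of dicts keyed by the header row."""
    header = None
    rows = []
    for line in table_text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ border lines and ~~~ fence markers
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def geomean_speedup(rows, backend):
    """Return (geometric mean, count) over rows with a positive, finite speedup.

    Assumption: 0.0 and nan mean "no measurement" and are excluded.
    """
    values = []
    for row in rows:
        value = float(row[backend])
        if math.isfinite(value) and value > 0:
            values.append(value)
    if not values:
        return float("nan"), 0
    log_mean = sum(math.log(v) for v in values) / len(values)
    return math.exp(log_mean), len(values)


if __name__ == "__main__":
    parsed = parse_rows(SAMPLE)
    for backend in ("inductor", "inductor_no_cudagraphs"):
        mean, count = geomean_speedup(parsed, backend)
        print(f"{backend}: geomean={mean:.4f} over {count} model(s)")
```

Reporting the count next to the mean matters here: a backend that fails on many models can otherwise look better than one that runs all of them.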

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9996 | 0.9734 | 0.8284 | 1.2914 | 1.8713 | 1.8297 |
| lcnet_050 | 128 | 0.9557 | 0.9498 | 0.7684 | 1.3476 | 1.6411 | 1.62 |
| regnety_002 | 128 | 0.9776 | 0.9993 | 0.8642 | 0.9666 | 1.4884 | 1.33 |
| hrnet_w18 | 128 | 0.9997 | 0.9956 | 0.0 | 1.2718 | 1.4175 | 1.3801 |
| dla102 | 128 | 1.0 | 1.0006 | 0.0 | 1.2839 | 1.3853 | 1.3694 |
| volo_d1_224 | 64 | 0.9998 | 0.9935 | 0.8027 | 0.0 | 1.3795 | 1.3586 |
| res2net50_14w_8s | 128 | 1.0 | 0.9991 | 0.0 | 1.2518 | 1.3573 | 1.3197 |
| coat_lite_mini | 128 | 0.9999 | 1.0001 | 0.8462 | 1.0935 | 1.3463 | 1.3317 |
| xcit_large_24_p8_224 | 5 | 1.0024 | 0.9967 | 0.7905 | 0.0 | 1.3417 | 1.2966 |
| mobilenetv3_large_100 | 128 | 0.9656 | 0.9629 | 0.7647 | 1.286 | 1.3387 | 1.3434 |
| mobilenetv2_100 | 128 | 0.966 | 0.9618 | 0.7071 | 1.2879 | 1.3373 | 1.3526 |
| gluon_inception_v3 | 128 | 1.0 | 0.998 | 0.0 | 1.1293 | 1.3279 | 1.3091 |
| adv_inception_v3 | 128 | 1.0 | 0.9987 | 0.0 | 1.1297 | 1.3278 | 1.3082 |
| inception_v3 | 128 | 1.0 | 0.9965 | 0.0 | 1.1294 | 1.3265 | 1.3088 |
| crossvit_9_240 | 128 | 0.9997 | 0.9986 | 0.7606 | 1.0401 | 1.326 | 1.3011 |
| resnest101e | 64 | 1.0 | 1.0028 | 0.0 | 1.1684 | 1.3158 | 1.2691 |
| res2next50 | 128 | 0.9998 | 1.0009 | 0.0 | 1.1808 | 1.3119 | 1.2746 |
| fbnetv3_b | 128 | 0.9646 | 0.9612 | 0.7624 | 1.2368 | 1.2875 | 1.2976 |
| botnet26t_256 | 128 | 0.9856 | 0.9847 | 0.7899 | 0.0 | 1.2778 | 1.2715 |
| gmixer_24_224 | 128 | 0.9999 | 0.8345 | 0.0 | 1.0831 | 1.2718 | 1.2555 |
| sebotnet33ts_256 | 64 | 0.9758 | 0.806 | 0.0 | 0.0 | 1.2678 | 1.2723 |
| selecsls42b | 128 | 0.9999 | 0.9986 | 0.816 | 1.2154 | 1.2677 | 1.2533 |
| eca_botnext26ts_256 | 128 | 0.9863 | 0.7722 | 0.0 | 0.0 | 1.2644 | 1.2491 |
| mnasnet_100 | 128 | 0.9669 | 0.9641 | 0.7859 | 1.2534 | 1.2596 | 1.2838 |
| tf_efficientnet_b0 | 128 | 0.9768 | 0.7825 | 0.0 | 1.162 | 1.2551 | 1.268 |
| eca_halonext26ts | 128 | 0.987 | 0.7789 | 0.0 | 0.0 | 1.2538 | 1.2467 |
| fbnetc_100 | 128 | 0.9666 | 0.9629 | 0.7915 | 1.2464 | 1.2504 | 1.2666 |
| ese_vovnet19b_dw | 128 | 0.9786 | 0.9775 | 0.7446 | 1.1501 | 1.2457 | 1.2484 |
| spnasnet_100 | 128 | 0.9615 | 0.9571 | 0.7608 | 1.2242 | 1.2381 | 1.2567 |
| cspdarknet53 | 64 | 0.9577 | 0.9504 | 0.736 | 1.1753 | 1.2365 | 1.2478 |
| jx_nest_base | 32 | 0.9996 | 0.9933 | 0.737 | 0.0 | 1.233 | 1.229 |
| res2net101_26w_4s | 64 | 0.9998 | 0.9965 | 0.7724 | 1.1097 | 1.2304 | 1.1871 |
| gmlp_s16_224 | 128 | 0.9999 | 0.9992 | 0.0 | 1.0923 | 1.2137 | 1.2005 |
| rexnet_100 | 128 | 0.9721 | 0.8157 | 0.0 | 1.1616 | 1.2135 | 1.2211 |
| convit_base | 64 | 0.9997 | 0.9988 | 0.0 | 0.0 | 1.2116 | 1.2366 |
| cait_m36_384 | 4 | 0.9998 | 0.9988 | 0.0 | 0.0 | 1.2108 | 1.1888 |
| pnasnet5large | 16 | 0.9997 | 0.9953 | 0.0 | 1.0895 | 1.2078 | 1.1931 |
| tinynet_a | 128 | 0.9647 | 0.7758 | 0.6217 | 1.1435 | 1.1903 | 1.2005 |
| tf_mixnet_l | 128 | 0.9856 | 0.8898 | 0.0 | 1.0952 | 1.1889 | 1.1867 |
| dpn107 | 32 | 0.9589 | 0.9504 | 0.7802 | 1.0273 | 1.1888 | 1.2031 |
| dm_nfnet_f0 | 128 | 0.9995 | 0.9982 | 0.0 | 1.1382 | 1.1869 | 1.1571 |
| pit_b_224 | 64 | 0.9999 | 0.9994 | 0.0 | 1.0323 | 1.1859 | 1.1742 |
| twins_pcpvt_base | 64 | 1.0 | 0.9992 | 0.7494 | 0.0 | 1.1778 | 1.1509 |
| mixnet_l | 128 | 0.9849 | 0.8855 | 0.0 | 1.099 | 1.1745 | 1.1748 |
| mobilevit_s | 64 | 0.9795 | 0.7622 | 0.0 | 0.0 | 1.1726 | 1.1709 |
| repvgg_a2 | 128 | 0.9631 | 0.9634 | 0.8283 | 1.1363 | 1.17 | 1.1705 |
| poolformer_m36 | 64 | 0.9997 | 0.9993 | 0.0 | 0.0 | 1.1663 | 1.1483 |
| nfnet_l0 | 128 | 0.9996 | 0.7885 | 0.0 | 1.1054 | 1.1467 | 1.1173 |
| swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9796 | 0.0 | 0.0 | 1.1282 | 1.1174 |
| beit_base_patch16_224 | 64 | 0.9997 | 0.9828 | 0.0 | 0.0 | 1.1135 | 1.1003 |
| swsl_resnext101_32x16d | 32 | 0.9998 | 0.9981 | 0.0 | 1.1084 | 1.1075 | 1.0716 |
| deit_base_distilled_patch16_224 | 64 | 0.9996 | 0.9991 | 0.7693 | 0.9792 | 1.097 | 1.081 |
| gluon_xception65 | 32 | 0.9998 | 0.9977 | 0.0 | 1.0806 | 1.0871 | 1.075 |
| vit_base_patch16_224 | 64 | 0.9999 | 0.9989 | 0.7666 | 0.9508 | 1.0854 | 1.0717 |
| convmixer_768_32 | 32 | 0.9998 | 1.0 | 0.0 | 0.0 | 1.077 | 1.0748 |
| gernet_l | 128 | 0.974 | 0.9726 | 0.8243 | 1.0996 | 1.0768 | 1.0717 |
| mixer_b16_224 | 128 | 0.9993 | 1.0002 | 0.0 | 0.893 | 1.074 | 1.0675 |
| visformer_small | 128 | 0.9996 | 1.0027 | 0.7974 | 0.0 | 1.0418 | 1.011 |
| convnext_base | 64 | 0.9998 | 0.9986 | 0.0 | 0.0 | 1.0364 | 1.0255 |
| resmlp_12_224 | 128 | 0.9999 | 1.0008 | 0.6954 | 1.2124 | 0.9609 | 0.9963 |
| tnt_s_patch16_224 | 128 | 0.9999 | 0.9992 | 0.0 | 0.0 | 0.0 | 1.5102 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 6.3573 | 25.5488 | nan | 693.2762 | 107.1353 | 98.876 |
| twins_pcpvt_base | 64 | 2.2579 | 10.4934 | 18.7858 | nan | 81.3733 | 79.5034 |
| swin_base_patch4_window7_224 | 64 | 2.7827 | 10.0891 | nan | nan | 77.062 | 74.5512 |
| mobilevit_s | 64 | 1.703 | 5.9962 | nan | nan | 76.1518 | 75.7466 |
| xcit_large_24_p8_224 | 5 | 2.9019 | 13.7927 | 26.6637 | nan | 75.5903 | 72.0025 |
| pnasnet5large | 16 | 4.6679 | 18.9357 | nan | 353.2851 | 70.5357 | 66.4478 |
| cait_m36_384 | 4 | 2.9894 | 14.7506 | nan | nan | 61.5122 | 59.328 |
| coat_lite_mini | 128 | 1.013 | 4.0336 | 6.4982 | 96.3216 | 59.3632 | 58.8498 |
| dm_nfnet_f0 | 128 | 2.0457 | 6.4838 | nan | 153.7786 | 58.6937 | 57.5822 |
| resnest101e | 64 | 3.2146 | 12.9223 | nan | 267.027 | 56.3723 | 53.7663 |
| jx_nest_base | 32 | 1.7481 | 7.5363 | 12.1256 | nan | 52.6756 | 51.4597 |
| res2net101_26w_4s | 64 | 3.2605 | 13.7319 | 23.3002 | 253.6876 | 51.1642 | 48.9136 |
| eca_halonext26ts | 128 | 1.3993 | 4.4673 | nan | nan | 50.6306 | 48.9852 |
| res2net50_14w_8s | 128 | 2.7693 | 12.1849 | nan | 261.3981 | 46.8814 | 46.5081 |
| poolformer_m36 | 64 | 1.7062 | 6.2381 | nan | nan | 45.1289 | 43.4322 |
| nfnet_l0 | 128 | 1.7637 | 6.1161 | nan | 131.5811 | 45.1274 | 44.8345 |
| convnext_base | 64 | 1.2947 | 5.3316 | nan | nan | 44.5987 | 43.3854 |
| sebotnet33ts_256 | 64 | 1.632 | 5.4351 | nan | nan | 40.1668 | 39.4554 |
| gmlp_s16_224 | 128 | 1.0642 | 5.1779 | nan | 155.1597 | 40.0289 | 38.0463 |
| dpn107 | 32 | 4.1123 | 11.5639 | 35.4907 | 175.474 | 38.3657 | 35.4004 |
| volo_d1_224 | 64 | 1.2775 | 6.1896 | 9.9177 | nan | 36.323 | 34.9135 |
| crossvit_9_240 | 128 | 1.4657 | 6.4582 | 10.1939 | 170.6372 | 36.0901 | 34.646 |
| gluon_xception65 | 32 | 1.9437 | 9.0429 | nan | 155.6789 | 34.7923 | 32.3923 |
| fbnetv3_b | 128 | 3.0963 | 9.6243 | 24.6859 | 247.2532 | 34.5214 | 32.4389 |
| eca_botnext26ts_256 | 128 | 1.3978 | 4.3179 | nan | nan | 33.903 | 32.8771 |
| tf_mixnet_l | 128 | 5.7046 | 11.1915 | nan | 153.2379 | 31.8076 | 30.5494 |
| gluon_inception_v3 | 128 | 1.6613 | 6.8425 | nan | 151.4115 | 31.7166 | 30.7371 |
| inception_v3 | 128 | 1.591 | 7.1322 | nan | 151.1763 | 31.6961 | 30.0579 |
| ghostnet_100 | 128 | 2.8783 | 8.1307 | 12.0476 | 161.5354 | 31.6664 | 29.6582 |
| adv_inception_v3 | 128 | 1.6238 | 6.8117 | nan | 149.8725 | 31.632 | 29.9885 |
| gmixer_24_224 | 128 | 1.1508 | 5.9707 | nan | 134.6312 | 31.1634 | 29.8255 |
| mixnet_l | 128 | 5.4075 | 11.2514 | nan | 153.4172 | 30.6351 | 28.851 |
| botnet26t_256 | 128 | 1.2816 | 3.6617 | 8.5146 | nan | 30.1481 | 29.9629 |
| dla102 | 128 | 1.8285 | 7.7283 | nan | 188.403 | 29.6709 | 27.9881 |
| swsl_resnext101_32x16d | 32 | 1.7746 | 7.4635 | nan | 131.488 | 29.0141 | 27.8372 |
| convit_base | 64 | 1.0143 | 4.4818 | nan | nan | 27.2877 | 26.447 |
| res2next50 | 128 | 1.6027 | 6.7226 | nan | 164.0915 | 27.2612 | 25.6634 |
| rexnet_100 | 128 | 1.8869 | 6.4144 | nan | 146.0056 | 25.7268 | 24.1553 |
| tinynet_a | 128 | 2.1638 | 6.6347 | 17.4858 | 146.6159 | 24.952 | 23.5223 |
| tf_efficientnet_b0 | 128 | 1.7899 | 6.0125 | nan | 127.7474 | 22.6514 | 21.2568 |
| mixer_b16_224 | 128 | 0.6191 | 2.5867 | nan | 70.2752 | 22.2787 | 21.9526 |
| cspdarknet53 | 64 | 2.2432 | 6.401 | 16.6794 | 121.2672 | 21.8544 | 20.8219 |
| resmlp_12_224 | 128 | 0.5681 | 2.2815 | 3.9742 | 32.7361 | 21.4944 | 20.0537 |
| visformer_small | 128 | 0.8674 | 3.3509 | 5.2329 | nan | 21.2694 | 20.9008 |
| convmixer_768_32 | 32 | 1.1672 | 5.0823 | nan | nan | 21.24 | 19.5919 |
| fbnetc_100 | 128 | 2.029 | 5.5733 | 15.4518 | 111.8708 | 21.1037 | 19.7739 |
| spnasnet_100 | 128 | 1.982 | 5.5606 | 15.903 | 107.8579 | 20.6115 | 19.6682 |
| pit_b_224 | 64 | 0.9112 | 3.9477 | nan | 94.8251 | 19.9032 | 19.199 |
| beit_base_patch16_224 | 64 | 1.0879 | 4.2344 | nan | nan | 19.3724 | 18.437 |
| mobilenetv3_large_100 | 128 | 1.565 | 4.6789 | 12.0067 | 119.7012 | 19.1999 | 19.1213 |
| deit_base_distilled_patch16_224 | 64 | 0.8339 | 3.4479 | 5.7338 | 75.6456 | 18.9902 | 18.3684 |
| vit_base_patch16_224 | 64 | 0.8172 | 3.363 | 5.4897 | 72.0475 | 18.939 | 17.8768 |
| mnasnet_100 | 128 | 1.5692 | 4.4416 | 12.2276 | 88.0319 | 18.5839 | 16.7329 |
| mobilenetv2_100 | 128 | 1.5973 | 4.9662 | 11.7876 | 101.5904 | 18.1763 | 17.2542 |
| repvgg_a2 | 128 | 2.0342 | 5.2028 | 14.1049 | 154.4154 | 17.8437 | 17.0276 |
| regnety_002 | 128 | 1.5833 | 4.6022 | 11.3593 | 94.3934 | 17.1616 | 16.2232 |
| gernet_l | 128 | 1.9982 | 5.204 | 13.7079 | 89.1493 | 17.094 | 16.5453 |
| selecsls42b | 128 | 0.7685 | 3.0317 | 4.8776 | 75.4265 | 15.7883 | 14.5892 |
| lcnet_050 | 128 | 0.9444 | 2.839 | 6.6192 | 69.2402 | 13.7266 | 12.0846 |
| ese_vovnet19b_dw | 128 | 0.9391 | 2.487 | 6.1181 | 52.8025 | 12.1538 | 11.6456 |
| tnt_s_patch16_224 | 128 | 1.6686 | 8.1997 | nan | nan | nan | 32.7806 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 1.6177 | 1.5612 | 1.6333 |
| tinynet_a | 128 | 0.9942 | 0.7796 | 0.2616 | 0.9898 | 1.351 | 1.5843 |
| rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2619 | 1.4738 |
| tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.2059 | 1.3819 |
| pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1917 | 1.1877 | 1.3424 |
| mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792
| 1.3591 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1752 | 1.2828 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | nan | 1.1378 | 1.2737 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1376 | 1.2529 | | nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.7757 | 1.1264 | 1.3578 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1133 | 1.1802 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.0689 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 1.2304 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.9926 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 1.2337 | 0.9925 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3132 | nan | 0.9882 | 1.0887 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 0.9853 | 1.1265 | | mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 1.4726 | 0.985 | 1.0538 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9837 | 1.0658 | | gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 1.0153 | 0.9766 | 0.9827 | | tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9765 | 1.1445 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | 0.3296 | nan | 0.9659 | 1.0598 | | dla102 | 128 | 0.9831 | 0.917 | nan | 0.953 | 0.9633 | 1.0419 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 0.952 | 1.0925 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9468 | 1.1098 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.7593 | 0.9435 | 1.0967 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.942 | 0.988 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9408 | 1.0412 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9382 | 0.993 | | hrnet_w18 | 128 
| 0.9954 | 0.9252 | nan | 0.8647 | 0.9379 | 1.0122 | | jx_nest_base | 32 | 1.0002 | 0.8966 | 0.2864 | nan | 0.9348 | 1.0603 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9325 | 0.9919 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9285 | 1.0154 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.9152 | 0.9655 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0634 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0636 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9078 | 1.0156 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9068 | 1.0516 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9065 | 1.0615 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.894 | 0.9058 | 0.9905 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9049 | 0.9968 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.994 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.899 | 1.0046 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8932 | 0.9946 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8821 | 1.0206 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8617 | 1.0396 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.8578 | 1.0369 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8526 | 1.0752 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.9856 | 0.8213 | 1.0246 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.8189 | 0.9416 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | 1.3763 | 0.8169 | 0.8253 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9926 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.657 | 0.7684 | 0.9902 | | convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.7449 | 
0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8657 | 0.282 | 1.1222 | 0.6742 | 0.9001 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8633 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 365.2449 | 365.1385 | nan | nan | 339.3262 | 339.6721 | | hrnet_w18 | 128 | 416.2144 | 417.6109 | nan | 327.4556 | 294.0177 | 301.3445 | | convnext_base | 64 | 262.2177 | 262.4688 | nan | nan | 252.9346 | 255.2069 | | pnasnet5large | 16 | 289.3786 | 290.4498 | nan | 265.2668 | 239.0365 | 242.1209 | | tf_mixnet_l | 128 | 256.8103 | 284.257 | nan | 231.0925 | 212.9444 | 213.3224 | | swin_base_patch4_window7_224 | 64 | 237.0253 | 241.8202 | nan | nan | 210.306 | 211.9326 | | mixnet_l | 128 | 247.2809 | 274.9755 | nan | 221.7176 | 207.3362 | 207.2414 | | swsl_resnext101_32x16d | 32 | 218.9742 | 219.1772 | nan | 197.2857 | 197.7743 | 204.6102 | | dla102 | 128 | 269.4547 | 269.2504 | nan | 209.6874 | 194.6141 | 196.5766 | | cait_m36_384 | 4 | 216.2917 | 216.1503 | nan | nan | 178.4078 | 181.6447 | | resnest101e | 64 | 229.9118 | 228.9428 | nan | 196.6649 | 175.1415 | 180.9784 | | dm_nfnet_f0 | 128 | 205.6469 | 206.1503 | nan | 180.8949 | 173.4391 | 177.5319 | | inception_v3 | 128 | 226.5828 | 227.4265 | nan | 200.5104 | 170.7439 | 173.0182 | | gluon_inception_v3 | 128 | 226.8462 | 227.0067 | nan | 200.6569 | 170.6275 | 172.9726 | | adv_inception_v3 | 128 | 226.7251 | 226.8467 | nan | 200.3614 | 170.5792 | 173.0214 | | res2net50_14w_8s | 128 | 229.3045 | 
229.4047 | nan | 182.9485 | 168.9918 | 173.8081 | | gluon_xception65 | 32 | 182.6107 | 182.985 | nan | 168.8993 | 167.8716 | 169.613 | | convit_base | 64 | 196.4105 | 196.5373 | nan | nan | 162.0422 | 158.7674 | | res2next50 | 128 | 206.5363 | 206.557 | nan | 175.1777 | 157.663 | 162.1321 | | dpn107 | 32 | 190.9098 | 192.2247 | 234.6927 | 178.0332 | 153.7226 | 152.0514 | | nfnet_l0 | 128 | 176.1781 | 223.2454 | nan | 158.9809 | 153.1947 | 157.6091 | | gernet_l | 128 | 165.1697 | 165.3888 | 193.3319 | 146.2431 | 149.5091 | 149.861 | | poolformer_m36 | 64 | 174.2403 | 174.5895 | nan | nan | 149.4633 | 151.7195 | | mixer_b16_224 | 128 | 158.8342 | 158.9143 | nan | 176.6859 | 148.3098 | 149.4129 | | coat_lite_mini | 128 | 191.3531 | 191.1886 | 226.1053 | 174.9854 | 142.1106 | 143.5909 | | pit_b_224 | 64 | 158.4075 | 158.51 | nan | 153.2859 | 133.6352 | 134.8735 | | eca_halonext26ts | 128 | 169.0584 | 214.583 | nan | nan | 133.2252 | 134.024 | | eca_botnext26ts_256 | 128 | 163.2987 | 208.4732 | nan | nan | 127.2781 | 128.8657 | | gmlp_s16_224 | 128 | 151.983 | 151.9504 | nan | 139.0381 | 125.2074 | 126.4829 | | res2net101_26w_4s | 64 | 151.8315 | 152.2954 | 196.6574 | 136.8084 | 123.5716 | 127.8254 | | visformer_small | 128 | 128.1312 | 128.1133 | 160.7961 | nan | 123.1207 | 127.1849 | | fbnetv3_b | 128 | 162.4823 | 163.1936 | 205.7683 | 126.7176 | 122.0562 | 120.9422 | | botnet26t_256 | 128 | 152.2058 | 152.3965 | 190.1744 | nan | 117.3797 | 117.9923 | | twins_pcpvt_base | 64 | 137.2842 | 137.0642 | 182.8962 | nan | 116.6342 | 119.2926 | | beit_base_patch16_224 | 64 | 128.3316 | 130.5795 | nan | nan | 115.9751 | 116.6485 | | gmixer_24_224 | 128 | 146.1359 | 175.0908 | nan | 135.0285 | 115.0529 | 116.5205 | | volo_d1_224 | 64 | 153.3902 | 154.3301 | 191.0555 | nan | 111.3824 | 112.6672 | | vit_base_patch16_224 | 64 | 119.7254 | 119.7896 | 155.3419 | 125.121 | 110.5162 | 111.012 | | deit_base_distilled_patch16_224 | 64 | 120.3868 | 119.8631 | 156.4319 | 122.3402 | 
109.5755 | 111.5884 | | repvgg_a2 | 128 | 127.3663 | 127.1853 | 146.2748 | 108.0181 | 104.7928 | 104.7586 | | tf_efficientnet_b0 | 128 | 133.8864 | 167.4238 | nan | 112.6635 | 104.3814 | 103.2331 | | xcit_large_24_p8_224 | 5 | 134.915 | 136.4061 | 173.3407 | nan | 101.5258 | 104.2685 | | cspdarknet53 | 64 | 130.4266 | 131.4319 | 169.5797 | 106.2495 | 100.8762 | 100.0489 | | jx_nest_base | 32 | 121.169 | 122.0422 | 164.5882 | nan | 98.4146 | 98.4737 | | mobilevit_s | 64 | 116.9568 | 150.3464 | nan | nan | 97.6438 | 97.9444 | | rexnet_100 | 128 | 119.2966 | 142.2692 | nan | 99.8158 | 95.4623 | 94.9091 | | fbnetc_100 | 128 | 123.3901 | 123.8196 | 150.8029 | 95.7084 | 95.4515 | 94.1416 | | tinynet_a | 128 | 110.4414 | 136.9773 | 171.1432 | 92.991 | 89.254 | 88.5431 | | sebotnet33ts_256 | 64 | 114.3632 | 138.571 | nan | nan | 88.0929 | 87.7369 | | spnasnet_100 | 128 | 105.9349 | 106.5409 | 134.2471 | 83.1704 | 82.3086 | 81.1135 | | ese_vovnet19b_dw | 128 | 99.5586 | 99.8998 | 131.0212 | 84.7902 | 78.4098 | 78.2165 | | mnasnet_100 | 128 | 98.6364 | 98.9977 | 121.6606 | 76.1373 | 75.8107 | 74.2815 | | crossvit_9_240 | 128 | 98.4784 | 98.4866 | 129.5346 | 94.5777 | 74.3851 | 75.6689 | | resmlp_12_224 | 128 | 71.1369 | 71.0422 | 102.5851 | 58.6911 | 74.17 | 71.4085 | | selecsls42b | 128 | 89.6311 | 89.6902 | 109.8468 | 73.7691 | 70.6494 | 71.4626 | | mobilenetv2_100 | 128 | 97.7231 | 98.1639 | 133.4936 | 73.3339 | 70.5206 | 69.7844 | | mobilenetv3_large_100 | 128 | 85.6361 | 85.9117 | 108.116 | 64.2955 | 61.7982 | 61.5247 | | ghostnet_100 | 128 | 114.5688 | 117.7893 | 138.6022 | 88.6982 | 61.4212 | 62.6999 | | regnety_002 | 128 | 53.5535 | 51.825 | 60.422 | 54.1692 | 34.9779 | 39.4369 | | lcnet_050 | 128 | 38.3678 | 38.5738 | 47.7341 | 27.2099 | 22.3274 | 22.6081 | | tnt_s_patch16_224 | 128 | 471.1638 | 471.3024 | nan | nan | nan | 312.0398 | 
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/YdvY9s4.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/iK7XPEk.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/bmruZvF.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of a forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
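The filtering and normalization described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual script: the function name `geomean_speedup`, the dictionary layout, and the decisions to treat any status beginning with `pass` as passing and to skip non-positive speedups are all assumptions. The sample values are inductor numbers reported for torchbench in this report.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of speedups over models that pass the accuracy check.

    `speedups` maps model name -> speedup relative to native pytorch.
    `accuracy` maps model name -> accuracy status string.
    Both layouts are hypothetical, chosen for this sketch.
    """
    kept = [
        s
        for name, s in speedups.items()
        # Assumption: statuses starting with "pass" count as passing.
        if accuracy.get(name, "").startswith("pass")
        # Assumption: 0.0 marks a failed perf run and cannot be log-averaged.
        and s > 0
    ]
    if not kept:
        return float("nan")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


inductor_speedups = {
    "timm_vision_transformer": 2.6071,
    "resnet18": 1.8482,
    "resnext50_32x4d": 2.0648,
    "resnet50_quantized_qat": 0.9652,
}
inductor_accuracy = {
    "timm_vision_transformer": "pass",
    "resnet18": "pass",
    "resnext50_32x4d": "pass",
    "resnet50_quantized_qat": "fail_accuracy",  # dropped before averaging
}

result = geomean_speedup(inductor_speedups, inductor_accuracy)
print(f"{result:.2f}x")
```

Averaging in log space avoids overflow on long lists and makes it explicit that a single 0.0 entry would otherwise collapse the whole mean to zero.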

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 54/56 | 91%, 43/47  | 98%, 60/61  |
|       aot_eager        | 91%, 51/56 | 91%, 43/47  | 98%, 60/61  |
|     aot_cudagraphs     | 73%, 41/56 | 34%, 16/47  | 46%, 28/61  |
|    nvprims_nvfuser     | 75%, 42/56 | 57%, 27/47  | 67%, 41/61  |
|        inductor        | 82%, 46/56 | 83%, 39/47  | 93%, 57/61  |
| inductor_no_cudagraphs | 89%, 50/56 | 87%, 41/47  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.01x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.11x    |    1.00x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.04x    |    1.14x    |
|        inductor        |   1.47x    |    1.23x    |    1.23x    |
| inductor_no_cudagraphs |   1.23x    |    1.22x    |    1.23x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.80    |    2.47     |    2.00     |
|       aot_eager        |    5.51    |    7.78     |    7.13     |
|     aot_cudagraphs     |    7.44    |    15.39    |    13.04    |
|    nvprims_nvfuser     |   65.48    |   100.68    |   147.36    |
|        inductor        |   30.47    |    33.00    |    35.82    |
| inductor_no_cudagraphs |   29.78    |    28.80    |    34.59    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.92x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.91x    |    1.02x    |    0.95x    |
|        inductor        |   0.84x    |    0.75x    |    0.98x    |
| inductor_no_cudagraphs |   0.98x    |    1.00x    |    1.09x    |
+------------------------+------------+-------------+-------------+

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the two most recent reports that actually run the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288

Previous report name: /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_float32_102

Passrate diff
~~~
+------------------------+-------------+------------+------------+
| compiler               | suite       | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
| inductor               | torchbench  | 84%, 47/56 | 84%, 47/56 |
| inductor               | huggingface | 91%, 39/43 | 83%, 39/47 |
| inductor               | timm_models | 93%, 57/61 | 93%, 57/61 |
| inductor_no_cudagraphs | torchbench  | 91%, 51/56 | 89%, 50/56 |
| inductor_no_cudagraphs | huggingface | 91%, 39/43 | 87%, 41/47 |
| inductor_no_cudagraphs | timm_models | 93%, 57/61 | 93%, 57/61 |
+------------------------+-------------+------------+------------+
~~~
Geometric mean speedup diff
~~~
+------------------------+-------------+------------+-----------+
| compiler               | suite       | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
| inductor               | torchbench  | 1.52x      | 1.47x     |
| inductor               | huggingface | 1.30x      | 1.25x     |
| inductor               | timm_models | 1.24x      | 1.23x     |
| inductor_no_cudagraphs | torchbench  | 1.23x      | 1.23x     |
| inductor_no_cudagraphs | huggingface | 1.23x      | 1.22x     |
| inductor_no_cudagraphs | timm_models | 1.24x      | 1.23x     |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+---------------------------------+---------------+------------------------+
| suite       | name                            | inductor      | inductor_no_cudagraphs |
+-------------+---------------------------------+---------------+------------------------+
| torchbench  | tacotron2                       | fail_to_run   | pass                   |
| torchbench  | functorch_dp_cifar10            | fail_to_run   | fail_to_run            |
| torchbench  | hf_BigBird                      | fail_to_run   | fail_to_run            |
| torchbench  | hf_Longformer                   | fail_to_run   | fail_to_run            |
| torchbench  | moco                            | fail_to_run   | fail_to_run            |
| torchbench  | resnet50_quantized_qat          | fail_accuracy | fail_accuracy          |
| torchbench  | mobilenet_v2_quantized_qat      | fail_accuracy | fail_accuracy          |
| torchbench  | vision_maskrcnn                 | 0.0000        | 0.0000                 |
| huggingface | DebertaV2ForQuestionAnswering   | fail_to_run   | pass                   |
| huggingface | PLBartForConditionalGeneration  | fail_to_run   | fail_to_run            |
| huggingface | MBartForConditionalGeneration   | fail_to_run   | fail_to_run            |
| huggingface | AllenaiLongformerBase           | fail_to_run   | fail_to_run            |
| huggingface | DebertaV2ForMaskedLM            | fail_to_run   | fail_to_run            |
| huggingface | BlenderbotForCausalLM           | 0.0000        | 0.0000                 |
| timm_models | deit_base_distilled_patch16_224 | pass          | fail_accuracy          |
| timm_models | convit_base                     | fail_to_run   | fail_to_run            |
| timm_models | fbnetv3_b                       | fail_accuracy | fail_accuracy          |
| timm_models | resnest101e                     | fail_accuracy | fail_accuracy          |
+-------------+---------------------------------+---------------+------------------------+
~~~
Performance speedup warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | soft_actor_critic             | 1.4265   | 0.9323                 |
| torchbench  | nvidia_deeprecommender        | 0.9047   | 0.9645                 |
| torchbench  | dlrm                          | 0.0      | 1.1453                 |
| torchbench  | hf_GPT2_large                 | 0.0      | 1.4677                 |
| torchbench  | hf_T5                         | 0.0      | 1.5654                 |
| torchbench  | tacotron2                     | 0.0      | 0.9184                 |
| torchbench  | functorch_dp_cifar10          | 0.0      | 0.0                    |
| torchbench  | hf_BigBird                    | 0.0      | 0.0                    |
| torchbench  | hf_Longformer                 | 0.0      | 0.0                    |
| torchbench  | moco                          | 0.0      | 0.0                    |
| huggingface | DebertaV2ForMaskedLM          | 0.9984   | 0.8567                 |
| huggingface | DebertaV2ForQuestionAnswering | 0.9236   | 0.9097                 |
| huggingface | TrOCRForCausalLM              | 0.0      | 1.027                  |
| huggingface | BlenderbotForCausalLM         | 0.0      | 1.0114                 |
| huggingface | AllenaiLongformerBase         | 0.0      | 0.0                    |
| timm_models | tnt_s_patch16_224             | 0.0      | 1.5083                 |
+-------------+-------------------------------+----------+------------------------+
~~~
Compilation latency (sec) warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | yolov3                        | 359.391  | 362.3109               |
| torchbench  | timm_efficientdet             | 124.0955 | 121.8376               |
| huggingface | DebertaV2ForQuestionAnswering | 161.7475 | 46.9759                |
| huggingface | DebertaV2ForMaskedLM          | 161.278  | 47.8922                |
| huggingface | XLNetLMHeadModel              | 126.2002 | 121.323                |
+-------------+-------------------------------+----------+------------------------+
~~~
Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0023                 |
| torchbench  | hf_Albert                               | 0.8836   | 1.2215                 |
| torchbench  | mobilenet_v3_large                      | 0.8829   | 0.896                  |
| torchbench  | hf_T5_large                             | 0.8738   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8621   | 1.031                  |
| torchbench  | densenet121                             | 0.857    | 1.0006                 |
| torchbench  | resnet50                                | 0.8564   | 0.9343                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | pytorch_unet                            | 0.8484   | 1.0138                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.1229                 |
| torchbench  | hf_Bart                                 | 0.8325   | 1.1284                 |
| torchbench  | resnext50_32x4d                         | 0.8303   | 0.8352                 |
| torchbench  | BERT_pytorch                            | 0.826    | 1.0815                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7609   | 0.9526                 |
| torchbench  | timm_vision_transformer                 | 0.7507   | 0.8214                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8335                 |
| torchbench  | hf_Bert                                 | 0.7061   | 1.0275                 |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | LearningToPaint                         | 0.6881   | 0.913                  |
| torchbench  | vgg16                                   | 0.6637   | 0.9553                 |
| torchbench  | hf_DistilBert                           | 0.6595   | 0.9466                 |
| torchbench  | hf_Reformer                             | 0.577    | 1.0026                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4867   | 0.6781                 |
| torchbench  | pytorch_struct                          | 0.4213   | 0.4334                 |
| torchbench  | dcgan                                   | 0.2564   | 0.2576                 |
| torchbench  | dlrm                                    | nan      | 0.7306                 |
| huggingface | YituTechConvBert                        | 0.894    | 0.9822                 |
| huggingface | DistillGPT2                             | 0.8939   | 1.0108                 |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4039                 |
| huggingface | PegasusForConditionalGeneration         | 0.8637   | 1.0262                 |
| huggingface | M2M100ForConditionalGeneration          | 0.8492   | 1.0061                 |
| huggingface | AlbertForMaskedLM                       | 0.842    | 1.3737                 |
| huggingface | PLBartForCausalLM                       | 0.8367   | 1.0581                 |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.1049                 |
| huggingface | T5Small                                 | 0.8215   | 1.1049                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | ElectraForCausalLM                      | 0.7929   | 0.9036                 |
| huggingface | MBartForConditionalGeneration           | 0.7896   | 0.9837                 |
| huggingface | MT5ForConditionalGeneration             | 0.7785   | 0.9242                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.9692                 |
| huggingface | BartForConditionalGeneration            | 0.7734   | 0.958                  |
| huggingface | MegatronBertForQuestionAnswering        | 0.7709   | 1.0379                 |
| huggingface | MegatronBertForCausalLM                 | 0.7673   | 1.0153                 |
| huggingface | MBartForCausalLM                        | 0.7326   | 0.9478                 |
| huggingface | RobertaForQuestionAnswering             | 0.7273   | 1.0273                 |
| huggingface | BertForQuestionAnswering                | 0.7273   | 1.0273                 |
| huggingface | LayoutLMForSequenceClassification       | 0.7189   | 1.0294                 |
| huggingface | BartForCausalLM                         | 0.7149   | 0.9466                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.7147   | 0.8647                 |
| huggingface | ElectraForQuestionAnswering             | 0.7054   | 1.0297                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977   | 0.946                  |
| huggingface | LayoutLMForMaskedLM                     | 0.695    | 0.9772                 |
| huggingface | BertForMaskedLM                         | 0.6945   | 0.9772                 |
| huggingface | RobertaForCausalLM                      | 0.6942   | 0.9771                 |
| huggingface | CamemBert                               | 0.6942   | 0.9746                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.675    | 0.9168                 |
| huggingface | DistilBertForQuestionAnswering          | 0.6589   | 0.9118                 |
| huggingface | DistilBertForMaskedLM                   | 0.6509   | 0.9194                 |
| huggingface | Reformer                                | 0.573    | 1.0028                 |
| huggingface | DebertaV2ForMaskedLM                    | 0.5682   | 0.9491                 |
| huggingface | MobileBertForMaskedLM                   | 0.4951   | 0.6649                 |
| huggingface | DebertaV2ForQuestionAnswering           | 0.4735   | 0.984                  |
| huggingface | MobileBertForQuestionAnswering          | 0.4145   | 0.535                  |
| huggingface | DebertaForMaskedLM                      | 0.3862   | 1.0347                 |
| huggingface | DebertaForQuestionAnswering             | 0.2902   | 1.1339                 |
| huggingface | BlenderbotForCausalLM                   | nan      | 0.8509                 |
| timm_models | selecsls42b                             | 0.899    | 1.0046                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8932   | 0.9946                 |
| timm_models | res2net50_14w_8s                        | 0.8822   | 1.0206                 |
| timm_models | regnety_002                             | 0.8617   | 1.0396                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.9622                 |
| timm_models | convnext_base                           | 0.8578   | 1.0369                 |
| timm_models | pit_b_224                               | 0.8526   | 1.0752                 |
| timm_models | coat_lite_mini                          | 0.8212   | 1.0246                 |
| timm_models | sebotnet33ts_256                        | 0.8189   | 0.9416                 |
| timm_models | resmlp_12_224                           | 0.8169   | 0.8253                 |
| timm_models | gernet_l                                | 0.7928   | 0.9926                 |
| timm_models | repvgg_a2                               | 0.7684   | 0.9902                 |
| timm_models | convit_base                             | 0.7449   | 0.9008                 |
| timm_models | crossvit_9_240                          | 0.6742   | 0.9001                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.8633                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
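The four flagging rules listed at the top of this section can be sketched as a single check per model and compiler. This is only an illustration under stated assumptions: the function `warnings_for`, its argument layout, and the flag names are hypothetical, and treating any status that does not start with `pass` as an accuracy failure is a guess about how the report decides. The sample inputs are the DebertaV2ForQuestionAnswering values from the tables above.

```python
SPEEDUP_MIN = 0.95           # flag if speedup < 0.95x
COMPILE_LATENCY_MAX = 120.0  # flag if compilation latency > 120 sec
COMPRESSION_MIN = 0.9        # flag if compression ratio < 0.9


def warnings_for(accuracy, speedup, compile_sec, compression):
    """Return the list of warning categories a single result triggers."""
    flags = []
    # Assumption: anything that is not a "pass..." status is a failure.
    if not accuracy.startswith("pass"):
        flags.append("accuracy")
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_sec > COMPILE_LATENCY_MAX:
        flags.append("compilation_latency")
    # Comparisons with nan are False, so a nan ratio is not flagged here.
    if compression < COMPRESSION_MIN:
        flags.append("compression_ratio")
    return flags


# DebertaV2ForQuestionAnswering, huggingface suite.
inductor_flags = warnings_for("fail_to_run", 0.9236, 161.7475, 0.4735)
no_cudagraphs_flags = warnings_for("pass", 0.9097, 46.9759, 0.984)

print(inductor_flags)
print(no_cudagraphs_flags)
```

Keeping the thresholds as named constants makes it easy to see that the rules are strict inequalities, so a speedup of exactly 0.95x would not be flagged in this sketch.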

Recent Regressions

For each relevant compiler, we compare the two most recent reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288

Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288

Performance speedup regressions
~~~
+------------------------+-------------------+-------------+------------+
| compiler               | name              | prev_status | cur_status |
+------------------------+-------------------+-------------+------------+
| inductor               | dlrm              | 0.96        | 0.0        |
| inductor_no_cudagraphs | soft_actor_critic | 0.9507      | 0.9323     |
+------------------------+-------------------+-------------+------------+
~~~
Compilation latency (sec) regressions
~~~
+------------------------+-------------------+-------------+------------+
| compiler               | name              | prev_status | cur_status |
+------------------------+-------------------+-------------+------------+
| inductor_no_cudagraphs | timm_efficientdet | 119.7085    | 121.8376   |
+------------------------+-------------------+-------------+------------+
~~~
No regressions found.

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288

Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288

Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288

No regressions found.
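The regression rule described in this section (a model counts only if it was unflagged in the previous report and is flagged in the current one) can be sketched as below. The helper `newly_flagged` and the dictionary layout are assumptions made for illustration, not the dashboard's code. The dlrm and timm_efficientdet numbers come from the torchbench tables above, while `already_flagged_model` is a made-up entry that shows a model flagged in both reports is not reported again.

```python
def newly_flagged(prev, cur, is_flagged):
    """Models flagged in `cur` that were present and unflagged in `prev`.

    `prev` and `cur` map model name -> metric value (hypothetical layout).
    Returns {name: (prev_value, cur_value)}.
    """
    return {
        name: (prev[name], cur[name])
        for name in cur
        if name in prev and is_flagged(cur[name]) and not is_flagged(prev[name])
    }


# Speedup rule: flagged when below 0.95x.
prev_speedup = {"dlrm": 0.96, "already_flagged_model": 0.90}  # second entry is hypothetical
cur_speedup = {"dlrm": 0.0, "already_flagged_model": 0.91}
speedup_regressions = newly_flagged(prev_speedup, cur_speedup, lambda s: s < 0.95)

# Compilation latency rule: flagged when above 120 seconds.
latency_regressions = newly_flagged(
    {"timm_efficientdet": 119.7085},
    {"timm_efficientdet": 121.8376},
    lambda sec: sec > 120.0,
)

print(speedup_regressions)
print(latency_regressions)
```

Passing the threshold as a predicate lets the same helper serve the speedup, latency, and compression-ratio checks without duplicating the comparison logic.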

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.001 | 1.0326 | 2.2901 | 0.8015 | 5.1793 | 1.2944 | | timm_efficientdet | 1 | 0.9819 | 0.8856 | 1.8425 | 0.7875 | 4.3397 | 1.4957 | | timm_vision_transformer | 8 | 0.9824 | 0.9471 | 1.5052 | 0.6609 | 2.6071 | 1.422 | | drq | 1 | 1.01 | 0.8714 | 1.6282 | 0.7572 | 2.4749 | 1.0657 | | BERT_pytorch | 16 | 1.0084 | 0.9013 | 1.1025 | 0.9413 | 2.0737 | 2.054 | | resnext50_32x4d | 8 | 1.0017 | 1.1059 | 1.2627 | 0.8384 | 2.0648 | 1.2118 | | mobilenet_v3_large | 32 | 1.0046 | 1.1145 | 1.0307 | 0.8543 | 1.9711 | 1.3386 | | dcgan | 32 | 0.9885 | 1.0156 | 1.2487 | 0.8056 | 1.9204 | 1.0233 | | resnet18 | 16 | 1.0018 | 1.1157 | 1.1603 | 0.8819 | 1.8482 | 1.2455 | | pytorch_struct | 200 | 0.9915 | 0.7629 | 0.8713 | 0.763 | 1.7957 | 1.1358 | | lennard_jones | 1000 | 0.9582 | 0.8632 | 1.0251 | 0.7044 | 1.7762 | 0.9581 | | squeezenet1_1 | 32 | 0.9986 | 1.0092 | 1.0319 | 0.878 | 1.7403 | 1.2611 | | shufflenet_v2_x1_0 | 128 | 1.0007 | 1.0495 | 0.8026 | 0.8864 | 1.6629 | 1.4319 | | hf_T5_large | 2 | 1.0255 | 0.9086 | 0.0 | 0.0 | 1.6489 | 1.6153 | | hf_Albert | 8 | 1.0007 | 0.9988 | 0.7534 | 1.5526 | 1.6455 | 1.6406 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9966 | 1.0198 | 1.2137 | 0.8559 | 1.6382 | 1.4758 | | timm_resnest | 32 | 0.9995 | 1.0024 | 0.8055 | 1.1649 | 1.5185 | 1.4518 | | hf_GPT2 | 4 | 1.0081 | 0.9824 | 0.7378 | 0.4034 | 1.5113 | 1.5018 | | mnasnet1_0 | 32 | 1.0002 | 1.0991 | 0.865 | 0.935 | 1.4613 | 1.2734 | | speech_transformer | 32 | 0.9934 | 0.8845 | 1.3547 | 0.7547 | 1.4353 | 1.4495 | | mobilenet_v2 | 96 | 0.9997 | 0.999 | 0.7309 | 1.334 | 1.4289 | 1.4082 
|
| soft_actor_critic | 256 | 0.961 | 0.7971 | 1.0559 | 0.6908 | 1.4265 | 0.9323 |
| fastNLP_Bert | 6 | 0.9976 | 0.9782 | 0.7518 | 1.1497 | 1.4196 | 1.388 |
| timm_efficientnet | 32 | 0.9548 | 0.8117 | 0.6934 | 0.8292 | 1.3493 | 1.1989 |
| pytorch_stargan | 16 | 0.9987 | 1.0752 | 0.9376 | 0.0 | 1.2663 | 1.2247 |
| LearningToPaint | 96 | 1.0031 | 1.0593 | 0.873 | 0.9583 | 1.247 | 1.2008 |
| resnet152 | 32 | 1.0012 | 1.0508 | 0.7963 | 0.9136 | 1.2307 | 1.2022 |
| hf_Bert | 4 | 1.02 | 0.9958 | 0.7284 | 0.8728 | 1.2029 | 1.1853 |
| hf_Bart | 4 | 1.0127 | 0.9725 | 0.7354 | 0.8589 | 1.2022 | 1.1962 |
| resnet50 | 32 | 0.9989 | 0.9882 | 0.7609 | 0.9944 | 1.2021 | 1.1691 |
| pytorch_unet | 1 | 0.9997 | 0.2824 | 0.0 | 0.0 | 1.1989 | 1.1864 |
| timm_nfnet | 128 | 0.9995 | 0.9999 | 0.0 | 1.1337 | 1.1903 | 1.1575 |
| hf_DistilBert | 8 | 1.0004 | 0.9565 | 0.6864 | 0.5267 | 1.1757 | 1.1828 |
| vgg16 | 64 | 0.9997 | 0.9989 | 0.859 | 0.9979 | 1.1708 | 1.1653 |
| alexnet | 128 | 0.9985 | 0.9978 | 0.8046 | 1.0051 | 1.1626 | 1.164 |
| Super_SloMo | 6 | 0.9998 | 0.2431 | 0.0 | 0.247 | 1.1436 | 1.1292 |
| hf_Reformer | 4 | 0.9981 | 1.0015 | 0.9893 | 0.7359 | 1.1316 | 1.1405 |
| timm_regnet | 32 | 0.9662 | 0.9625 | 0.7804 | 1.0945 | 1.1255 | 1.0914 |
| mobilenet_v2_quantized_qat | 96 | 1.0005 | 0.9769 | 0.0 | 1.4581 | 1.1135 | 1.0423 |
| yolov3 | 16 | 0.9998 | 0.9945 | 0.791 | 1.1519 | 1.0856 | 1.0727 |
| Background_Matting | 4 | 1.0002 | 0.1925 | 0.0 | 0.0 | 1.0833 | 1.0748 |
| attention_is_all_you_need_pytorch | 256 | 0.9998 | 0.9701 | 0.7571 | 0.9395 | 1.0526 | 1.0385 |
| timm_vision_transformer_large | 8 | 1.0 | 0.9941 | 0.0 | 0.0 | 1.0425 | 1.0333 |
| timm_vovnet | 32 | 0.9111 | 0.9065 | 0.7159 | 0.9017 | 1.0052 | 1.0199 |
| tts_angular | 64 | 0.9868 | 0.9602 | 0.9802 | 0.967 | 1.0025 | 1.0047 |
| demucs | 4 | 1.0 | 1.0002 | 0.9999 | 0.9998 | 0.9998 | 0.9997 |
| resnet50_quantized_qat | 32 | 1.0005 | 0.9671 | 0.0 | 1.1521 | 0.9652 | 1.0383 |
| nvidia_deeprecommender | 256 | 0.9992 | 0.9636 | 0.5852 | 0.9761 | 0.9047 | 0.9645 |
| dlrm | 2048 | 1.2071 | 0.0 | 0.0 | 0.0 | 0.0 | 1.1453 |
| hf_GPT2_large | 4 | 0.9998 | 0.9806 | 0.0 | 0.0 | 0.0 | 1.4677 |
| hf_T5 | 8 | 1.0011 | 0.9541 | 0.0 | 1.1631 | 0.0 | 1.5654 |
| tacotron2 | 64 | 0.9626 | 0.8708 | 0.0 | 0.7715 | 0.0 | 0.9184 |
| functorch_dp_cifar10 | 64 | 1.0027 | 1.0289 | 2.1562 | 0.0 | 0.0 | 0.0 |
| hf_BigBird | 2 | 0.9588 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| yolov3 | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | 0.0000 | pass | pass | pass |
| dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass |
| timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass |
| Background_Matting | 4 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| resnet152 | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass |
| hf_Bart | 2 | pass | pass | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass |
| hf_Albert | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| hf_Bert | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| resnet50_quantized_qat | 2 | pass | pass | 0.0000 | pass | fail_accuracy | fail_accuracy |
| mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | 0.0000 | fail_accuracy | fail_accuracy | fail_accuracy |
| vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3 | 16 | 2.9062 | 7.4526 | 10.1644 | 114.0784 | 359.391 | 362.3109 |
| timm_efficientdet | 1 | 19.5258 | 33.4756 | 67.153 | 494.4588 | 124.0955 | 121.8376 |
| hf_T5_large | 2 | 13.8213 | 34.0918 | nan | nan | 106.0912 | 103.5459 |
| mobilenet_v2_quantized_qat | 96 | 1.3203 | 7.3616 | nan | 183.0354 | 83.5192 | 82.9561 |
| resnet50_quantized_qat | 32 | 1.1855 | 7.0724 | nan | 168.1103 | 68.5016 | 68.5988 |
| timm_nfnet | 128 | 2.036 | 6.1023 | nan | 153.6497 | 57.339 | 57.0511 |
| timm_vision_transformer_large | 8 | 2.4377 | 11.3154 | nan | nan | 53.0011 | 51.5963 |
| attention_is_all_you_need_pytorch | 256 | 1.1808 | 5.5194 | 8.9874 | 116.6466 | 41.9014 | 41.254 |
| densenet121 | 4 | 2.174 | 10.1229 | 15.7802 | 169.7616 | 41.4931 | 40.2615 |
| resnet152 | 32 | 2.3957 | 10.7034 | 17.4771 | 192.1728 | 40.6663 | 38.8696 |
| timm_resnest | 32 | 0.5712 | 2.0491 | 3.0474 | 59.6763 | 40.3271 | 39.7968 |
| timm_vision_transformer | 8 | 0.8575 | 3.393 | 4.8105 | 77.314 | 32.36 | 31.676 |
| hf_Bart | 4 | 1.6194 | 6.4801 | 10.0345 | 121.1289 | 29.2399 | 28.327 |
| pytorch_stargan | 16 | 0.4216 | 1.7111 | 2.4119 | nan | 27.1742 | 25.5587 |
| BERT_pytorch | 16 | 1.5596 | 5.8846 | 8.8807 | 84.7862 | 26.4595 | 26.1853 |
| fastNLP_Bert | 6 | 1.5676 | 5.4936 | 8.5329 | 77.4894 | 26.2906 | 24.9376 |
| speech_transformer | 32 | 1.6599 | 6.6508 | 26.4287 | 131.5275 | 23.7993 | 23.1606 |
| timm_regnet | 32 | 2.2326 | 6.6122 | 16.8842 | 113.8304 | 23.163 | 22.4133 |
| timm_efficientnet | 32 | 1.7462 | 5.57 | 13.6037 | 109.5624 | 23.0852 | 22.0122 |
| mobilenet_v3_large | 32 | 0.8974 | 3.9804 | 5.8092 | 100.787 | 22.8962 | 21.9957 |
| pytorch_struct | 200 | 0.2531 | 0.6197 | 1.1509 | 4.4402 | 19.3213 | 18.9017 |
| hf_Bert | 4 | 1.5729 | 5.1707 | 7.7095 | 85.525 | 18.7916 | 18.3062 |
| mnasnet1_0 | 32 | 0.8311 | 3.5229 | 5.2064 | 73.1048 | 18.1847 | 17.6724 |
| Super_SloMo | 6 | 1.0547 | 6.5419 | nan | 56.393 | 18.1718 | 17.63 |
| shufflenet_v2_x1_0 | 128 | 0.9796 | 4.2017 | 6.1678 | 90.0979 | 17.9051 | 16.7399 |
| hf_Reformer | 4 | 1.5772 | 2.6258 | 4.8698 | 14.9941 | 17.5935 | 15.2173 |
| resnet50 | 32 | 0.8414 | 3.8226 | 5.6048 | 79.0801 | 17.4813 | 16.9996 |
| hf_Albert | 8 | 1.2107 | 4.5704 | 7.179 | 111.4697 | 17.4256 | 16.5849 |
| hf_GPT2 | 4 | 1.4656 | 5.1169 | 7.7577 | 61.5296 | 17.1231 | 16.4083 |
| timm_vovnet | 32 | 1.4676 | 3.7947 | 8.945 | 56.636 | 17.0594 | 17.3533 |
| resnext50_32x4d | 8 | 0.9006 | 3.7106 | 5.7349 | 63.4456 | 16.8902 | 16.4538 |
| Background_Matting | 4 | 0.7804 | 7.7201 | nan | nan | 16.1129 | 15.5365 |
| mobilenet_v2 | 96 | 0.8132 | 3.7197 | 5.9522 | 99.9094 | 15.8043 | 15.6617 |
| hf_DistilBert | 8 | 0.6718 | 2.6609 | 4.6757 | 44.3428 | 11.6025 | 11.241 |
| resnet18 | 16 | 0.4342 | 1.4924 | 2.0992 | 29.963 | 10.496 | 10.4965 |
| pytorch_unet | 1 | 0.4489 | 2.7742 | nan | nan | 8.2821 | 8.216 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.402 | 1.5971 | 2.2446 | 31.4726 | 7.8677 | 7.7453 |
| LearningToPaint | 96 | 0.4363 | 1.5611 | 2.383 | 38.7982 | 6.8337 | 6.4148 |
| dcgan | 32 | 0.1775 | 0.363 | 0.5688 | 4.3191 | 5.9095 | 5.6206 |
| drq | 1 | 0.3078 | 0.5222 | 0.7579 | 4.5283 | 3.8262 | 3.0714 |
| squeezenet1_1 | 32 | 0.2312 | 0.6549 | 0.9979 | 4.4075 | 3.8029 | 3.5108 |
| vgg16 | 64 | 0.1838 | 0.4881 | 0.7727 | 2.9264 | 3.3172 | 3.1845 |
| nvidia_deeprecommender | 256 | 0.1978 | 0.3745 | 0.6201 | 5.0962 | 3.1652 | 2.9145 |
| soft_actor_critic | 256 | 0.213 | 0.3049 | 0.5131 | 1.6091 | 3.1632 | 2.6797 |
| alexnet | 128 | 0.1538 | 0.3275 | 0.5545 | 2.9188 | 2.8278 | 2.5849 |
| lennard_jones | 1000 | 0.1443 | 0.2474 | 0.3867 | 1.3724 | 1.9226 | 1.7091 |
| tts_angular | 64 | 0.1761 | 0.2248 | 0.3455 | 1.064 | 1.8298 | 1.6281 |
| demucs | 4 | 0.2931 | 0.3059 | 0.3116 | 0.3163 | 0.2091 | 0.2101 |
| hf_GPT2_large | 4 | 5.2498 | 15.8829 | nan | nan | nan | 43.0507 |
| tacotron2 | 64 | 4.8587 | 14.033 | nan | 45.1257 | nan | 42.8269 |
| hf_T5 | 8 | 2.6475 | 7.5176 | nan | 76.2533 | nan | 26.8939 |
| dlrm | 2048 | 0.4415 | nan | nan | nan | nan | 2.8362 |
| functorch_dp_cifar10 | 64 | 0.3007 | 1.1129 | 1.6749 | nan | nan | nan |
| hf_BigBird | 2 | 3.5921 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.5274 | 1.5274 |
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.226 | 1.4604 | 1.4604 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2635 | 0.988 | 1.3107 | 1.3923 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1743 | 1.2832 |
| timm_efficientdet | 1 | 1.0111 | 0.823 | 0.289 | 1.1343 | 1.1162 | 1.1442 |
| Super_SloMo | 6 | 1.0024 | 0.902 | nan | 0.9454 | 1.1137 | 1.3409 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 1.0823 | 1.1864 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 1.0433 | 1.1066 |
| speech_transformer | 32 | 0.9982 | 0.9772 | 0.2738 | 1.1206 | 1.0376 | 1.0443 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| hf_GPT2 | 4 | 1.0 | 0.906 | 0.3702 | 1.1242 | 0.9703 | 1.1698 |
| timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.7594 | 0.9435 | 1.0968 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9407 | 1.0831 |
| Background_Matting | 4 | 0.9998 | 0.8154 | nan | nan | 0.9342 | 1.0395 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.3341 | 0.8549 | 0.9229 | 1.1042 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9173 | 0.392 | 0.8945 | 0.9173 | 0.9986 |
| resnet152 | 32 | 0.9975 | 0.9153 | 0.3424 | 0.8736 | 0.9068 | 0.9672 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 |
| timm_resnest | 32 | 0.9927 | 0.88 | 0.3236 | 0.7926 | 0.8982 | 1.0023 |
| hf_Albert | 8 | 1.0 | 0.949 | 0.2846 | 1.062 | 0.8836 | 1.2215 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8098 | 0.8829 | 0.896 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8738 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.857 | 1.0006 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8564 | 0.9343 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8531 | 0.8659 |
| pytorch_unet | 1 | 0.9985 | 0.8222 | nan | nan | 0.8484 | 1.0138 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 1.2124 | 0.8354 | 1.1229 |
| hf_Bart | 4 | 1.0 | 0.8779 | 0.3388 | 1.0867 | 0.8325 | 1.1284 |
| resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8196 | 0.8303 | 0.8352 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | 0.3505 | 1.1277 | 0.826 | 1.0815 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7609 | 0.9526 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3304 | 1.0652 | 0.7507 | 0.8214 |
| soft_actor_critic | 256 | 0.9998 | 0.9638 | 0.4356 | 0.9637 | 0.75 | 0.9991 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8335 |
| hf_Bert | 4 | 1.0 | 0.9011 | 0.3525 | 1.0004 | 0.7061 | 1.0275 |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6902 | 0.7049 |
| LearningToPaint | 96 | 0.9442 | 0.6896 | 0.3385 | 0.6268 | 0.6881 | 0.913 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9553 |
| hf_DistilBert | 8 | 1.0 | 0.9042 | 0.3212 | 1.0228 | 0.6595 | 0.9466 |
| hf_Reformer | 4 | 0.9996 | 0.9996 | 0.5934 | 0.9995 | 0.577 | 1.0026 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9676 | 0.4867 | 0.6781 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4213 | 0.4334 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.2564 | 0.2576 |
| hf_GPT2_large | 4 | 1.0 | 0.8833 | nan | nan | nan | 1.1831 |
| tacotron2 | 64 | 0.9903 | 1.0926 | nan | 1.114 | nan | 1.1617 |
| hf_T5 | 8 | 1.0 | 0.9415 | nan | 0.9432 | nan | 1.1439 |
| dlrm | 2048 | 0.7302 | nan | nan | nan | nan | 0.7306 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | nan | nan |
| hf_BigBird | 2 | 0.907 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| timm_vision_transformer_large | 8 | 199.4215 | 198.939 | nan | nan | 193.1716 | 191.1201 |
| timm_nfnet | 128 | 205.744 | 206.3518 | nan | 181.2193 | 173.0975 | 177.9709 |
| Background_Matting | 4 | 186.5919 | 968.056 | nan | nan | 172.185 | 173.5688 |
| mobilenet_v2_quantized_qat | 96 | 148.836 | 150.6979 | nan | 101.0844 | 132.294 | 141.2698 |
| hf_T5_large | 2 | 184.7712 | 209.8818 | nan | nan | 118.9025 | 123.2848 |
| Super_SloMo | 6 | 117.5538 | 483.2044 | nan | 475.5894 | 102.7651 | 104.0397 |
| resnet50_quantized_qat | 32 | 93.2013 | 96.6134 | nan | 81.0453 | 97.3958 | 90.5831 |
| yolov3 | 16 | 102.3492 | 102.6725 | 129.1054 | 88.8496 | 94.34 | 95.2968 |
| vgg16 | 64 | 106.5221 | 106.285 | 123.926 | 106.5507 | 90.6861 | 91.1346 |
| timm_regnet | 32 | 101.2378 | 101.5544 | 125.8221 | 89.1176 | 86.7068 | 89.3252 |
| demucs | 4 | 77.7271 | 77.8151 | 77.786 | 77.6678 | 77.7799 | 77.9115 |
| resnet152 | 32 | 89.3431 | 85.3778 | 113.2946 | 98.8589 | 73.9044 | 76.0577 |
| hf_Reformer | 4 | 83.3325 | 83.1656 | 84.0928 | 113.4191 | 73.602 | 72.9807 |
| attention_is_all_you_need_pytorch | 256 | 72.0085 | 74.163 | 95.7598 | 76.5662 | 68.6387 | 69.3768 |
| mobilenet_v2 | 96 | 71.4442 | 71.4875 | 97.6414 | 53.4626 | 49.9142 | 50.7228 |
| pytorch_unet | 1 | 58.4724 | 207.0752 | nan | nan | 48.8804 | 49.3462 |
| hf_Bart | 4 | 53.9351 | 56.67 | 74.7151 | 63.7626 | 45.9948 | 46.135 |
| hf_Albert | 8 | 74.9558 | 75.3039 | 100.0051 | 48.3348 | 45.7229 | 45.7187 |
| fastNLP_Bert | 6 | 60.1836 | 60.8351 | 79.3263 | 51.7512 | 42.0992 | 42.9998 |
| timm_vovnet | 32 | 42.3444 | 42.5887 | 54.1862 | 42.7134 | 38.2698 | 37.712 |
| speech_transformer | 32 | 48.9444 | 55.2824 | 35.7325 | 73.2547 | 35.431 | 35.6886 |
| hf_GPT2 | 4 | 49.7409 | 51.2747 | 68.2354 | 124.5981 | 33.7041 | 33.4772 |
| hf_DistilBert | 8 | 38.8576 | 40.679 | 56.6931 | 73.8172 | 33.1813 | 32.9038 |
| hf_Bert | 4 | 38.0169 | 38.9993 | 53.3991 | 44.3536 | 32.4235 | 32.9376 |
| timm_efficientdet | 1 | 138.185 | 153.5658 | 75.6554 | 190.0643 | 32.4181 | 92.9624 |
| resnet50 | 32 | 38.7532 | 39.1694 | 50.9795 | 38.8084 | 32.2003 | 33.0348 |
| timm_efficientnet | 32 | 44.1816 | 52.0185 | 61.2031 | 54.4272 | 32.1069 | 36.2476 |
| shufflenet_v2_x1_0 | 128 | 37.079 | 35.1206 | 45.9071 | 43.8365 | 23.6239 | 26.3401 |
| BERT_pytorch | 16 | 45.7937 | 51.0123 | 42.1713 | 47.9458 | 22.8844 | 23.1823 |
| timm_resnest | 32 | 31.7129 | 31.6229 | 39.379 | 27.0632 | 20.8239 | 21.7499 |
| mnasnet1_0 | 32 | 29.3064 | 25.962 | 33.1938 | 32.1739 | 19.4782 | 22.4868 |
| pytorch_stargan | 16 | 24.278 | 22.4725 | 26.0908 | nan | 19.1053 | 19.9676 |
| mobilenet_v3_large | 32 | 30.8 | 28.3156 | 30.4181 | 39.7834 | 16.1322 | 23.6649 |
| resnext50_32x4d | 8 | 26.0743 | 23.6736 | 22.5481 | 34.5121 | 13.455 | 22.6415 |
| densenet121 | 4 | 63.8961 | 68.8702 | 29.1283 | 89.1745 | 13.0957 | 52.6962 |
| LearningToPaint | 96 | 15.5637 | 14.8618 | 18.091 | 16.8047 | 12.6387 | 13.014 |
| alexnet | 128 | 12.4359 | 12.4742 | 15.4705 | 12.3719 | 10.6756 | 10.7055 |
| pytorch_CycleGAN_and_pix2pix | 1 | 16.4072 | 16.3769 | 13.5106 | 19.5647 | 10.1181 | 11.4268 |
| timm_vision_transformer | 8 | 25.4919 | 25.1922 | 15.7115 | 40.9432 | 9.4827 | 17.2476 |
| nvidia_deeprecommender | 256 | 8.545 | 8.866 | 14.5926 | 8.7552 | 9.4323 | 8.8532 |
| tts_angular | 64 | 9.4545 | 9.7608 | 9.4724 | 9.6833 | 9.2459 | 9.2051 |
| squeezenet1_1 | 32 | 12.623 | 12.5854 | 12.0302 | 15.4541 | 7.354 | 10.1081 |
| resnet18 | 16 | 11.9586 | 10.6928 | 10.2823 | 13.7694 | 6.6046 | 9.8244 |
| pytorch_struct | 200 | 3.7944 | 4.9137 | 4.2829 | 5.4007 | 2.0888 | 3.3463 |
| dcgan | 32 | 2.6383 | 2.5518 | 2.111 | 3.2908 | 1.3637 | 2.5311 |
| drq | 1 | 2.9566 | 3.5099 | 1.774 | 4.9976 | 1.2045 | 2.9056 |
| soft_actor_critic | 256 | 1.0287 | 1.2655 | 0.995 | 1.4801 | 0.7229 | 1.1186 |
| lennard_jones | 1000 | 1.1268 | 1.2583 | 1.0717 | 1.8608 | 0.645 | 1.1747 |
| tacotron2 | 64 | 2816.2096 | 3047.301 | nan | 3714.693 | nan | 2989.3581 |
| dlrm | 2048 | 471.5789 | nan | nan | nan | nan | 478.5168 |
| hf_GPT2_large | 4 | 240.8753 | 245.7337 | nan | nan | nan | 164.0934 |
| hf_T5 | 8 | 183.4206 | 191.9917 | nan | 157.171 | nan | 116.9599 |
| functorch_dp_cifar10 | 64 | 11.2139 | 11.3676 | 5.329 | nan | nan | nan |
| hf_BigBird | 2 | 200.5969 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
~~~
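
For anyone who wants to summarize a speedup table from this dashboard, here is a minimal sketch of one way to do it. It parses the pipe-delimited rows and takes a geometric mean per backend. The helper names `parse_rows` and `geomean_speedup` are hypothetical, and the geometric mean is only an assumed summary statistic, not necessarily what the dashboard itself reports. It also assumes that a speedup of `0.0` marks a run with no measurement (the matching latency cells are `nan`), so those cells are skipped.

```python
import math


def parse_rows(table_text):
    """Parse a pipe-delimited ASCII table into a list of dicts keyed by header."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table_text.splitlines()
        if line.strip().startswith("|")  # skip "+----+" borders and blank lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(records, backend):
    """Geometric mean of one backend's speedups, skipping 0.0 and nan cells.

    Assumption: 0.0 means "no measurement", not a real speedup.
    """
    values = [float(record[backend]) for record in records]
    valid = [v for v in values if v > 0.0 and not math.isnan(v)]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(v) for v in valid) / len(valid))


# Three rows copied from the torchbench speedup table, trimmed to two backends.
sample = """
+----------+-----+--------+----------+
| name | bs | eager | inductor |
+----------+-----+--------+----------+
| resnet50 | 32 | 0.9989 | 1.2021 |
| alexnet | 128 | 0.9985 | 1.1626 |
| hf_T5 | 8 | 1.0011 | 0.0 |
+----------+-----+--------+----------+
"""

records = parse_rows(sample)
print(geomean_speedup(records, "inductor"))
```

Skipping the `0.0` cells keeps failed runs from collapsing the mean to zero, but it also hides how many models a backend failed on, so the pass rate from the Accuracy table is worth reading alongside it.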

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9997 | 0.9317 | 0.0 | 0.8833 | 1.7985 | 1.8125 |
| GPT2ForSequenceClassification | 4 | 1.0003 | 0.9781 | 0.0 | 0.6959 | 1.7724 | 1.7583 |
| XLNetLMHeadModel | 8 | 0.9987 | 0.9679 | 0.0 | 0.0 | 1.7075 | 1.7124 |
| MT5ForConditionalGeneration | 16 | 1.0223 | 0.9284 | 0.9634 | 1.0402 | 1.5577 | 1.5249 |
| GoogleFnet | 16 | 0.9996 | 0.9988 | 0.0 | 1.5412 | 1.4413 | 1.5593 |
| DistillGPT2 | 16 | 1.0001 | 0.9529 | 0.0 | 0.9205 | 1.4387 | 1.4857 |
| ElectraForQuestionAnswering | 64 | 0.9999 | 0.9854 | 0.0 | 1.1924 | 1.4278 | 1.4084 |
| T5ForConditionalGeneration | 4 | 1.0019 | 0.9389 | 0.7272 | 1.0972 | 1.4218 | 1.4107 |
| T5Small | 4 | 1.0016 | 0.9317 | 0.7213 | 1.101 | 1.4194 | 1.4103 |
| ElectraForCausalLM | 32 | 1.0004 | 0.9315 | 0.0 | 1.0167 | 1.4136 | 1.4492 |
| MobileBertForMaskedLM | 64 | 1.0221 | 0.934 | 0.8243 | 0.0 | 1.3759 | 1.2117 |
| LayoutLMForSequenceClassification | 16 | 0.9998 | 0.9799 | 0.7351 | 1.1174 | 1.3025 | 1.2903 |
| BertForQuestionAnswering | 16 | 1.0002 | 0.9889 | 0.7327 | 1.1154 | 1.2855 | 1.267 |
| RobertaForQuestionAnswering | 16 | 1.0004 | 0.9881 | 0.7339 | 1.1145 | 1.2773 | 1.2745 |
| RobertaForCausalLM | 16 | 1.0002 | 0.9712 | 0.0 | 1.0558 | 1.2653 | 1.2728 |
| AlbertForQuestionAnswering | 4 | 1.0 | 1.0014 | 0.0 | 1.2316 | 1.2558 | 1.2547 |
| AlbertForMaskedLM | 4 | 1.0003 | 1.0001 | 0.0 | 1.2269 | 1.2517 | 1.2481 |
| MegatronBertForQuestionAnswering | 8 | 1.0001 | 0.992 | 0.0 | 1.0948 | 1.215 | 1.1993 |
| MegatronBertForCausalLM | 4 | 1.0002 | 0.9839 | 0.7253 | 1.0623 | 1.2126 | 1.1975 |
| LayoutLMForMaskedLM | 16 | 1.0002 | 0.9702 | 0.0 | 1.0622 | 1.1935 | 1.1941 |
| BertForMaskedLM | 16 | 1.0 | 0.9701 | 0.0 | 1.0616 | 1.1731 | 1.176 |
| YituTechConvBert | 16 | 1.0 | 0.968 | 0.0 | 1.0019 | 1.1722 | 1.172 |
| CamemBert | 16 | 1.0002 | 0.9694 | 0.0 | 1.0604 | 1.1694 | 1.1749 |
| PLBartForConditionalGeneration | 4 | 1.0 | 0.9572 | 0.0 | 0.9629 | 1.1633 | 1.1637 |
| DistilBertForQuestionAnswering | 256 | 0.9999 | 0.9995 | 0.0 | 0.7884 | 1.1575 | 1.1538 |
| Reformer | 16 | 0.9996 | 0.9998 | 0.9777 | 0.9763 | 1.1372 | 1.151 |
| XGLMForCausalLM | 8 | 1.0103 | 0.8782 | 0.7392 | 0.3116 | 1.1345 | 1.153 |
| PLBartForCausalLM | 8 | 1.0004 | 0.9516 | 0.0 | 0.962 | 1.1306 | 1.183 |
| MobileBertForQuestionAnswering | 128 | 1.0229 | 0.9485 | 0.0 | 0.0 | 1.1281 | 1.1002 |
| MBartForConditionalGeneration | 2 | 1.0019 | 0.9875 | 0.0 | 1.02 | 1.0965 | 1.0871 |
| BartForConditionalGeneration | 2 | 1.0006 | 0.988 | 0.0 | 0.4503 | 1.0924 | 1.0857 |
| MBartForCausalLM | 4 | 1.0008 | 0.9662 | 0.7491 | 0.9991 | 1.0846 | 1.094 |
| BartForCausalLM | 4 | 1.0009 | 0.96 | 0.7555 | 0.9995 | 1.0823 | 1.0924 |
| M2M100ForConditionalGeneration | 16 | 1.0741 | 0.9564 | 0.0 | 1.006 | 1.0769 | 1.045 |
| DebertaForMaskedLM | 4 | 0.895 | 0.8036 | 0.7583 | 0.6255 | 1.0636 | 1.0374 |
| DebertaForQuestionAnswering | 8 | 0.9956 | 0.9903 | 0.6835 | 0.8614 | 1.0493 | 1.2177 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0008 | 0.9399 | 0.0 | 0.9496 | 1.0351 | 1.0417 |
| PegasusForConditionalGeneration | 32 | 0.9991 | 0.9805 | 0.0 | 0.9804 | 1.0158 | 1.0111 |
| DistilBertForMaskedLM | 128 | 0.9996 | 0.953 | 0.0 | 0.8025 | 1.0106 | 1.0302 |
| DebertaV2ForMaskedLM | 1 | 0.8786 | 0.7409 | 0.0 | 0.0 | 0.9984 | 0.8567 |
| Speech2Text2ForCausalLM | 256 | 0.9986 | 0.9246 | 0.6532 | 0.9435 | 0.9884 | 1.0247 |
| PegasusForCausalLM | 32 | 0.9991 | 0.9543 | 0.7333 | 0.9522 | 0.9716 | 0.9823 |
| BlenderbotSmallForCausalLM | 64 | 1.0009 | 0.9106 | 0.6831 | 0.9191 | 0.9576 | 0.988 |
| DebertaV2ForQuestionAnswering | 2 | 0.8978 | 0.8249 | 0.0 | 0.6189 | 0.9236 | 0.9097 |
| TrOCRForCausalLM | 32 | 1.0006 | 0.9567 | 0.0 | 0.9669 | 0.0 | 1.027 |
| BlenderbotForCausalLM | 4 | 1.0033 | 0.9856 | 0.0 | 0.9498 | 0.0 | 1.0114 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | pass |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| DebertaV2ForMaskedLM | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| BlenderbotForCausalLM | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaV2ForQuestionAnswering | 2 | 7.5875 | 15.1818 | nan | 159.8115 | 161.7475 | 46.9759 |
| DebertaV2ForMaskedLM | 1 | 7.3779 | 15.467 | nan | nan | 161.278 | 47.8922 |
| XLNetLMHeadModel | 8 | 4.792 | 16.6364 | nan | nan | 126.2002 | 121.323 |
| DebertaForQuestionAnswering | 8 | 4.6019 | 9.789 | 33.2378 | 71.9476 | 96.2649 | 36.0389 |
| DebertaForMaskedLM | 4 | 4.7554 | 9.867 | 33.2643 | 81.3146 | 94.9684 | 34.1765 |
| XGLMForCausalLM | 8 | 2.5621 | 10.0077 | 21.769 | 212.7425 | 71.4681 | 66.6312 |
| MobileBertForQuestionAnswering | 128 | 8.6125 | 23.3842 | nan | nan | 53.4883 | 50.91 |
| M2M100ForConditionalGeneration | 16 | 3.0992 | 12.7595 | nan | 252.1296 | 51.9282 | 51.6767 |
| MobileBertForMaskedLM | 64 | 8.3984 | 23.1411 | 40.9968 | nan | 51.3339 | 50.4505 |
| MT5ForConditionalGeneration | 16 | 3.743 | 10.8929 | 17.8416 | 136.0929 | 49.4498 | 49.312 |
| BartForConditionalGeneration | 2 | 3.2136 | 12.4001 | nan | 283.732 | 43.2163 | 41.333 |
| PegasusForConditionalGeneration | 32 | 2.9611 | 11.9179 | nan | 299.2691 | 42.4621 | 38.4451 |
| MBartForConditionalGeneration | 2 | 3.3456 | 12.4559 | nan | 311.9514 | 41.4191 | 41.1938 |
| YituTechConvBert | 16 | 2.3094 | 8.06 | nan | 126.1692 | 38.9222 | 36.7613 |
| MegatronBertForCausalLM | 4 | 3.3777 | 10.5918 | 17.605 | 205.9626 | 32.6124 | 31.3276 |
| MegatronBertForQuestionAnswering | 8 | 3.3184 | 10.2781 | nan | 199.7033 | 32.603 | 31.2717 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.0116 | 8.1719 | nan | 166.3589 | 29.1847 | 28.5676 |
| T5ForConditionalGeneration | 4 | 2.4261 | 7.5034 | 12.1175 | 79.2008 | 29.034 | 29.0646 |
| T5Small | 4 | 2.5333 | 7.5486 | 11.6567 | 79.4171 | 28.8451 | 29.0486 |
| LayoutLMForSequenceClassification | 16 | 1.9197 | 5.799 | 8.9599 | 86.3869 | 27.5204 | 27.0742 |
| GoogleFnet | 16 | 0.944 | 2.9279 | nan | 46.1857 | 26.5315 | 18.5367 |
| PLBartForConditionalGeneration | 4 | 1.6488 | 6.6455 | nan | 119.4927 | 25.5918 | 24.3619 |
| ElectraForCausalLM | 32 | 1.5712 | 5.3405 | nan | 90.5695 | 25.5713 | 23.2969 |
| PegasusForCausalLM | 32 | 1.2313 | 4.7909 | 7.8042 | 88.0873 | 21.691 | 19.543 |
| LayoutLMForMaskedLM | 16 | 1.9308 | 5.756 | nan | 90.3817 | 21.2734 | 19.6701 |
| MBartForCausalLM | 4 | 1.2729 | 5.0293 | 7.4579 | 90.4868 | 20.4445 | 20.2741 |
| BertForMaskedLM | 16 | 1.5999 | 5.2602 | nan | 89.7577 | 19.9824 | 18.9974 |
| ElectraForQuestionAnswering | 64 | 1.5658 | 5.2857 | nan | 83.1705 | 19.5348 | 18.7904 |
| BertForQuestionAnswering | 16 | 1.5968 | 5.2111 | 8.021 | 84.3944 | 19.4431 | 18.5533 |
| BartForCausalLM | 4 | 1.2528 | 4.7612 | 7.5136 | 82.3714 | 19.3188 | 18.6558 |
| CamemBert | 16 | 1.5512 | 5.3394 | nan | 87.3223 | 19.0945 | 18.1597 |
| RobertaForCausalLM | 16 | 1.5228 | 5.2952 | nan | 90.9357 | 18.9408 | 18.521 |
| RobertaForQuestionAnswering | 16 | 1.642 | 5.3445 | 8.4209 | 84.4476 | 18.0881 | 17.5392 |
| OPTForCausalLM | 2 | 1.323 | 4.9547 | nan | 77.0091 | 16.6272 | 16.123 |
| GPT2ForSequenceClassification | 4 | 1.5582 | 5.3058 | nan | 65.6459 | 16.4689 | 15.5738 |
| Reformer | 16 | 1.4846 | 2.7037 | 5.2397 | 15.9862 | 16.1885 | 13.525 |
| AlbertForMaskedLM | 4 | 1.3646 | 4.591 | nan | 114.6886 | 15.065 | 15.0064 |
| AlbertForQuestionAnswering | 4 | 1.3686 | 4.6903 | nan | 110.8611 | 14.8529 | 14.5696 |
| BlenderbotSmallForCausalLM | 64 | 0.7931 | 3.2122 | 5.0298 | 55.1768 | 13.8587 | 13.3286 |
| DistillGPT2 | 16 | 0.8207 | 2.6474 | nan | 33.5945 | 13.1865 | 12.5447 |
| Speech2Text2ForCausalLM | 256 | 0.7903 | 2.7001 | 4.5838 | 37.0374 | 13.1156 | 12.0637 |
| PLBartForCausalLM | 8 | 0.6916 | 2.6687 | nan | 44.6682 | 12.8151 | 11.9711 |
| DistilBertForMaskedLM | 128 | 0.6653 | 2.6879 | nan | 41.2882 | 11.0028 | 10.4836 |
| DistilBertForQuestionAnswering | 256 | 0.7286 | 2.7646 | nan | 41.2394 | 10.5284 | 9.7104 |
| BlenderbotForCausalLM | 4 | 2.3307 | 9.2431 | nan | 204.3171 | nan | 36.7033 |
| TrOCRForCausalLM | 32 | 1.2496 | 4.7899 | nan | 82.9951 | nan | 18.3009 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 1.0 | 0.9092 | nan | 1.1724 | 1.0595 | 1.1588 |
| XLNetLMHeadModel | 8 | 1.0 | 0.9323 | nan | nan | 0.9946 | 0.9946 |
| GoogleFnet | 16 | 0.9224 | 0.9224 | nan | 1.4614 | 0.9608 | 1.2768 |
| PLBartForConditionalGeneration | 4 | 0.9999 | 0.9344 | nan | 1.274 | 0.9316 | 1.2234 |
| OPTForCausalLM | 2 | 1.0001 | 0.9258 | nan | 1.0746 | 0.9068 | 1.1143 |
| YituTechConvBert | 16 | 0.9966 | 0.9341 | nan | 0.9891 | 0.894 | 0.9822 |
| DistillGPT2 | 16 | 1.0 | 0.8855 | nan | 1.055 | 0.8939 | 1.0108 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | 0.8646 | 1.4039 |
| PegasusForConditionalGeneration | 32 | 0.9981 | 0.9529 | nan | 1.1152 | 0.8637 | 1.0262 |
| M2M100ForConditionalGeneration | 16 | 0.9901 | 0.9084 | nan | 1.0107 | 0.8492 | 1.0061 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | 0.842 | 1.3737 |
| PLBartForCausalLM | 8 | 1.0 | 0.8896 | nan | 1.0988 | 0.8367 | 1.0581 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8215 | 1.1049 |
| T5Small | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8215 | 1.1049 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | 0.9742 | 0.8157 | 0.9642 |
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.8429 | 0.7929 | 0.9036 |
| MBartForConditionalGeneration | 2 | 1.0 | 0.8931 | nan | 0.9681 | 0.7896 | 0.9837 |
| MT5ForConditionalGeneration | 16 | 1.0014 | 0.8793 | 0.4388 | 0.9365 | 0.7785 | 0.9242 |
| PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | 1.0402 | 0.7774 | 0.9692 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.7734 | 0.958 |
| MegatronBertForQuestionAnswering | 8 | 1.0 | 0.9223 | nan | 1.0616 | 0.7709 | 1.0379 |
| MegatronBertForCausalLM | 4 | 1.0 | 0.9018 | 0.3475 | 0.9999 | 0.7673 | 1.0153 |
| MBartForCausalLM | 4 | 1.0 | 0.9122 | 0.3642 | 1.0011 | 0.7326 | 0.9478 |
| RobertaForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0273 |
| BertForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0273 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.7189 | 1.0294 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.7147 | 0.8647
| | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 1.1607 | 0.7054 | 1.0297 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 1.0067 | 0.6977 | 0.946 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | 0.695 | 0.9772 | | BertForMaskedLM | 16 | 1.0 | 0.9408 | nan | 0.9928 | 0.6945 | 0.9772 | | RobertaForCausalLM | 16 | 1.0 | 0.9405 | nan | 0.9926 | 0.6942 | 0.9771 | | CamemBert | 16 | 1.0 | 0.9388 | nan | 0.987 | 0.6942 | 0.9746 | | Speech2Text2ForCausalLM | 256 | 0.9545 | 0.8748 | 0.3515 | 0.8692 | 0.675 | 0.9168 | | DistilBertForQuestionAnswering | 256 | 1.0 | 0.9602 | nan | 1.1897 | 0.6589 | 0.9118 | | DistilBertForMaskedLM | 128 | 1.0 | 0.8847 | nan | 0.8827 | 0.6509 | 0.9194 | | Reformer | 16 | 0.9773 | 0.9773 | 0.5544 | 0.9998 | 0.573 | 1.0028 | | DebertaV2ForMaskedLM | 1 | 1.0 | 0.9651 | nan | nan | 0.5682 | 0.9491 | | MobileBertForMaskedLM | 64 | 1.0 | 0.906 | 0.3175 | nan | 0.4951 | 0.6649 | | DebertaV2ForQuestionAnswering | 2 | 0.9842 | 0.9842 | nan | 0.9842 | 0.4735 | 0.984 | | MobileBertForQuestionAnswering | 128 | 1.0 | 0.9909 | nan | nan | 0.4145 | 0.535 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | 0.9719 | 0.3862 | 1.0347 | | DebertaForQuestionAnswering | 8 | 0.9637 | 1.042 | 0.3072 | 1.1342 | 0.2902 | 1.1339 | | TrOCRForCausalLM | 32 | 1.0 | 0.8787 | nan | 0.9998 | nan | 0.9239 | | BlenderbotForCausalLM | 4 | 1.0001 | 0.8057 | nan | 0.8218 | nan | 0.8509 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForMaskedLM | 4 | 387.7248 | 387.2579 | nan | 314.4472 | 309.567 | 310.4201 | | AlbertForQuestionAnswering | 4 | 384.543 | 383.971 | nan | 311.2643 | 306.6775 | 306.1384 | | Reformer | 16 | 305.9596 | 305.7486 | 313.4608 | 313.1795 | 269.4018 | 265.5577 | | XLNetLMHeadModel | 8 | 376.7996 | 389.1473 | nan | nan | 221.5258 | 220.3034 | | PegasusForConditionalGeneration | 32 | 176.1373 | 179.3812 | nan | 179.9476 | 173.822 | 173.9286 | | MegatronBertForQuestionAnswering | 8 | 172.3053 | 173.8933 | nan | 157.3752 | 142.0267 | 143.4791 | | BartForConditionalGeneration | 2 | 149.9114 | 151.8602 | nan | 333.1933 | 137.2785 | 138.0897 | | MBartForConditionalGeneration | 2 | 150.0928 | 152.1319 | nan | 146.732 | 136.781 | 137.9904 | | YituTechConvBert | 16 | 155.4528 | 160.4522 | nan | 154.9035 | 132.586 | 132.5151 | | DistilBertForQuestionAnswering | 256 | 144.7182 | 144.6839 | nan | 183.3724 | 125.415 | 125.6805 | | MobileBertForQuestionAnswering | 128 | 152.7414 | 149.5507 | nan | nan | 124.7872 | 129.0241 | | DistilBertForMaskedLM | 128 | 122.128 | 128.1322 | nan | 152.2538 | 121.0061 | 118.5078 | | MobileBertForMaskedLM | 64 | 145.7956 | 160.0503 | 196.1489 | nan | 120.566 | 129.0674 | | CamemBert | 16 | 135.613 | 139.8573 | nan | 127.9773 | 116.2498 | 115.4817 | | BlenderbotSmallForConditionalGeneration | 64 | 119.0104 | 126.6577 | nan | 125.2398 | 115.1871 | 114.2546 | | LayoutLMForMaskedLM | 16 | 137.0088 | 141.0325 | nan | 128.8164 | 114.8154 | 114.6003 | | BertForMaskedLM | 16 | 134.3134 | 138.4206 | nan | 126.5905 | 114.629 | 114.1442 | | DebertaV2ForQuestionAnswering | 2 | 117.7623 | 128.2774 | nan | 170.5264 | 114.3336 | 116.1827 | | BartForCausalLM | 4 | 123.4201 | 126.6475 | 163.6442 | 123.3318 | 114.0589 | 112.8794 | | MBartForCausalLM | 4 | 123.2186 | 127.3062 | 162.6153 | 123.4965 | 113.7313 | 112.5552 | | 
RobertaForCausalLM | 16 | 142.5583 | 146.7229 | nan | 135.0643 | 112.915 | 112.0249 | | M2M100ForConditionalGeneration | 16 | 108.5421 | 122.1431 | nan | 114.9843 | 108.3458 | 111.9751 | | PLBartForConditionalGeneration | 4 | 121.482 | 128.0077 | nan | 126.6553 | 104.8612 | 104.3759 | | PLBartForCausalLM | 8 | 114.8833 | 123.798 | nan | 121.1135 | 102.2554 | 99.6237 | | OPTForCausalLM | 2 | 169.0357 | 181.2948 | nan | 189.5036 | 93.8385 | 93.0748 | | PegasusForCausalLM | 32 | 85.6122 | 89.5336 | 116.7848 | 89.9086 | 88.3129 | 87.0925 | | DebertaV2ForMaskedLM | 1 | 106.9448 | 119.6478 | nan | nan | 87.6216 | 101.8518 | | ElectraForQuestionAnswering | 64 | 124.9066 | 126.8964 | nan | 104.8065 | 87.4892 | 88.6199 | | RobertaForQuestionAnswering | 16 | 111.1348 | 112.3076 | 151.2087 | 99.4998 | 87.1722 | 87.1613 | | LayoutLMForSequenceClassification | 16 | 113.1364 | 115.7178 | 155.144 | 101.2791 | 86.9892 | 87.7157 | | BertForQuestionAnswering | 16 | 110.5672 | 111.7834 | 151.0446 | 99.2096 | 86.2522 | 87.3818 | | MegatronBertForCausalLM | 4 | 101.8232 | 103.8178 | 141.4491 | 95.8941 | 84.3843 | 85.0355 | | DistillGPT2 | 16 | 120.7001 | 126.7558 | nan | 131.1247 | 83.8503 | 81.1932 | | DebertaForQuestionAnswering | 8 | 82.2126 | 82.6319 | 119.8082 | 95.0766 | 78.1838 | 67.1539 | | ElectraForCausalLM | 32 | 105.8005 | 113.5525 | nan | 104.1467 | 75.084 | 72.9629 | | T5Small | 4 | 104.0766 | 111.4264 | 144.6069 | 94.6693 | 73.3347 | 73.7778 | | T5ForConditionalGeneration | 4 | 104.0798 | 111.4372 | 143.3223 | 94.9882 | 73.2494 | 73.8295 | | GoogleFnet | 16 | 101.7689 | 101.8783 | nan | 66.1608 | 70.6647 | 65.2631 | | XGLMForCausalLM | 8 | 79.9659 | 93.182 | 108.5829 | 253.6017 | 70.3461 | 69.8984 | | BlenderbotSmallForCausalLM | 64 | 64.6532 | 71.1341 | 94.7125 | 70.4694 | 67.6767 | 65.4294 | | Speech2Text2ForCausalLM | 256 | 63.9218 | 69.4133 | 97.9279 | 67.9627 | 64.8148 | 62.3735 | | GPT2ForSequenceClassification | 4 | 102.4815 | 104.5575 | nan | 148.6793 | 57.7143 | 
58.0926 | | MT5ForConditionalGeneration | 16 | 92.0599 | 95.3754 | 97.0767 | 85.8018 | 57.4945 | 59.5948 | | DebertaForMaskedLM | 4 | 72.2667 | 74.1809 | 82.9622 | 97.7468 | 56.0433 | 57.9118 | | TrOCRForCausalLM | 32 | 167.9484 | 174.6552 | nan | 174.4972 | nan | 163.3366 | | BlenderbotForCausalLM | 4 | 92.6291 | 94.6425 | nan | 98.265 | nan | 92.5935 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
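
The tables in this comment are plain ASCII grids, so they can be loaded for further analysis with a few lines of Python. The following is a minimal sketch, not part of the benchmark tooling: `parse_table` is a hypothetical helper, the sample rows are copied from the Absolute latency table above, and treating `nan` as "no measurement for this backend" is an assumption drawn from the data rather than something the dashboard states.

```python
import math

# Three rows copied from the huggingface "Absolute latency (ms)" table above.
SAMPLE = """
+-----------------------------+----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------+----+----------+-----------+----------------+-----------------+----------+------------------------+
| MT5ForConditionalGeneration | 16 | 92.0599 | 95.3754 | 97.0767 | 85.8018 | 57.4945 | 59.5948 |
| TrOCRForCausalLM | 32 | 167.9484 | 174.6552 | nan | 174.4972 | nan | 163.3366 |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------+----+----------+-----------+----------------+-----------------+----------+------------------------+
"""


def parse_table(text):
    """Parse an ASCII grid table into a list of dicts (hypothetical helper)."""
    header = None
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        # Border lines start with "+", data and header lines start with "|".
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        row = {}
        for key, cell in zip(header, cells):
            if key == "name":
                row[key] = cell
            elif key == "bs":
                row[key] = int(cell)
            else:
                value = float(cell)  # float() accepts the string "nan"
                # Assumption: nan means the backend produced no measurement.
                row[key] = None if math.isnan(value) else value
        rows.append(row)
    return rows


rows = parse_table(SAMPLE)
missing_inductor = [r["name"] for r in rows if r["inductor"] is None]
print(missing_inductor)
```

Mapping `nan` to `None` keeps missing measurements out of later arithmetic such as averages, where a stray `nan` would silently propagate.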

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | ghostnet_100 | 128 | 0.999 | 0.9705 | 0.8272 | 1.2966 | 1.8699 | 1.8307 | | lcnet_050 | 128 | 0.9547 | 0.9495 | 0.7529 | 1.2879 | 1.6578 | 1.5009 | | regnety_002 | 128 | 0.9766 | 1.002 | 0.8451 | 0.9692 | 1.479 | 1.3279 | | hrnet_w18 | 128 | 1.0 | 0.9986 | 0.0 | 1.278 | 1.4152 | 1.3769 | | dla102 | 128 | 0.9999 | 1.0009 | 0.0 | 1.2835 | 1.3834 | 1.3682 | | volo_d1_224 | 64 | 0.9999 | 0.9956 | 0.8025 | 0.0 | 1.3778 | 1.3595 | | res2net50_14w_8s | 128 | 0.9999 | 0.9993 | 0.0 | 1.2507 | 1.3552 | 1.3243 | | coat_lite_mini | 128 | 1.0 | 0.9997 | 0.845 | 1.0937 | 1.3512 | 1.3321 | | xcit_large_24_p8_224 | 5 | 1.0014 | 0.9931 | 0.7789 | 0.0 | 1.3361 | 1.2968 | | mobilenetv2_100 | 128 | 0.9662 | 0.963 | 0.7063 | 1.2793 | 1.3342 | 1.355 | | mobilenetv3_large_100 | 128 | 0.9655 | 0.9604 | 0.7647 | 1.2855 | 1.3309 | 1.347 | | gluon_inception_v3 | 128 | 0.9999 | 0.9988 | 0.0 | 1.1291 | 1.3268 | 1.3068 | | inception_v3 | 128 | 1.0 | 0.9989 | 0.0 | 1.1291 | 1.326 | 1.3073 | | adv_inception_v3 | 128 | 1.0 | 0.9986 | 0.0 | 1.1292 | 1.3238 | 1.3068 | | crossvit_9_240 | 128 | 0.9996 | 0.9985 | 0.7605 | 1.0345 | 1.3207 | 1.3011 | | resnest101e | 64 | 0.9998 | 1.0034 | 0.0 | 1.1696 | 1.3164 | 1.2703 | | res2next50 | 128 | 0.9999 | 1.0008 | 0.0 | 1.1804 | 1.3103 | 1.274 | | fbnetv3_b | 128 | 0.9649 | 0.9589 | 0.7618 | 1.235 | 1.2844 | 1.2969 | | gmixer_24_224 | 128 | 0.9998 | 0.8347 | 0.0 | 1.0829 | 1.2697 | 1.2599 | | botnet26t_256 | 128 | 0.9857 | 0.9843 | 0.7897 | 0.0 | 1.269 | 1.2804 | | sebotnet33ts_256 | 64 | 0.9759 | 0.8072 | 0.0 | 0.0 | 1.2661 | 1.2769 | | selecsls42b | 128 | 0.9999 | 
0.9983 | 0.8146 | 1.2158 | 1.2651 | 1.2531 | | mnasnet_100 | 128 | 0.9662 | 0.9619 | 0.7862 | 1.2537 | 1.2644 | 1.2815 | | eca_botnext26ts_256 | 128 | 0.9861 | 0.7728 | 0.0 | 0.0 | 1.2639 | 1.2561 | | jx_nest_base | 32 | 0.9997 | 0.9955 | 0.7365 | 0.0 | 1.2594 | 1.2305 | | tf_efficientnet_b0 | 128 | 0.9772 | 0.7837 | 0.0 | 1.164 | 1.2582 | 1.2662 | | eca_halonext26ts | 128 | 0.9873 | 0.7791 | 0.0 | 0.0 | 1.2527 | 1.2399 | | fbnetc_100 | 128 | 0.966 | 0.963 | 0.79 | 1.2456 | 1.2499 | 1.2661 | | ese_vovnet19b_dw | 128 | 0.9794 | 0.9785 | 0.7441 | 1.1492 | 1.2397 | 1.2417 | | convit_base | 64 | 0.9996 | 0.999 | 0.0 | 0.0 | 1.2361 | 1.2118 | | spnasnet_100 | 128 | 0.9619 | 0.9585 | 0.7741 | 1.2232 | 1.2359 | 1.254 | | cspdarknet53 | 64 | 0.958 | 0.954 | 0.7358 | 1.1746 | 1.2319 | 1.2404 | | res2net101_26w_4s | 64 | 0.9998 | 0.9961 | 0.7696 | 1.1149 | 1.2297 | 1.1874 | | cait_m36_384 | 4 | 0.9996 | 0.9994 | 0.0 | 0.0 | 1.2166 | 1.1914 | | rexnet_100 | 128 | 0.973 | 0.8147 | 0.0 | 1.1561 | 1.2111 | 1.2202 | | gmlp_s16_224 | 128 | 0.9997 | 0.9989 | 0.0 | 1.0848 | 1.2087 | 1.2027 | | pnasnet5large | 16 | 1.0001 | 0.9952 | 0.0 | 1.0897 | 1.2081 | 1.1919 | | tinynet_a | 128 | 0.9665 | 0.7736 | 0.6211 | 1.1469 | 1.1894 | 1.1994 | | dpn107 | 32 | 0.9591 | 0.95 | 0.7806 | 1.0278 | 1.1892 | 1.2016 | | dm_nfnet_f0 | 128 | 0.9992 | 1.0 | 0.0 | 1.1381 | 1.189 | 1.1592 | | tf_mixnet_l | 128 | 0.9855 | 0.8898 | 0.0 | 1.0938 | 1.1864 | 1.1863 | | pit_b_224 | 64 | 1.0001 | 0.9995 | 0.0 | 1.0325 | 1.1858 | 1.1743 | | twins_pcpvt_base | 64 | 1.0 | 0.9987 | 0.7397 | 0.0 | 1.1785 | 1.15 | | mixnet_l | 128 | 0.9851 | 0.8859 | 0.0 | 1.0982 | 1.1744 | 1.1711 | | repvgg_a2 | 128 | 0.9641 | 0.9636 | 0.8275 | 1.1368 | 1.1718 | 1.1687 | | mobilevit_s | 64 | 0.9795 | 0.7625 | 0.0 | 0.0 | 1.168 | 1.1701 | | poolformer_m36 | 64 | 0.9998 | 0.9999 | 0.0 | 0.0 | 1.1667 | 1.1486 | | nfnet_l0 | 128 | 0.9996 | 0.7888 | 0.0 | 1.1051 | 1.1467 | 1.1133 | | swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9795 
| 0.0 | 0.0 | 1.1212 | 1.1177 | | beit_base_patch16_224 | 64 | 0.9999 | 0.9816 | 0.0 | 0.0 | 1.1126 | 1.1 | | swsl_resnext101_32x16d | 32 | 1.0 | 1.0001 | 0.0 | 1.1082 | 1.1097 | 1.0713 | | deit_base_distilled_patch16_224 | 64 | 1.0011 | 0.9991 | 0.7702 | 0.9816 | 1.0963 | 1.0821 | | gluon_xception65 | 32 | 1.0 | 0.9977 | 0.0 | 1.08 | 1.086 | 1.0748 | | vit_base_patch16_224 | 64 | 0.9994 | 0.9991 | 0.7689 | 0.951 | 1.0852 | 1.0735 | | convmixer_768_32 | 32 | 0.9999 | 1.0 | 0.0 | 0.0 | 1.0776 | 1.0738 | | mixer_b16_224 | 128 | 1.0003 | 1.0003 | 0.0 | 0.8971 | 1.0744 | 1.0669 | | gernet_l | 128 | 0.9745 | 0.9733 | 0.8229 | 1.0994 | 1.0738 | 1.0714 | | visformer_small | 128 | 0.9998 | 1.0028 | 0.7972 | 0.0 | 1.0436 | 1.0092 | | convnext_base | 64 | 1.0 | 0.9986 | 0.0 | 0.0 | 1.0346 | 1.0371 | | resmlp_12_224 | 128 | 0.9998 | 1.0011 | 0.6954 | 1.2078 | 0.9596 | 0.9761 | | tnt_s_patch16_224 | 128 | 0.9998 | 0.9995 | 0.0 | 0.0 | 0.0 | 1.5083 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | 
pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | 
pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | 
aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 6.2929 | 24.9713 | nan | 682.4262 | 105.0616 | 99.9065 | | twins_pcpvt_base | 64 | 2.2306 | 10.2708 | 19.5652 | nan | 81.1809 | 79.1741 | | swin_base_patch4_window7_224 | 64 | 2.7046 | 10.0411 | nan | nan | 76.7804 | 74.366 | | mobilevit_s | 64 | 1.795 | 6.1616 | nan | nan | 75.4725 | 74.5606 | | xcit_large_24_p8_224 | 5 | 2.939 | 13.7481 | 25.7207 | nan | 74.9281 | 72.5685 | | pnasnet5large | 16 | 4.5441 | 18.5032 | nan | 359.8606 | 70.1303 | 65.8596 | | cait_m36_384 | 4 | 3.0749 | 14.6281 | nan | nan | 60.748 | 57.4591 | | coat_lite_mini | 128 | 1.0024 | 4.3658 | 6.4908 | 96.2783 | 59.0024 | 58.7995 | | dm_nfnet_f0 | 128 | 2.0626 | 6.0381 | nan | 153.9005 | 58.9538 | 57.1287 | | resnest101e | 64 | 3.3274 | 12.8637 | nan | 283.0956 | 55.9483 | 53.3725 | | jx_nest_base | 32 | 1.7569 | 7.142 | 12.1267 | nan | 52.5545 | 51.9513 | | res2net101_26w_4s | 64 | 3.1557 | 13.2568 | 23.4991 | 246.8888 | 50.9245 | 48.3492 | | eca_halonext26ts | 128 | 1.3954 | 4.4165 | nan | nan | 49.5536 | 48.7609 | | res2net50_14w_8s | 128 | 2.7636 | 12.3806 | nan | 260.2003 | 47.8915 | 44.9596 | | poolformer_m36 | 64 | 1.6989 | 6.2226 | nan | nan | 45.3826 | 43.747 | | nfnet_l0 | 128 | 1.7925 | 6.2741 | nan | 127.404 | 45.2826 | 44.1609 | | convnext_base | 64 | 1.3446 | 5.3953 | nan | nan | 43.8093 | 43.1254 | | gmlp_s16_224 | 128 | 1.0506 | 5.2818 | nan | 157.9222 | 40.2729 | 38.3562 | | sebotnet33ts_256 | 64 | 1.6346 | 5.3073 | nan | nan | 40.0999 | 38.7716 | | volo_d1_224 | 64 | 1.3375 | 5.9329 | 9.8781 | nan | 37.3891 | 34.694 | | dpn107 | 32 | 3.9109 | 11.4184 | 35.1804 | 172.9252 | 37.0718 | 35.0352 | | crossvit_9_240 | 128 | 1.4747 | 6.3781 | 10.2929 | 170.427 | 36.5509 | 34.3838 | | fbnetv3_b | 128 | 3.0896 | 9.6588 | 24.5036 | 241.5964 | 34.9682 | 32.2209 | | 
gluon_xception65 | 32 | 1.98 | 8.8028 | nan | 152.9882 | 34.4997 | 32.298 | | eca_botnext26ts_256 | 128 | 1.4021 | 4.2598 | nan | nan | 33.9288 | 32.7668 | | adv_inception_v3 | 128 | 1.6373 | 6.8264 | nan | 148.4554 | 32.9529 | 29.7838 | | tf_mixnet_l | 128 | 5.8673 | 11.296 | nan | 153.575 | 32.1705 | 31.0103 | | gmixer_24_224 | 128 | 1.1662 | 5.8336 | nan | 146.6337 | 31.7622 | 29.4381 | | gluon_inception_v3 | 128 | 1.6569 | 7.4532 | nan | 140.6821 | 31.4334 | 30.7537 | | inception_v3 | 128 | 1.6646 | 6.9587 | nan | 148.527 | 31.403 | 29.8779 | | mixnet_l | 128 | 5.3424 | 10.7621 | nan | 155.7384 | 31.2219 | 29.632 | | ghostnet_100 | 128 | 2.9285 | 8.2655 | 12.0286 | 159.4196 | 30.9692 | 30.0942 | | botnet26t_256 | 128 | 1.2795 | 3.7314 | 8.6044 | nan | 30.1698 | 29.1958 | | dla102 | 128 | 1.8243 | 7.6097 | nan | 179.4639 | 29.7086 | 27.7706 | | swsl_resnext101_32x16d | 32 | 1.776 | 7.4858 | nan | 127.895 | 28.7597 | 27.1666 | | res2next50 | 128 | 1.6017 | 6.6431 | nan | 153.0573 | 27.4802 | 25.418 | | convit_base | 64 | 1.0362 | 4.4656 | nan | nan | 27.3648 | 27.2904 | | rexnet_100 | 128 | 1.9735 | 6.4265 | nan | 147.1292 | 25.4388 | 23.9873 | | tinynet_a | 128 | 2.057 | 6.8489 | 17.4042 | 143.3406 | 25.2962 | 23.3033 | | cspdarknet53 | 64 | 2.3452 | 6.2183 | 16.855 | 118.8192 | 22.069 | 21.6262 | | resmlp_12_224 | 128 | 0.6018 | 2.3546 | 3.6816 | 32.9956 | 22.0502 | 20.0512 | | mixer_b16_224 | 128 | 0.5643 | 2.577 | nan | 67.6397 | 21.9931 | 21.4167 | | tf_efficientnet_b0 | 128 | 1.7859 | 6.122 | nan | 128.2621 | 21.6278 | 20.8771 | | convmixer_768_32 | 32 | 1.1919 | 4.9801 | nan | nan | 21.2736 | 19.8474 | | fbnetc_100 | 128 | 2.0446 | 5.5417 | 16.0445 | 111.0159 | 21.1569 | 19.691 | | visformer_small | 128 | 0.9168 | 3.2756 | 5.2158 | nan | 21.1231 | 20.7251 | | spnasnet_100 | 128 | 1.9818 | 5.7908 | 15.1474 | 108.049 | 20.4729 | 19.6643 | | pit_b_224 | 64 | 0.9582 | 3.7261 | nan | 99.2475 | 19.5901 | 19.2647 | | mobilenetv3_large_100 | 128 | 1.5478 | 4.8059 
| 11.6987 | 120.332 | 19.4954 | 18.5854 | | beit_base_patch16_224 | 64 | 1.1204 | 4.5117 | nan | nan | 19.3842 | 18.561 | | deit_base_distilled_patch16_224 | 64 | 0.805 | 3.4474 | 5.461 | 75.3703 | 18.7271 | 18.1298 | | vit_base_patch16_224 | 64 | 0.8083 | 3.3512 | 5.3099 | 75.3835 | 18.5136 | 17.5917 | | mobilenetv2_100 | 128 | 1.6787 | 4.6357 | 11.8227 | 99.383 | 18.0051 | 17.0878 | | regnety_002 | 128 | 1.568 | 4.5601 | 11.4298 | 93.4265 | 17.6547 | 16.2406 | | repvgg_a2 | 128 | 1.9481 | 5.1264 | 14.0335 | 156.1892 | 17.5825 | 17.054 | | mnasnet_100 | 128 | 1.6016 | 4.6232 | 11.8916 | 89.751 | 17.4419 | 17.6849 | | gernet_l | 128 | 1.9583 | 5.0948 | 13.7453 | 86.4319 | 16.9644 | 16.2043 | | selecsls42b | 128 | 0.7419 | 3.0366 | 4.8283 | 74.3226 | 15.2609 | 14.5925 | | lcnet_050 | 128 | 0.9468 | 2.9714 | 6.7714 | 65.3216 | 12.8526 | 12.1026 | | ese_vovnet19b_dw | 128 | 0.9097 | 2.4545 | 5.9097 | 54.5864 | 12.4057 | 11.871 | | tnt_s_patch16_224 | 128 | 1.6533 | 8.4705 | nan | nan | nan | 32.4096 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 1.6177 | 1.5612 | 1.6333 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2616 | 0.9898 | 1.351 | 1.5843 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2619 | 1.4738 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.2059 | 1.3819 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1917 | 1.1877 | 1.3424 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 | 
| mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1752 | 1.2828 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | nan | 1.1378 | 1.2737 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1376 | 1.2529 | | nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.7757 | 1.1264 | 1.3578 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1133 | 1.1802 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.0689 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 1.2304 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.9926 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 1.2337 | 0.9924 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | nan | 0.9883 | 1.0887 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 0.9853 | 1.1265 | | mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 1.4726 | 0.985 | 1.0538 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9837 | 1.0658 | | gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 1.0153 | 0.9766 | 0.9827 | | tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9765 | 1.1445 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | 0.3296 | nan | 0.9658 | 1.0598 | | dla102 | 128 | 0.9831 | 0.917 | nan | 0.953 | 0.9633 | 1.0419 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 0.952 | 1.0925 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9468 | 1.1098 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.7593 | 0.9435 | 1.0967 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.942 | 0.988 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9408 | 1.0412 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9382 | 0.993 | | hrnet_w18 | 128 | 0.9954 | 
0.9252 | nan | 0.8646 | 0.9379 | 1.0122 | | jx_nest_base | 32 | 1.0002 | 0.8966 | 0.2864 | nan | 0.9348 | 1.0603 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9325 | 0.9919 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9285 | 1.0154 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.9152 | 0.9655 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9078 | 1.0156 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9068 | 1.0516 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9065 | 1.0615 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.894 | 0.9058 | 0.9905 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9049 | 0.9968 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.994 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.899 | 1.0046 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8932 | 0.9946 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8822 | 1.0206 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8617 | 1.0396 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.8578 | 1.0369 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8526 | 1.0752 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.9856 | 0.8212 | 1.0246 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.8189 | 0.9416 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | 1.3763 | 0.8169 | 0.8253 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9926 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.657 | 0.7684 | 0.9902 | | convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.7449 | 0.9008 | | 
crossvit_9_240 | 128 | 0.9884 | 0.8657 | 0.282 | 1.1222 | 0.6742 | 0.9001 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8633 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 365.1715 | 364.9802 | nan | nan | 339.1181 | 339.8642 | | hrnet_w18 | 128 | 416.9867 | 417.3067 | nan | 325.8694 | 293.7206 | 302.2691 | | convnext_base | 64 | 263.1941 | 263.431 | nan | nan | 254.1796 | 253.5463 | | pnasnet5large | 16 | 289.5518 | 290.3664 | nan | 265.1746 | 239.3766 | 242.6242 | | tf_mixnet_l | 128 | 256.6419 | 284.2774 | nan | 231.089 | 213.0748 | 213.6383 | | swin_base_patch4_window7_224 | 64 | 237.1134 | 241.8313 | nan | nan | 211.3143 | 211.9673 | | mixnet_l | 128 | 247.0435 | 275.1953 | nan | 222.0309 | 207.5335 | 207.8259 | | swsl_resnext101_32x16d | 32 | 219.2984 | 219.8247 | nan | 197.9398 | 198.0849 | 205.1263 | | dla102 | 128 | 269.5865 | 269.1213 | nan | 209.6743 | 194.9212 | 196.7203 | | cait_m36_384 | 4 | 217.0722 | 216.4743 | nan | nan | 178.0832 | 181.6369 | | resnest101e | 64 | 230.6676 | 229.8405 | nan | 197.2912 | 175.2213 | 181.323 | | dm_nfnet_f0 | 128 | 205.5241 | 205.7681 | nan | 181.1209 | 173.1752 | 177.9174 | | adv_inception_v3 | 128 | 226.2175 | 226.7987 | nan | 200.6797 | 171.2765 | 173.1554 | | gluon_inception_v3 | 128 | 226.6443 | 226.8118 | nan | 200.6427 | 170.6557 | 173.1369 | | inception_v3 | 128 | 226.368 | 226.5059 | nan | 200.9492 | 170.618 | 173.1438 | | res2net50_14w_8s | 128 | 229.6827 | 229.6853 | nan | 
183.5885 | 169.175 | 172.9656 | | gluon_xception65 | 32 | 182.9635 | 182.8431 | nan | 168.8265 | 167.9453 | 169.5762 | | convit_base | 64 | 196.6202 | 196.6527 | nan | nan | 159.3757 | 162.4718 | | res2next50 | 128 | 206.8256 | 206.647 | nan | 175.534 | 157.7007 | 162.3691 | | dpn107 | 32 | 190.8675 | 192.4899 | 234.7342 | 178.2091 | 153.9699 | 152.2509 | | nfnet_l0 | 128 | 176.7768 | 223.1128 | nan | 159.0551 | 153.3579 | 158.1713 | | gernet_l | 128 | 165.0919 | 165.4252 | 193.6709 | 146.2249 | 149.7797 | 149.9718 | | poolformer_m36 | 64 | 174.8055 | 174.5228 | nan | nan | 149.6291 | 152.0586 | | mixer_b16_224 | 128 | 159.6004 | 159.159 | nan | 177.2017 | 148.6276 | 150.6306 | | coat_lite_mini | 128 | 191.5594 | 191.5675 | 226.6767 | 175.5827 | 141.9248 | 143.7604 | | pit_b_224 | 64 | 158.4785 | 158.5092 | nan | 153.3013 | 133.6831 | 134.9665 | | eca_halonext26ts | 128 | 169.2789 | 214.7912 | nan | nan | 133.4087 | 134.6564 | | eca_botnext26ts_256 | 128 | 163.5713 | 208.7791 | nan | nan | 127.6059 | 128.2994 | | gmlp_s16_224 | 128 | 152.101 | 152.04 | nan | 139.9808 | 125.6929 | 126.2023 | | res2net101_26w_4s | 64 | 152.0087 | 152.0614 | 197.4841 | 136.0594 | 123.5806 | 127.8966 | | visformer_small | 128 | 128.1 | 128.0013 | 160.8698 | nan | 123.0761 | 127.0384 | | fbnetv3_b | 128 | 162.5961 | 163.7521 | 206.1694 | 126.9552 | 122.0596 | 120.74 | | botnet26t_256 | 128 | 152.2556 | 152.4839 | 190.263 | nan | 118.2554 | 117.2602 | | twins_pcpvt_base | 64 | 137.0264 | 137.2377 | 185.0973 | nan | 116.8092 | 119.1883 | | beit_base_patch16_224 | 64 | 128.5479 | 131.0978 | nan | nan | 115.5966 | 116.8639 | | gmixer_24_224 | 128 | 146.3197 | 175.374 | nan | 135.2603 | 115.4243 | 116.1436 | | volo_d1_224 | 64 | 153.2517 | 154.0116 | 191.4399 | nan | 111.5953 | 112.6387 | | vit_base_patch16_224 | 64 | 120.2176 | 119.9657 | 156.3461 | 125.333 | 110.9227 | 111.787 | | deit_base_distilled_patch16_224 | 64 | 120.1365 | 120.741 | 157.1827 | 122.8021 | 110.0904 | 111.4881 | | 
repvgg_a2 | 128 | 127.3589 | 127.2248 | 146.4519 | 107.9294 | 104.7877 | 104.933 | | tf_efficientnet_b0 | 128 | 133.7842 | 167.0917 | nan | 112.2658 | 103.911 | 103.4093 | | xcit_large_24_p8_224 | 5 | 134.6216 | 136.7003 | 173.2239 | nan | 101.8037 | 104.3862 | | cspdarknet53 | 64 | 130.3965 | 130.7088 | 169.6577 | 106.4031 | 101.1782 | 100.5563 | | mobilevit_s | 64 | 117.3119 | 150.4998 | nan | nan | 98.239 | 97.899 | | jx_nest_base | 32 | 121.3549 | 121.6963 | 164.7993 | nan | 96.349 | 98.4107 | | rexnet_100 | 128 | 119.3545 | 142.504 | nan | 100.1945 | 95.8057 | 95.151 | | fbnetc_100 | 128 | 123.4822 | 123.9687 | 151.304 | 95.8472 | 95.3881 | 94.0096 | | tinynet_a | 128 | 109.8469 | 137.0396 | 171.3784 | 92.5564 | 89.3392 | 88.6875 | | sebotnet33ts_256 | 64 | 114.455 | 138.45 | nan | nan | 88.194 | 87.3988 | | spnasnet_100 | 128 | 105.9109 | 106.4053 | 131.8043 | 83.1489 | 82.4917 | 81.104 | | ese_vovnet19b_dw | 128 | 99.5839 | 99.7581 | 131.2122 | 84.797 | 78.5671 | 78.4204 | | mnasnet_100 | 128 | 98.5196 | 99.0892 | 121.3253 | 76.082 | 75.544 | 74.2888 | | crossvit_9_240 | 128 | 98.6088 | 98.4846 | 129.7678 | 95.1625 | 74.5992 | 75.5724 | | resmlp_12_224 | 128 | 71.1614 | 71.1433 | 102.3785 | 58.8905 | 74.3162 | 72.9419 | | selecsls42b | 128 | 89.5184 | 89.7606 | 109.8286 | 73.6446 | 70.8387 | 71.4246 | | mobilenetv2_100 | 128 | 97.7358 | 98.0826 | 133.7093 | 73.7471 | 70.734 | 69.5851 | | mobilenetv3_large_100 | 128 | 85.6432 | 86.0704 | 107.9556 | 64.1635 | 62.136 | 61.2752 | | ghostnet_100 | 128 | 114.7439 | 118.0288 | 138.8982 | 88.2232 | 61.4335 | 62.5754 | | regnety_002 | 128 | 53.2048 | 51.3036 | 60.6441 | 53.3676 | 35.1057 | 39.6063 | | lcnet_050 | 128 | 38.3827 | 38.6256 | 48.6943 | 28.5095 | 22.1033 | 24.4223 | | tnt_s_patch16_224 | 128 | 470.6009 | 470.5537 | nan | nan | nan | 312.2606 | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/8Vcm4MI.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/JnjlwUV.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/W71GCMX.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of a forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
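A minimal sketch of how a geometric mean speedup could be computed once models that fail accuracy are dropped. The tuple layout, the `geomean_speedup` name, and the choice to skip 0.0 speedups (their logarithm is undefined) are assumptions for illustration, not the dashboard's actual code.

```python
import math

def geomean_speedup(rows):
    """Geometric mean of speedups over models that pass accuracy.

    rows: iterable of (name, accuracy_status, speedup) tuples (hypothetical layout).
    """
    # Keep only passing models; also skip 0.0, which marks a failed perf run.
    kept = [s for _, status, s in rows if status == "pass" and s > 0.0]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(s) for s in kept) / len(kept))

rows = [
    ("model_a", "pass", 2.0),
    ("model_b", "pass", 8.0),
    ("model_c", "fail_accuracy", 100.0),  # excluded: fails accuracy
    ("model_d", "pass", 0.0),             # excluded: failed performance run
]
print(f"{geomean_speedup(rows):.2f}x")
```

A geometric mean is used here rather than an arithmetic one because speedups are ratios, so a 2x gain and a 0.5x loss cancel out.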

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 91%, 42/46  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 91%, 42/46  | 97%, 59/61  |
|     aot_cudagraphs     | 80%, 43/54 | 70%, 32/46  | 90%, 55/61  |
|    nvprims_nvfuser     | 56%, 30/54 |  7%, 3/46   | 52%, 32/61  |
|        inductor        | 81%, 44/54 | 83%, 38/46  | 87%, 53/61  |
| inductor_no_cudagraphs | 87%, 47/54 | 85%, 39/46  | 89%, 54/61  |
+------------------------+------------+-------------+-------------+
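The pass-rate cells above pair a whole-number percentage with a passed/total count. A small sketch of that formatting, assuming the percentage is rounded to the nearest integer (this matches the entries shown, but the helper name and rounding rule are my assumption):

```python
def passrate(passed, total):
    """Format a pass rate as '<percent>%, <passed>/<total>'."""
    return f"{100 * passed / total:.0f}%, {passed}/{total}"

# Counts taken from the table above.
print(passrate(52, 54))  # eager, torchbench
print(passrate(3, 46))   # nvprims_nvfuser, huggingface
```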

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.23x    |    1.01x    |    1.00x    |
|    nvprims_nvfuser     |   1.01x    |    1.11x    |    1.09x    |
|        inductor        |   1.65x    |    1.64x    |    1.18x    |
| inductor_no_cudagraphs |   1.28x    |    1.57x    |    1.15x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.01    |    2.92     |    2.21     |
|       aot_eager        |    6.54    |    10.47    |    8.63     |
|     aot_cudagraphs     |    9.63    |    17.35    |    15.90    |
|    nvprims_nvfuser     |   63.02    |   113.69    |   147.66    |
|        inductor        |   72.95    |    36.82    |    76.16    |
| inductor_no_cudagraphs |   68.94    |    33.11    |    74.29    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.83x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.41x    |    0.37x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.07x    |    0.86x    |
|        inductor        |   0.78x    |    0.92x    |    0.88x    |
| inductor_no_cudagraphs |   0.92x    |    1.07x    |    1.03x    |
+------------------------+------------+-------------+-------------+

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the 2 most recent reports that actually run the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_amp_195

Previous report name: /data/home/anijain/cluster/cron_logs/day_321_17_11_22_performance_amp_119

Passrate diff

~~~
+------------------------+-------------+------------+------------+
|        compiler        |    suite    | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
|        inductor        | torchbench  | 83%, 45/54 | 83%, 45/54 |
|        inductor        | huggingface | 90%, 38/42 | 90%, 38/42 |
|        inductor        | timm_models | 92%, 56/61 | 92%, 56/61 |
| inductor_no_cudagraphs | torchbench  | 87%, 47/54 | 87%, 47/54 |
| inductor_no_cudagraphs | huggingface | 90%, 38/42 | 90%, 38/42 |
| inductor_no_cudagraphs | timm_models | 92%, 56/61 | 92%, 56/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff

~~~
+------------------------+-------------+------------+-----------+
|        compiler        |    suite    | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
|        inductor        | torchbench  |   1.90x    |   1.89x   |
|        inductor        | huggingface |   1.82x    |   1.81x   |
|        inductor        | timm_models |   1.42x    |   1.43x   |
| inductor_no_cudagraphs | torchbench  |   1.38x    |   1.38x   |
| inductor_no_cudagraphs | huggingface |   1.57x    |   1.57x   |
| inductor_no_cudagraphs | timm_models |   1.37x    |   1.37x   |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:

- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings

~~~
+-------------+--------------------------------+---------------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+--------------------------------+---------------+------------------------+
| torchbench | hf_BigBird | fail_to_run | fail_to_run |
| torchbench | moco | fail_to_run | fail_to_run |
| torchbench | hf_Longformer | fail_to_run | fail_to_run |
| torchbench | dlrm | fail_to_run | pass |
| torchbench | timm_efficientdet | fail_to_run | fail_to_run |
| torchbench | functorch_dp_cifar10 | fail_to_run | fail_to_run |
| torchbench | tacotron2 | fail_to_run | pass |
| torchbench | mobilenet_v3_large | fail_accuracy | fail_accuracy |
| torchbench | tts_angular | 0.0000 | 0.0000 |
| torchbench | vision_maskrcnn | 0.0000 | 0.0000 |
| huggingface | DebertaV2ForQuestionAnswering | fail_to_run | pass |
| huggingface | PLBartForConditionalGeneration | fail_to_run | fail_to_run |
| huggingface | YituTechConvBert | fail_to_run | fail_to_run |
| huggingface | MBartForConditionalGeneration | fail_to_run | fail_to_run |
| huggingface | AllenaiLongformerBase | fail_to_run | fail_to_run |
| huggingface | DebertaV2ForMaskedLM | fail_to_run | fail_to_run |
| huggingface | BlenderbotForCausalLM | 0.0000 | 0.0000 |
| timm_models | eca_halonext26ts | fail_to_run | fail_to_run |
| timm_models | convit_base | fail_to_run | fail_to_run |
| timm_models | ese_vovnet19b_dw | fail_accuracy | pass |
| timm_models | ghostnet_100 | fail_accuracy | fail_accuracy |
| timm_models | gluon_xception65 | fail_accuracy | fail_accuracy |
| timm_models | resnest101e | fail_accuracy | fail_accuracy |
| timm_models | hrnet_w18 | fail_accuracy | fail_accuracy |
| timm_models | spnasnet_100 | fail_accuracy | fail_accuracy |
+-------------+--------------------------------+---------------+------------------------+
~~~

Performance speedup warnings

~~~
+-------------+-------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench | timm_vovnet | 1.0024 | 0.896 |
| torchbench | timm_regnet | 0.9343 | 0.8512 |
| torchbench | resnet50 | 0.9015 | 0.7594 |
| torchbench | mobilenet_v2 | 0.8572 | 0.8439 |
| torchbench | yolov3 | 0.8387 | 0.8093 |
| torchbench | hf_GPT2_large | 0.0 | 1.855 |
| torchbench | dlrm | 0.0 | 1.172 |
| torchbench | tacotron2 | 0.0 | 0.8797 |
| torchbench | functorch_dp_cifar10 | 0.0 | 0.0 |
| torchbench | hf_BigBird | 0.0 | 0.0 |
| torchbench | hf_Longformer | 0.0 | 0.0 |
| torchbench | moco | 0.0 | 0.0 |
| huggingface | DebertaV2ForMaskedLM | 1.2334 | 0.9081 |
| huggingface | DebertaV2ForQuestionAnswering | 1.0925 | 0.9445 |
| huggingface | BlenderbotForCausalLM | 0.0 | 1.1688 |
| huggingface | YituTechConvBert | 0.0 | 0.0 |
| huggingface | AllenaiLongformerBase | 0.0 | 0.0 |
| timm_models | poolformer_m36 | 0.9595 | 0.9491 |
| timm_models | fbnetv3_b | 0.9318 | 0.9635 |
| timm_models | selecsls42b | 0.8724 | 0.8617 |
| timm_models | dla102 | 0.8363 | 0.8176 |
| timm_models | cspdarknet53 | 0.8101 | 0.8097 |
| timm_models | gernet_l | 0.7873 | 0.7672 |
| timm_models | tf_efficientnet_b0 | 0.787 | 0.8935 |
| timm_models | res2net101_26w_4s | 0.784 | 0.7327 |
| timm_models | tinynet_a | 0.78 | 0.8259 |
| timm_models | mobilenetv2_100 | 0.7787 | 0.6304 |
| timm_models | dpn107 | 0.7719 | 0.7467 |
| timm_models | resnest101e | 0.7576 | 0.7694 |
| timm_models | gluon_xception65 | 0.7503 | 0.6438 |
| timm_models | repvgg_a2 | 0.7408 | 0.7729 |
| timm_models | mobilevit_s | 0.7378 | 0.7497 |
| timm_models | res2net50_14w_8s | 0.7271 | 0.6206 |
| timm_models | convmixer_768_32 | 0.7237 | 0.7143 |
| timm_models | visformer_small | 0.7106 | 0.698 |
| timm_models | sebotnet33ts_256 | 0.6872 | 0.7284 |
| timm_models | ese_vovnet19b_dw | 0.6602 | 0.6811 |
| timm_models | swsl_resnext101_32x16d | 0.6538 | 0.6637 |
| timm_models | eca_botnext26ts_256 | 0.6448 | 0.6238 |
| timm_models | rexnet_100 | 0.6445 | 0.762 |
| timm_models | res2next50 | 0.6241 | 0.565 |
| timm_models | botnet26t_256 | 0.604 | 0.642 |
| timm_models | eca_halonext26ts | 0.0 | 0.5691 |
+-------------+-------------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+-------------+-------------------------------+-----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+-----------+------------------------+
| torchbench | yolov3 | 1088.8255 | 1048.3046 |
| torchbench | densenet121 | 394.1798 | 395.6151 |
| torchbench | timm_efficientdet | 291.7389 | 287.7361 |
| torchbench | mobilenet_v3_large | 140.3122 | 140.6717 |
| torchbench | timm_efficientnet | 122.592 | 120.6361 |
| huggingface | DebertaV2ForMaskedLM | 179.9417 | 56.5487 |
| huggingface | DebertaV2ForQuestionAnswering | 179.5773 | 55.6554 |
| timm_models | hrnet_w18 | 207.4302 | 195.4759 |
| timm_models | pnasnet5large | 198.2433 | 192.3953 |
| timm_models | res2net50_14w_8s | 197.0282 | 194.0609 |
| timm_models | ghostnet_100 | 194.9782 | 194.4382 |
| timm_models | twins_pcpvt_base | 141.6654 | 139.2786 |
| timm_models | res2net101_26w_4s | 134.77 | 133.0188 |
| timm_models | dpn107 | 129.095 | 126.8431 |
| timm_models | rexnet_100 | 122.2382 | 121.3356 |
+-------------+-------------------------------+-----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+---------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+---------------------------------+----------+------------------------+
| torchbench | mobilenet_v2 | 0.885 | 1.0922 |
| torchbench | timm_vision_transformer_large | 0.879 | 0.9542 |
| torchbench | hf_Bert | 0.8735 | 0.942 |
| torchbench | Background_Matting | 0.8561 | 1.0424 |
| torchbench | hf_T5_large | 0.8541 | 0.8541 |
| torchbench | fastNLP_Bert | 0.8521 | 1.0681 |
| torchbench | hf_DistilBert | 0.8384 | 0.9048 |
| torchbench | timm_regnet | 0.836 | 1.0095 |
| torchbench | yolov3 | 0.8316 | 0.9829 |
| torchbench | hf_Bart | 0.8224 | 1.0097 |
| torchbench | shufflenet_v2_x1_0 | 0.8124 | 0.9457 |
| torchbench | resnet152 | 0.8057 | 0.9398 |
| torchbench | alexnet | 0.7973 | 1.0079 |
| torchbench | pytorch_unet | 0.7877 | 0.7907 |
| torchbench | pytorch_stargan | 0.7783 | 0.8847 |
| torchbench | vgg16 | 0.7633 | 1.0588 |
| torchbench | drq | 0.752 | 0.9256 |
| torchbench | soft_actor_critic | 0.7295 | 1.0368 |
| torchbench | timm_resnest | 0.7215 | 0.9566 |
| torchbench | timm_vision_transformer | 0.7151 | 0.7249 |
| torchbench | timm_vovnet | 0.6882 | 0.8809 |
| torchbench | resnet50 | 0.6745 | 0.8696 |
| torchbench | mnasnet1_0 | 0.659 | 0.7667 |
| torchbench | mobilenet_v3_large | 0.6583 | 0.811 |
| torchbench | resnext50_32x4d | 0.651 | 0.7706 |
| torchbench | squeezenet1_1 | 0.6336 | 0.7041 |
| torchbench | hf_Reformer | 0.5851 | 1.0016 |
| torchbench | lennard_jones | 0.564 | 0.9991 |
| torchbench | nvidia_deeprecommender | 0.5596 | 0.5596 |
| torchbench | resnet18 | 0.5498 | 0.6181 |
| torchbench | densenet121 | 0.5389 | 0.6133 |
| torchbench | LearningToPaint | 0.4882 | 0.6195 |
| torchbench | pytorch_struct | 0.4235 | 0.4353 |
| torchbench | dcgan | 0.2123 | 0.2137 |
| torchbench | dlrm | nan | 0.7306 |
| torchbench | tacotron2 | nan | 0.4112 |
| huggingface | DistilBertForMaskedLM | 0.8716 | 0.9439 |
| huggingface | Speech2Text2ForCausalLM | 0.8672 | 0.9793 |
| huggingface | ElectraForCausalLM | 0.856 | 0.9327 |
| huggingface | M2M100ForConditionalGeneration | 0.8468 | 1.04 |
| huggingface | BlenderbotSmallForCausalLM | 0.846 | 0.9426 |
| huggingface | XGLMForCausalLM | 0.8055 | 0.9902 |
| huggingface | MobileBertForMaskedLM | 0.6698 | 0.9649 |
| huggingface | DebertaV2ForMaskedLM | 0.6117 | 0.9912 |
| huggingface | MobileBertForQuestionAnswering | 0.5988 | 0.8126 |
| huggingface | Reformer | 0.5813 | 1.0027 |
| huggingface | DebertaV2ForQuestionAnswering | 0.5266 | 0.9885 |
| huggingface | DebertaForMaskedLM | 0.409 | 1.0674 |
| huggingface | DebertaForQuestionAnswering | 0.3071 | 1.1614 |
| timm_models | mobilenetv2_100 | 0.8962 | 1.1046 |
| timm_models | vit_base_patch16_224 | 0.8916 | 0.8968 |
| timm_models | deit_base_distilled_patch16_224 | 0.8911 | 0.8962 |
| timm_models | mixnet_l | 0.8815 | 0.98 |
| timm_models | eca_botnext26ts_256 | 0.8765 | 1.1944 |
| timm_models | dla102 | 0.8723 | 1.0162 |
| timm_models | fbnetv3_b | 0.8648 | 1.0056 |
| timm_models | adv_inception_v3 | 0.8599 | 0.9862 |
| timm_models | inception_v3 | 0.8599 | 0.9862 |
| timm_models | gluon_inception_v3 | 0.8599 | 0.9862 |
| timm_models | swsl_resnext101_32x16d | 0.852 | 0.9728 |
| timm_models | dpn107 | 0.8455 | 0.944 |
| timm_models | gluon_xception65 | 0.8442 | 0.965 |
| timm_models | cspdarknet53 | 0.8368 | 0.9122 |
| timm_models | crossvit_9_240 | 0.8174 | 1.0976 |
| timm_models | res2net101_26w_4s | 0.8146 | 0.9442 |
| timm_models | resmlp_12_224 | 0.8092 | 0.8239 |
| timm_models | ese_vovnet19b_dw | 0.8041 | 1.0135 |
| timm_models | convnext_base | 0.8022 | 1.0085 |
| timm_models | selecsls42b | 0.7927 | 0.9534 |
| timm_models | spnasnet_100 | 0.787 | 0.9293 |
| timm_models | coat_lite_mini | 0.7861 | 1.0072 |
| timm_models | mnasnet_100 | 0.7727 | 0.9234 |
| timm_models | res2net50_14w_8s | 0.7713 | 0.9528 |
| timm_models | ghostnet_100 | 0.7707 | 1.0052 |
| timm_models | res2next50 | 0.7697 | 0.9414 |
| timm_models | hrnet_w18 | 0.7605 | 0.942 |
| timm_models | swin_base_patch4_window7_224 | 0.7566 | 0.9257 |
| timm_models | mobilenetv3_large_100 | 0.75 | 0.9635 |
| timm_models | sebotnet33ts_256 | 0.7318 | 0.8133 |
| timm_models | gernet_l | 0.7239 | 0.9334 |
| timm_models | fbnetc_100 | 0.7101 | 0.9306 |
| timm_models | lcnet_050 | 0.6955 | 0.8352 |
| timm_models | jx_nest_base | 0.6706 | 0.8617 |
| timm_models | botnet26t_256 | 0.6615 | 0.9434 |
| timm_models | regnety_002 | 0.5858 | 0.8993 |
| timm_models | repvgg_a2 | 0.5572 | 0.8383 |
+-------------+---------------------------------+----------+------------------------+
~~~
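The flagging rules listed under Warnings can be sketched as a single check per model and compiler. The thresholds come from the text above; the function name, argument layout, and the rule that any status starting with `pass` (for example `pass_due_to_skip`) counts as passing are assumptions for illustration. `nan` values compare as false and are therefore not flagged by this sketch.

```python
SPEEDUP_MIN = 0.95        # flag if speedup < 0.95x
COMPILE_MAX_SEC = 120.0   # flag if compilation latency > 120 sec
COMPRESSION_MIN = 0.9     # flag if compression ratio < 0.9

def warnings_for(accuracy, speedup, compile_sec, compression):
    """Return the list of warning categories a single result triggers."""
    flags = []
    if not accuracy.startswith("pass"):  # assumption: 'pass*' means passing
        flags.append("accuracy")
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_sec > COMPILE_MAX_SEC:
        flags.append("compilation_latency")
    if compression < COMPRESSION_MIN:
        flags.append("compression_ratio")
    return flags

# torchbench yolov3 under inductor, using the values reported above.
print(warnings_for("pass", 0.8387, 1088.8255, 0.8316))
```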

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_amp_195

Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_amp_195

Accuracy regressions

~~~
+------------------------+----------------------+-------------+-------------+
| compiler | name | prev_status | cur_status |
+------------------------+----------------------+-------------+-------------+
| inductor | functorch_dp_cifar10 | pass | fail_to_run |
| inductor_no_cudagraphs | functorch_dp_cifar10 | pass | fail_to_run |
+------------------------+----------------------+-------------+-------------+
~~~

Performance speedup regressions

~~~
+------------------------+----------------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+------------------------+----------------------+-------------+------------+
| inductor | timm_regnet | 1.4274 | 0.9343 |
| inductor | resnet50 | 1.9102 | 0.9015 |
| inductor | mobilenet_v2 | 1.5615 | 0.8572 |
| inductor | yolov3 | 1.0901 | 0.8387 |
| inductor | functorch_dp_cifar10 | 4.948 | 0.0 |
| inductor_no_cudagraphs | timm_vovnet | 1.1448 | 0.896 |
| inductor_no_cudagraphs | timm_regnet | 1.2341 | 0.8512 |
| inductor_no_cudagraphs | mobilenet_v2 | 1.5186 | 0.8439 |
| inductor_no_cudagraphs | yolov3 | 1.0664 | 0.8093 |
| inductor_no_cudagraphs | resnet50 | 1.3505 | 0.7594 |
| inductor_no_cudagraphs | functorch_dp_cifar10 | 1.3429 | 0.0 |
+------------------------+----------------------+-------------+------------+
~~~

Compilation latency (sec) regressions

~~~
+------------------------+--------------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+------------------------+--------------------+-------------+------------+
| inductor | densenet121 | 50.8713 | 394.1798 |
| inductor | mobilenet_v3_large | 26.1886 | 140.3122 |
| inductor | timm_efficientnet | 26.8783 | 122.592 |
| inductor_no_cudagraphs | densenet121 | 50.4114 | 395.6151 |
| inductor_no_cudagraphs | mobilenet_v3_large | 25.8204 | 140.6717 |
| inductor_no_cudagraphs | timm_efficientnet | 26.3009 | 120.6361 |
+------------------------+--------------------+-------------+------------+
~~~

Peak Memory Compression Ratio regressions

~~~
+------------------------+-----------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+------------------------+-----------------+-------------+------------+
| inductor | mobilenet_v2 | 1.0606 | 0.885 |
| inductor | timm_regnet | 0.9344 | 0.836 |
| inductor | yolov3 | 0.9063 | 0.8316 |
| inductor | resnet152 | 0.9123 | 0.8057 |
| inductor | pytorch_unet | 0.9113 | 0.7877 |
| inductor_no_cudagraphs | pytorch_unet | 1.0853 | 0.7907 |
| inductor_no_cudagraphs | squeezenet1_1 | 1.0608 | 0.7041 |
| inductor_no_cudagraphs | LearningToPaint | 0.925 | 0.6195 |
| inductor_no_cudagraphs | densenet121 | 1.0051 | 0.6133 |
+------------------------+-----------------+-------------+------------+
~~~

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_amp_195

Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_amp_195

Accuracy regressions

~~~
+------------------------+------------------+-------------+-------------+
| compiler | name | prev_status | cur_status |
+------------------------+------------------+-------------+-------------+
| inductor | YituTechConvBert | pass | fail_to_run |
| inductor_no_cudagraphs | YituTechConvBert | pass | fail_to_run |
+------------------------+------------------+-------------+-------------+
~~~

Performance speedup regressions

~~~
+------------------------+------------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+------------------------+------------------+-------------+------------+
| inductor | YituTechConvBert | 4.8043 | 0.0 |
| inductor_no_cudagraphs | YituTechConvBert | 1.6603 | 0.0 |
+------------------------+------------------+-------------+------------+
~~~

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_amp_195

Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_amp_195

Accuracy regressions

~~~
+------------------------+------------------+-------------+---------------+
| compiler | name | prev_status | cur_status |
+------------------------+------------------+-------------+---------------+
| inductor | ese_vovnet19b_dw | pass | fail_accuracy |
| inductor | ghostnet_100 | pass | fail_accuracy |
| inductor | hrnet_w18 | pass | fail_accuracy |
| inductor | resnest101e | pass | fail_accuracy |
| inductor_no_cudagraphs | ghostnet_100 | pass | fail_accuracy |
| inductor_no_cudagraphs | hrnet_w18 | pass | fail_accuracy |
| inductor_no_cudagraphs | resnest101e | pass | fail_accuracy |
+------------------------+------------------+-------------+---------------+
~~~

Performance speedup regressions

~~~
+------------------------+------------------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+------------------------+------------------------+-------------+------------+
| inductor | fbnetv3_b | 1.41 | 0.9318 |
| inductor | selecsls42b | 1.4432 | 0.8724 |
| inductor | dla102 | 1.58 | 0.8363 |
| inductor | cspdarknet53 | 1.3079 | 0.8101 |
| inductor | gernet_l | 1.067 | 0.7873 |
| inductor | tf_efficientnet_b0 | 1.3509 | 0.787 |
| inductor | res2net101_26w_4s | 1.608 | 0.784 |
| inductor | tinynet_a | 1.2581 | 0.78 |
| inductor | mobilenetv2_100 | 1.4039 | 0.7787 |
| inductor | dpn107 | 1.1539 | 0.7719 |
| inductor | resnest101e | 1.5947 | 0.7576 |
| inductor | gluon_xception65 | 1.1612 | 0.7503 |
| inductor | repvgg_a2 | 1.1047 | 0.7408 |
| inductor | mobilevit_s | 1.389 | 0.7378 |
| inductor | res2net50_14w_8s | 1.4766 | 0.7271 |
| inductor | convmixer_768_32 | 1.0557 | 0.7237 |
| inductor | visformer_small | 1.2299 | 0.7106 |
| inductor | sebotnet33ts_256 | 1.1979 | 0.6872 |
| inductor | ese_vovnet19b_dw | 1.3711 | 0.6602 |
| inductor | swsl_resnext101_32x16d | 1.1326 | 0.6538 |
| inductor | eca_botnext26ts_256 | 1.2759 | 0.6448 |
| inductor | rexnet_100 | 1.2756 | 0.6445 |
| inductor | res2next50 | 1.4181 | 0.6241 |
| inductor | botnet26t_256 | 1.3282 | 0.604 |
| inductor_no_cudagraphs | poolformer_m36 | 1.2964 | 0.9491 |
| inductor_no_cudagraphs | tf_efficientnet_b0 | 1.3555 | 0.8935 |
| inductor_no_cudagraphs | selecsls42b | 1.4124 | 0.8617 |
| inductor_no_cudagraphs | tinynet_a | 1.2676 | 0.8259 |
| inductor_no_cudagraphs | dla102 | 1.5506 | 0.8176 |
| inductor_no_cudagraphs | cspdarknet53 | 1.3235 | 0.8097 |
| inductor_no_cudagraphs | repvgg_a2 | 1.12 | 0.7729 |
| inductor_no_cudagraphs | resnest101e | 1.4267 | 0.7694 |
| inductor_no_cudagraphs | gernet_l | 1.0778 | 0.7672 |
| inductor_no_cudagraphs | rexnet_100 | 1.2781 | 0.762 |
| inductor_no_cudagraphs | mobilevit_s | 1.3759 | 0.7497 |
| inductor_no_cudagraphs | dpn107 | 1.1799 | 0.7467 |
| inductor_no_cudagraphs | res2net101_26w_4s | 1.3248 | 0.7327 |
| inductor_no_cudagraphs | sebotnet33ts_256 | 1.2027 | 0.7284 |
| inductor_no_cudagraphs | convmixer_768_32 | 1.0508 | 0.7143 |
| inductor_no_cudagraphs | visformer_small | 1.1687 | 0.698 |
| inductor_no_cudagraphs | ese_vovnet19b_dw | 1.3779 | 0.6811 |
| inductor_no_cudagraphs | swsl_resnext101_32x16d | 1.0573 | 0.6637 |
| inductor_no_cudagraphs | gluon_xception65 | 1.1259 | 0.6438 |
| inductor_no_cudagraphs | botnet26t_256 | 1.3329 | 0.642 |
| inductor_no_cudagraphs | mobilenetv2_100 | 1.4317 | 0.6304 |
| inductor_no_cudagraphs | eca_botnext26ts_256 | 1.2714 | 0.6238 |
| inductor_no_cudagraphs | res2net50_14w_8s | 1.4033 | 0.6206 |
| inductor_no_cudagraphs | res2next50 | 1.3459 | 0.565 |
+------------------------+------------------------+-------------+------------+
~~~

Compilation latency (sec) regressions

~~~
+------------------------+-------------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+------------------------+-------------------+-------------+------------+
| inductor | pnasnet5large | 90.6025 | 198.2433 |
| inductor | res2net50_14w_8s | 56.6981 | 197.0282 |
| inductor | ghostnet_100 | 38.1693 | 194.9782 |
| inductor | res2net101_26w_4s | 62.687 | 134.77 |
| inductor | dpn107 | 46.5527 | 129.095 |
| inductor | rexnet_100 | 31.2198 | 122.2382 |
| inductor_no_cudagraphs | ghostnet_100 | 36.3664 | 194.4382 |
| inductor_no_cudagraphs | res2net50_14w_8s | 54.3689 | 194.0609 |
| inductor_no_cudagraphs | pnasnet5large | 87.3532 | 192.3953 |
| inductor_no_cudagraphs | res2net101_26w_4s | 59.32 | 133.0188 |
| inductor_no_cudagraphs | dpn107 | 44.843 | 126.8431 |
| inductor_no_cudagraphs | rexnet_100 | 29.4947 | 121.3356 |
+------------------------+-------------------+-------------+------------+
~~~

Peak Memory Compression Ratio regressions

~~~
+------------------------+------------------------+-------------+------------+
| compiler | name | prev_status | cur_status |
+------------------------+------------------------+-------------+------------+
| inductor | mobilenetv2_100 | 1.0587 | 0.8962 |
| inductor | eca_botnext26ts_256 | 1.1067 | 0.8765 |
| inductor | dla102 | 0.9555 | 0.8723 |
| inductor | fbnetv3_b | 0.9862 | 0.8648 |
| inductor | swsl_resnext101_32x16d | 0.9111 | 0.852 |
| inductor | dpn107 | 0.9069 | 0.8455 |
| inductor | ese_vovnet19b_dw | 0.9181 | 0.8041 |
| inductor | ghostnet_100 | 0.9489 | 0.7707 |
| inductor | mobilenetv3_large_100 | 0.9307 | 0.75 |
| inductor_no_cudagraphs | regnety_002 | 1.0078 | 0.8993 |
| inductor_no_cudagraphs | lcnet_050 | 0.9432 | 0.8352 |
+------------------------+------------------------+-------------+------------+
~~~
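The regression rule described at the top of this section (a model counts only if it was unflagged in the previous report and is flagged in the current one) could be sketched as below. The `newly_flagged` helper and the dict layout are hypothetical, and `hypothetical_model` is an invented entry that shows an already-flagged model being skipped; the other values come from the torchbench inductor speedup table above.

```python
def newly_flagged(prev, cur, is_flagged):
    """Return {model: (prev_value, cur_value)} for models flagged now but not before."""
    return {
        name: (prev[name], cur[name])
        for name in cur
        if name in prev and is_flagged(cur[name]) and not is_flagged(prev[name])
    }

def low_speedup(speedup):
    # Threshold from the Warnings section: speedup < 0.95x is flagged.
    return speedup < 0.95

prev = {"timm_regnet": 1.4274, "resnet50": 1.9102, "hypothetical_model": 0.90}
cur = {"timm_regnet": 0.9343, "resnet50": 0.9015, "hypothetical_model": 0.85}

regressions = newly_flagged(prev, cur, low_speedup)
for name, (before, after) in regressions.items():
    print(f"{name}: {before} -> {after}")
```

Passing the predicate as an argument lets the same comparison serve the accuracy, compilation latency, and compression ratio checks.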

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0032 | 0.8973 | 2.4106 | 0.7369 | 5.6204 | 1.1364 | | timm_efficientdet | 1 | 0.9835 | 0.81 | 2.0523 | 0.0 | 3.9882 | 1.3947 | | BERT_pytorch | 16 | 1.0123 | 0.8453 | 1.5575 | 0.78 | 3.511 | 2.3949 | | timm_vision_transformer | 8 | 1.0064 | 0.8484 | 1.9027 | 0.5945 | 3.0498 | 1.5793 | | drq | 1 | 1.0006 | 0.8202 | 1.939 | 0.6008 | 3.0004 | 1.1599 | | dcgan | 32 | 0.9831 | 0.9381 | 1.6019 | 0.7105 | 2.8148 | 1.0516 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9976 | 0.984 | 1.7261 | 0.0 | 2.6468 | 1.5523 | | resnet18 | 16 | 1.0017 | 1.005 | 1.6167 | 0.8 | 2.5853 | 1.2099 | | hf_T5_large | 2 | 1.0216 | 0.859 | 0.0 | 0.0 | 2.548 | 2.2612 | | hf_Albert | 8 | 1.0011 | 0.9561 | 0.7746 | 0.0 | 2.3803 | 2.3244 | | resnext50_32x4d | 8 | 1.0001 | 0.9587 | 1.7564 | 0.7472 | 2.3484 | 1.0267 | | mobilenet_v3_large | 32 | 1.0025 | 0.9985 | 1.4737 | 0.7676 | 2.3386 | 1.1367 | | squeezenet1_1 | 32 | 0.9891 | 0.9623 | 1.4439 | 0.7263 | 2.2359 | 1.1646 | | pytorch_struct | 200 | 0.988 | 0.7437 | 0.9921 | 0.5963 | 2.0822 | 1.2724 | | lennard_jones | 1000 | 0.9637 | 0.7691 | 1.2759 | 0.472 | 2.0479 | 1.0465 | | hf_Bert | 4 | 1.0374 | 0.857 | 0.9291 | 0.0 | 2.0208 | 1.8569 | | hf_GPT2 | 4 | 1.0197 | 0.9801 | 0.8091 | 0.2863 | 1.9455 | 1.8805 | | hf_T5 | 8 | 0.9981 | 0.9261 | 0.0 | 1.3573 | 1.8788 | 1.8871 | | mnasnet1_0 | 32 | 0.9973 | 1.0159 | 1.2472 | 0.7703 | 1.8243 | 1.1142 | | hf_Bart | 4 | 1.0119 | 0.8405 | 0.8675 | 0.0 | 1.7467 | 1.7006 | | LearningToPaint | 96 | 1.0021 | 1.0141 | 1.1406 | 0.8514 | 1.7005 | 1.2183 | | soft_actor_critic | 256 | 0.9758 | 0.7394 | 
1.2632 | 0.5286 | 1.6898 | 1.0402 | | shufflenet_v2_x1_0 | 128 | 1.0013 | 1.0133 | 1.0311 | 0.8614 | 1.5534 | 1.2443 | | speech_transformer | 32 | 0.9954 | 0.8461 | 1.7026 | 0.6664 | 1.5393 | 1.5751 | | attention_is_all_you_need_pytorch | 256 | 1.0097 | 0.9155 | 0.8256 | 0.0 | 1.5358 | 1.481 | | fastNLP_Bert | 6 | 0.9989 | 0.9063 | 0.7657 | 0.0 | 1.5293 | 1.4737 | | hf_DistilBert | 8 | 1.0 | 0.9711 | 0.7438 | 0.3624 | 1.5084 | 1.4909 | | timm_efficientnet | 32 | 0.9627 | 0.7976 | 1.0834 | 0.7016 | 1.5023 | 1.0055 | | pytorch_stargan | 16 | 0.9964 | 1.1033 | 1.0267 | 0.0 | 1.4457 | 1.4104 | | pytorch_unet | 1 | 0.9991 | 0.2115 | 0.0 | 0.0 | 1.3488 | 1.2839 | | timm_nfnet | 128 | 0.999 | 0.9999 | 0.8811 | 0.9217 | 1.3011 | 1.233 | | vgg16 | 64 | 0.9994 | 0.9973 | 0.8576 | 0.9729 | 1.2701 | 1.2579 | | Super_SloMo | 6 | 0.9991 | 0.1768 | 0.0 | 0.0 | 1.2265 | 1.1961 | | alexnet | 128 | 0.9984 | 0.9974 | 0.8147 | 0.9374 | 1.2099 | 1.2052 | | hf_Reformer | 4 | 0.9977 | 1.0004 | 0.9921 | 0.6545 | 1.1776 | 1.1823 | | resnet152 | 32 | 1.0016 | 0.9959 | 1.2713 | 0.0 | 1.1548 | 0.975 | | Background_Matting | 4 | 1.0004 | 0.1454 | 0.0 | 0.0 | 1.1238 | 1.1011 | | timm_resnest | 32 | 1.0036 | 1.0189 | 0.855 | 0.9615 | 1.1185 | 1.0186 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9904 | 0.0 | 0.0 | 1.1105 | 1.0906 | | tts_angular | 64 | 0.9768 | 0.9418 | 0.9826 | 0.9537 | 1.0142 | 1.0115 | | timm_vovnet | 32 | 0.9198 | 0.894 | 0.8617 | 0.7987 | 1.0024 | 0.896 | | demucs | 4 | 1.0002 | 1.0006 | 1.0011 | 1.0012 | 1.0004 | 1.0008 | | nvidia_deeprecommender | 256 | 0.9988 | 0.9968 | 0.6966 | 1.0076 | 0.9902 | 1.0312 | | timm_regnet | 32 | 0.9806 | 0.9391 | 0.8812 | 0.7833 | 0.9343 | 0.8512 | | resnet50 | 32 | 1.0001 | 1.0122 | 1.0254 | 0.81 | 0.9015 | 0.7594 | | mobilenet_v2 | 96 | 0.9999 | 0.9838 | 0.7598 | 1.0318 | 0.8572 | 0.8439 | | yolov3 | 16 | 0.9995 | 0.9907 | 0.8039 | 0.0 | 0.8387 | 0.8093 | | hf_GPT2_large | 4 | 1.0 | 0.9901 | 0.0 | 0.0 | 0.0 | 1.855 | | dlrm | 2048 | 
1.0926 | 1.2024 | 0.0 | 1.0392 | 0.0 | 1.172 | | tacotron2 | 64 | 0.968 | 0.7629 | 0.9789 | 0.6089 | 0.0 | 0.8797 | | functorch_dp_cifar10 | 64 | 0.9995 | 0.9565 | 2.3539 | 0.0 | 0.0 | 0.0 | | hf_BigBird | 2 | 0.9586 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | 
pass | fail_to_run | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | 0.0000 | fail_to_run | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | 
pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | dlrm | 2 | pass | pass | 0.0000 | pass | fail_to_run | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+ | yolov3 | 16 | 3.0703 | 8.2373 | 11.6957 | nan | 1088.8255 | 1048.3046 | | densenet121 | 4 | 2.4627 | 11.801 | 19.0347 | 239.6507 | 394.1798 | 395.6151 | | timm_efficientdet | 1 | 19.8766 | 37.078 | 75.6644 | nan | 291.7389 | 287.7361 | | mobilenet_v3_large | 32 | 1.0554 | 4.9123 | 7.3759 | 117.7416 | 140.3122 | 140.6717 | | timm_efficientnet | 32 | 1.9271 | 
6.9126 | 15.6526 | 150.6537 | 122.592 | 120.6361 | | hf_T5_large | 2 | 14.6254 | 38.8398 | nan | nan | 117.9808 | 114.8043 | | mnasnet1_0 | 32 | 0.9645 | 4.3118 | 6.4726 | 83.3975 | 117.6509 | 115.4495 | | resnet152 | 32 | 2.7373 | 13.3395 | 21.7868 | nan | 103.7343 | 102.9433 | | resnext50_32x4d | 8 | 1.0085 | 4.5905 | 7.3229 | 83.7305 | 102.9248 | 102.9912 | | timm_vovnet | 32 | 1.5747 | 4.4188 | 9.934 | 71.1587 | 91.9078 | 91.074 | | mobilenet_v2 | 96 | 0.9194 | 4.7225 | 6.9923 | 114.0968 | 77.163 | 77.7194 | | shufflenet_v2_x1_0 | 128 | 1.1167 | 4.9928 | 7.8076 | 106.5581 | 76.6372 | 75.3444 | | resnet50 | 32 | 1.0181 | 4.6478 | 6.7744 | 95.9692 | 71.8614 | 71.4522 | | timm_vision_transformer_large | 8 | 3.0796 | 15.0099 | nan | nan | 69.225 | 69.0787 | | timm_nfnet | 128 | 2.177 | 7.0819 | 10.5757 | 157.8372 | 63.4377 | 63.0251 | | timm_regnet | 32 | 2.4578 | 8.0646 | 19.3589 | 141.8875 | 63.2732 | 62.5457 | | timm_resnest | 32 | 0.6603 | 2.5069 | 3.9352 | 67.4725 | 60.2162 | 59.7749 | | squeezenet1_1 | 32 | 0.2946 | 0.9399 | 1.4019 | 6.8409 | 52.3368 | 53.4714 | | resnet18 | 16 | 0.4697 | 1.8141 | 2.637 | 39.5448 | 41.6169 | 40.1477 | | LearningToPaint | 96 | 0.4919 | 1.9547 | 2.8543 | 47.2209 | 34.7237 | 33.652 | | timm_vision_transformer | 8 | 1.0406 | 4.5035 | 7.3657 | 84.054 | 33.4958 | 33.2763 | | hf_Bart | 4 | 2.0109 | 8.9614 | 13.7963 | nan | 33.1825 | 32.2557 | | BERT_pytorch | 16 | 1.8247 | 7.5907 | 11.5148 | 106.8255 | 31.7728 | 30.9695 | | attention_is_all_you_need_pytorch | 256 | 1.3937 | 7.1127 | 11.0388 | nan | 31.0041 | 30.1091 | | hf_T5 | 8 | 2.7072 | 8.5968 | nan | 91.1681 | 30.1666 | 29.6199 | | fastNLP_Bert | 6 | 1.887 | 7.0775 | 11.289 | nan | 29.327 | 26.9558 | | Background_Matting | 4 | 1.0516 | 8.9478 | nan | nan | 29.1884 | 28.9858 | | speech_transformer | 32 | 2.0592 | 9.0243 | 33.0474 | 185.743 | 28.0713 | 25.6446 | | pytorch_stargan | 16 | 0.4801 | 1.9622 | 2.8312 | nan | 27.4683 | 25.9921 | | pytorch_struct | 200 | 0.2866 | 0.8558 
| 1.4416 | 7.815 | 22.4701 | 22.2789 | | Super_SloMo | 6 | 1.1363 | 7.328 | nan | nan | 20.5861 | 20.0211 | | hf_Bert | 4 | 1.8825 | 7.0359 | 10.1675 | nan | 20.5403 | 19.9379 | | hf_GPT2 | 4 | 1.7923 | 6.301 | 9.3181 | 83.7297 | 20.1995 | 19.7712 | | hf_Albert | 8 | 1.6501 | 6.3907 | 10.06 | nan | 19.8019 | 18.8967 | | hf_Reformer | 4 | 1.686 | 2.8757 | 5.3479 | 16.6017 | 18.0204 | 15.5569 | | hf_DistilBert | 8 | 0.8245 | 3.3928 | 5.9676 | 56.1091 | 13.8504 | 13.9387 | | pytorch_unet | 1 | 0.5022 | 3.0715 | nan | nan | 11.5569 | 11.3212 | | dcgan | 32 | 0.181 | 0.426 | 0.6654 | 4.9766 | 9.7201 | 9.9064 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.5035 | 1.9668 | 2.7584 | nan | 8.6595 | 8.5103 | | drq | 1 | 0.3427 | 0.6394 | 1.0482 | 6.3559 | 4.0712 | 3.4964 | | vgg16 | 64 | 0.1932 | 0.667 | 1.0097 | 5.4176 | 3.8737 | 3.7024 | | nvidia_deeprecommender | 256 | 0.2175 | 0.5251 | 0.8005 | 5.5807 | 3.609 | 3.2091 | | alexnet | 128 | 0.1739 | 0.4449 | 0.7045 | 4.7559 | 3.2653 | 2.9997 | | soft_actor_critic | 256 | 0.2276 | 0.3775 | 0.5995 | 3.2729 | 3.1751 | 2.8644 | | lennard_jones | 1000 | 0.1617 | 0.3516 | 0.5319 | 2.92 | 2.1806 | 1.9501 | | tts_angular | 64 | 0.192 | 0.2375 | 0.3471 | 1.4713 | 1.8152 | 1.7521 | | demucs | 4 | 0.3513 | 0.3503 | 0.3343 | 0.3525 | 0.241 | 0.26 | | hf_GPT2_large | 4 | 5.837 | 19.8062 | nan | nan | nan | 53.1803 | | tacotron2 | 64 | 4.8793 | 18.3821 | 32.8293 | 88.9275 | nan | 43.1028 | | dlrm | 2048 | 0.47 | 0.8317 | nan | 4.7311 | nan | 3.2387 | | functorch_dp_cifar10 | 64 | 0.3472 | 1.4255 | 2.0987 | nan | nan | nan | | hf_BigBird | 2 | 4.0366 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | hf_Albert | 8 | 1.0001 | 0.936 | 0.3268 | nan | 1.1576 | 1.4693 | | speech_transformer | 32 | 0.9991 | 0.9812 | 0.3343 | 1.1938 | 1.0923 | 1.0993 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3513 | nan | 1.024 | 1.176 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3084 | nan | 0.9837 | 1.1225 | | BERT_pytorch | 16 | 1.0003 | 0.8822 | 0.4002 | 1.1118 | 0.9743 | 1.1226 | | timm_efficientnet | 32 | 0.988 | 0.7698 | 0.272 | 0.4638 | 0.9696 | 1.2228 | | hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.3799 | 1.1204 | 0.9649 | 1.1241 | | Super_SloMo | 6 | 1.0024 | 0.8284 | nan | nan | 0.9647 | 1.2945 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.999 | 0.8637 | 0.4232 | nan | 0.9506 | 1.0489 | | hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.0304 | 0.9309 | 1.252 | | timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3557 | 0.4815 | 0.9298 | 1.0969 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 0.885 | 1.0922 | | timm_vision_transformer_large | 8 | 0.9974 | 0.8357 | nan | nan | 0.879 | 0.9542 | | hf_Bert | 4 | 1.0 | 0.8759 | 0.3903 | nan | 0.8735 | 0.942 | | Background_Matting | 4 | 1.0138 | 0.6522 | nan | nan | 0.8561 | 1.0424 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8521 | 1.0681 | | hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3413 | 1.0708 | 0.8384 | 0.9048 | | timm_regnet | 32 | 0.9953 | 0.8446 | 0.3493 | 0.8027 | 0.836 | 1.0095 | | yolov3 | 16 | 0.9908 | 0.8381 | 0.3538 | nan | 0.8316 
| 0.9829 | | hf_Bart | 4 | 1.0002 | 0.8307 | 0.3634 | nan | 0.8224 | 1.0097 | | shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8124 | 0.9457 | | resnet152 | 32 | 0.9937 | 0.8956 | 0.3632 | nan | 0.8057 | 0.9398 | | alexnet | 128 | 0.951 | 0.7753 | 0.4793 | 0.775 | 0.7973 | 1.0079 | | pytorch_unet | 1 | 0.9968 | 0.7229 | nan | nan | 0.7877 | 0.7907 | | pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4252 | nan | 0.7783 | 0.8847 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.734 | 0.7633 | 1.0588 | | drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 | | timm_resnest | 32 | 0.9868 | 0.8711 | 0.3481 | 0.8451 | 0.7215 | 0.9566 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.392 | 1.0881 | 0.7151 | 0.7249 | | timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3409 | 0.7755 | 0.6882 | 0.8809 | | resnet50 | 32 | 0.9907 | 0.8629 | 0.3562 | 0.7806 | 0.6745 | 0.8696 | | mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3407 | 0.8226 | 0.659 | 0.7667 | | mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3449 | 0.7921 | 0.6583 | 0.811 | | resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3888 | 0.81 | 0.651 | 0.7706 | | squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.6336 | 0.7041 | | hf_Reformer | 4 | 0.9996 | 1.0 | 0.6037 | 0.9999 | 0.5851 | 1.0016 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5124 | 0.5596 | 0.5596 | 0.5596 | | resnet18 | 16 | 0.9779 | 0.7727 | 0.3943 | 0.7314 | 0.5498 | 0.6181 | | densenet121 | 4 | 0.9857 | 0.8678 | 0.3673 | 0.8452 | 0.5389 | 0.6133 | | LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3826 | 0.6701 | 0.4882 | 0.6195 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 | | dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.7838 | 0.2123 | 0.2137 | | hf_GPT2_large | 4 | 0.9956 | 0.8732 | nan | nan | nan | 1.1499 | | dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | nan 
| 0.7306 | | tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3993 | nan | 0.4112 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4447 | nan | nan | nan | | hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | timm_vision_transformer_large | 8 | 184.7912 | 186.2848 | nan | nan | 165.7666 | 169.1097 | | Background_Matting | 4 | 134.3174 | 919.1648 | nan | nan | 118.8743 | 121.228 | | timm_nfnet | 128 | 131.939 | 131.1808 | 149.0405 | 143.0828 | 101.2367 | 106.3453 | | hf_T5 | 8 | 175.0195 | 188.2524 | nan | 128.5646 | 92.7572 | 92.4103 | | hf_T5_large | 2 | 215.109 | 256.2903 | nan | nan | 88.9437 | 115.669 | | yolov3 | 16 | 68.6284 | 69.2721 | 85.2401 | nan | 81.7956 | 84.7458 | | resnet152 | 32 | 90.4589 | 90.6057 | 72.9706 | nan | 80.7078 | 97.2061 | | timm_regnet | 32 | 81.3746 | 76.844 | 81.0227 | 91.4673 | 76.8029 | 84.7808 | | hf_Reformer | 4 | 82.3314 | 82.2473 | 82.8587 | 125.3204 | 69.9144 | 69.4545 | | Super_SloMo | 6 | 79.6322 | 448.8646 | nan | nan | 64.824 | 66.3869 | | mobilenet_v2 | 96 | 48.94 | 49.7493 | 64.4389 | 47.3393 | 58.3001 | 59.2209 | | demucs | 4 | 58.6076 | 57.0384 | 57.352 | 57.3484 | 57.2896 | 57.4658 | | vgg16 | 64 | 66.0873 | 66.5165 | 77.3867 | 68.1205 | 52.0435 | 52.5162 | | speech_transformer | 32 | 65.9874 | 77.062 | 34.392 | 100.2161 | 43.6193 | 39.9856 | | timm_efficientdet | 
1 | 158.6361 | 196.7189 | 77.4094 | nan | 41.7047 | 117.141 | | resnet50 | 32 | 33.4955 | 33.5685 | 32.4322 | 42.0606 | 37.6383 | 44.9736 | | fastNLP_Bert | 6 | 55.7151 | 61.3066 | 72.7396 | nan | 36.671 | 37.9153 | | attention_is_all_you_need_pytorch | 256 | 51.9157 | 57.4662 | 63.2121 | nan | 34.0858 | 35.3213 | | hf_Bart | 4 | 54.8491 | 67.5265 | 65.0411 | nan | 33.2977 | 34.4789 | | timm_vovnet | 32 | 34.5568 | 35.8578 | 37.0782 | 40.3357 | 31.912 | 38.0758 | | timm_efficientnet | 32 | 48.7221 | 63.1469 | 43.3526 | 71.0328 | 31.7212 | 47.4274 | | pytorch_unet | 1 | 40.0458 | 188.9025 | nan | nan | 29.6069 | 31.1129 | | hf_Albert | 8 | 68.3753 | 71.491 | 88.2493 | nan | 28.7502 | 29.4602 | | shufflenet_v2_x1_0 | 128 | 40.6807 | 39.7079 | 41.7737 | 47.3218 | 26.2782 | 32.6035 | | hf_GPT2 | 4 | 48.3918 | 49.398 | 60.1361 | 172.0328 | 25.5009 | 25.9558 | | timm_resnest | 32 | 24.3389 | 23.9075 | 29.381 | 25.5917 | 22.4309 | 24.2864 | | hf_Bert | 4 | 40.1385 | 47.6189 | 44.1776 | nan | 21.0906 | 23.0811 | | hf_DistilBert | 8 | 31.1712 | 32.0196 | 41.9591 | 85.998 | 20.6032 | 21.1789 | | mnasnet1_0 | 32 | 29.4221 | 28.5655 | 23.1961 | 37.962 | 16.2296 | 26.8195 | | BERT_pytorch | 16 | 52.8131 | 64.5081 | 35.177 | 68.63 | 16.0118 | 23.409 | | mobilenet_v3_large | 32 | 35.3938 | 35.2607 | 24.0293 | 46.0031 | 15.6534 | 32.294 | | densenet121 | 4 | 74.3725 | 92.6727 | 30.5667 | 102.899 | 13.8379 | 67.7774 | | resnext50_32x4d | 8 | 28.7309 | 29.6784 | 19.2266 | 38.855 | 13.6702 | 30.8982 | | pytorch_stargan | 16 | 16.017 | 14.4837 | 15.5183 | nan | 11.1253 | 11.4351 | | nvidia_deeprecommender | 256 | 10.3879 | 10.4167 | 14.8867 | 10.3071 | 10.4771 | 10.0586 | | timm_vision_transformer | 8 | 29.1341 | 34.1464 | 18.2403 | 48.692 | 9.9766 | 19.3276 | | LearningToPaint | 96 | 14.7028 | 14.8927 | 12.9184 | 18.3773 | 8.8806 | 12.2087 | | alexnet | 128 | 9.8204 | 9.8449 | 12.0442 | 10.5138 | 8.1198 | 8.137 | | squeezenet1_1 | 32 | 15.725 | 15.2803 | 10.1436 | 20.3634 | 7.373 | 
12.8486 | | pytorch_CycleGAN_and_pix2pix | 1 | 17.7225 | 17.7248 | 10.189 | nan | 6.7521 | 11.4642 | | tts_angular | 64 | 6.4456 | 6.5894 | 6.7688 | 6.9391 | 6.6337 | 6.3682 | | resnet18 | 16 | 12.9977 | 13.0497 | 8.6845 | 16.3702 | 5.0347 | 10.8503 | | pytorch_struct | 200 | 4.5345 | 6.1302 | 4.5364 | 7.5602 | 2.24 | 3.6701 | | drq | 1 | 4.2155 | 4.6989 | 1.969 | 6.5218 | 1.358 | 3.851 | | dcgan | 32 | 3.1741 | 3.3225 | 1.8962 | 4.351 | 1.1009 | 3.0535 | | soft_actor_critic | 256 | 1.3965 | 1.9003 | 1.271 | 2.7521 | 0.888 | 1.3819 | | lennard_jones | 1000 | 1.4703 | 1.891 | 1.1299 | 3.1352 | 0.7412 | 1.4508 | | tacotron2 | 64 | 3034.4738 | 4455.4998 | 3465.478 | 5231.4115 | nan | 3420.2877 | | dlrm | 2048 | 522.0778 | 476.2121 | nan | 527.498 | nan | 466.2965 | | hf_GPT2_large | 4 | 209.6163 | 212.1006 | nan | nan | nan | 113.1483 | | functorch_dp_cifar10 | 64 | 14.1285 | 14.8192 | 5.9071 | nan | nan | nan | | hf_BigBird | 2 | 199.4072 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
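To summarize a table like the ones above, the sketch below parses the ASCII layout (including the flattened single-line form) and computes a geometric-mean speedup per backend. It is not the dashboard's own tooling: the helper names `parse_flat_table` and `geomean_speedup` are hypothetical, and treating `0.0` and `nan` entries as failed runs to be excluded is an assumption, inferred from rows such as `hf_GPT2_large`, where a `0.0` speedup coincides with a `nan` latency.

```python
import math
import re

BORDER = re.compile(r"\+[-+]+\+")   # rule lines such as +-----+-----+
ROW_BREAK = re.compile(r"\|\s+\|")  # "| |": end of one row, start of the next


def parse_flat_table(text):
    """Parse one ASCII table (possibly flattened) into dicts keyed by header."""
    body = BORDER.sub(" ", text).strip()
    rows = []
    for chunk in ROW_BREAK.split(body):
        # Drop the outer pipes, then split the remaining cells.
        cells = [c.strip() for c in chunk.strip().strip("|").split("|")]
        rows.append(cells)
    header, data = rows[0], rows[1:]
    # Skip rows whose cell count does not match the header (e.g. truncated rows).
    return [dict(zip(header, r)) for r in data if len(r) == len(header)]


def geomean_speedup(rows, backend):
    """Geometric mean over positive entries; returns (mean, rows_used).

    Assumption: 0.0 and nan mark runs that failed, so they are excluded.
    """
    vals = []
    for r in rows:
        try:
            v = float(r[backend])
        except (KeyError, ValueError):
            continue  # missing column or non-numeric cell such as "pass"
        if v > 0 and not math.isnan(v):
            vals.append(v)
    if not vals:
        return float("nan"), 0
    return math.exp(sum(math.log(v) for v in vals) / len(vals)), len(vals)
```

Pass in the text between one pair of `~~~` fences, then call `geomean_speedup(rows, "inductor")`. Reporting the number of rows used next to the mean matters here, because backends that fail on more models would otherwise be averaged over fewer, possibly easier, models.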

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | MobileBertForMaskedLM | 64 | 1.0208 | 0.8359 | 1.2379 | 0.0 | 2.9112 | 1.7978 | | MT5ForConditionalGeneration | 16 | 1.02 | 0.8921 | 1.0885 | 0.8906 | 2.733 | 2.0139 | | OPTForCausalLM | 2 | 1.0003 | 0.9482 | 0.0 | 0.8061 | 2.389 | 2.3767 | | GPT2ForSequenceClassification | 4 | 0.9979 | 0.9766 | 0.0 | 0.5044 | 2.3146 | 2.3062 | | MobileBertForQuestionAnswering | 128 | 1.02 | 0.8296 | 0.9901 | 0.0 | 2.198 | 1.7343 | | ElectraForQuestionAnswering | 64 | 1.0005 | 0.9778 | 0.7673 | 0.0 | 2.111 | 2.0526 | | XLNetLMHeadModel | 8 | 1.001 | 0.9716 | 0.0 | 0.0 | 1.9218 | 1.9226 | | RobertaForQuestionAnswering | 16 | 0.9997 | 0.9775 | 0.7651 | 0.0 | 1.8274 | 1.7633 | | ElectraForCausalLM | 32 | 1.0004 | 0.9403 | 0.714 | 0.0 | 1.8214 | 1.8242 | | BertForQuestionAnswering | 16 | 1.0008 | 0.9785 | 0.7715 | 0.0 | 1.8177 | 1.77 | | LayoutLMForSequenceClassification | 16 | 1.0003 | 0.9793 | 0.7737 | 0.0 | 1.8163 | 1.8057 | | MegatronBertForCausalLM | 4 | 1.0117 | 0.898 | 0.7622 | 0.0 | 1.8056 | 1.5565 | | XGLMForCausalLM | 8 | 1.0104 | 0.8283 | 0.9106 | 0.0 | 1.7048 | 1.7628 | | RobertaForCausalLM | 16 | 1.0001 | 0.9714 | 0.7564 | 0.0 | 1.6987 | 1.6826 | | DistillGPT2 | 16 | 0.9995 | 0.9671 | 0.7633 | 0.7562 | 1.6845 | 1.7145 | | AlbertForQuestionAnswering | 4 | 0.9997 | 0.886 | 0.0 | 0.0 | 1.6636 | 1.657 | | PLBartForConditionalGeneration | 4 | 0.9995 | 0.9548 | 0.7381 | 0.0 | 1.6563 | 1.6477 | | AlbertForMaskedLM | 4 | 1.0003 | 0.8856 | 0.0 | 0.0 | 1.6522 | 1.643 | | MegatronBertForQuestionAnswering | 8 | 0.9996 | 0.9706 | 0.7656 | 0.0 | 1.6236 | 1.5805 | | 
LayoutLMForMaskedLM | 16 | 1.0005 | 0.971 | 0.7565 | 0.0 | 1.6227 | 1.6021 | | T5Small | 4 | 1.0046 | 0.9206 | 0.7564 | 1.1604 | 1.6098 | 1.6646 | | T5ForConditionalGeneration | 4 | 1.0019 | 0.9048 | 0.7572 | 1.1654 | 1.6071 | 1.6627 | | BertForMaskedLM | 16 | 1.0005 | 0.9705 | 0.7539 | 0.0 | 1.5974 | 1.5829 | | PLBartForCausalLM | 8 | 0.9993 | 0.9641 | 0.7612 | 0.9803 | 1.5802 | 1.6022 | | M2M100ForConditionalGeneration | 16 | 1.0924 | 0.8518 | 0.8452 | 0.632 | 1.5752 | 1.7806 | | MBartForConditionalGeneration | 2 | 1.004 | 0.971 | 0.0 | 0.8015 | 1.5307 | 1.4285 | | CamemBert | 16 | 1.0003 | 0.9724 | 0.7655 | 0.0 | 1.5296 | 1.5204 | | BartForConditionalGeneration | 2 | 1.0043 | 0.9511 | 0.0 | 0.0 | 1.521 | 1.4187 | | DistilBertForQuestionAnswering | 256 | 0.9998 | 0.9948 | 0.7578 | 0.6638 | 1.5121 | 1.49 | | MBartForCausalLM | 4 | 1.0 | 0.9638 | 0.7553 | 0.996 | 1.4334 | 1.4336 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0051 | 0.9206 | 0.7333 | 0.0 | 1.4047 | 1.3966 | | BartForCausalLM | 4 | 1.0002 | 0.9695 | 0.758 | 0.0 | 1.401 | 1.4212 | | Speech2Text2ForCausalLM | 256 | 0.9984 | 0.9443 | 0.6887 | 0.9073 | 1.3351 | 1.3669 | | DebertaForMaskedLM | 4 | 0.8908 | 0.7181 | 0.7784 | 0.0 | 1.3176 | 1.1302 | | PegasusForConditionalGeneration | 32 | 1.0011 | 0.9558 | 0.0 | 0.8205 | 1.267 | 1.2559 | | TrOCRForCausalLM | 32 | 1.0001 | 0.9639 | 0.0 | 0.0 | 1.2432 | 1.2576 | | DebertaV2ForMaskedLM | 1 | 0.8815 | 0.7008 | 0.7701 | 0.0 | 1.2334 | 0.9081 | | DistilBertForMaskedLM | 128 | 0.9997 | 0.9593 | 0.715 | 0.6569 | 1.2237 | 1.2359 | | BlenderbotSmallForCausalLM | 64 | 1.0004 | 0.9283 | 0.7156 | 0.0 | 1.2101 | 1.2241 | | Reformer | 16 | 0.9997 | 1.0003 | 0.9799 | 0.9405 | 1.1809 | 1.1966 | | PegasusForCausalLM | 32 | 1.0018 | 0.9523 | 0.7506 | 0.8613 | 1.1628 | 1.1616 | | DebertaForQuestionAnswering | 8 | 0.9762 | 0.8175 | 0.7242 | 0.0 | 1.1443 | 1.2335 | | DebertaV2ForQuestionAnswering | 2 | 0.8829 | 0.6909 | 0.0 | 0.0 | 1.0925 | 0.9445 | | 
BlenderbotForCausalLM | 4 | 1.0116 | 0.0 | 0.0 | 0.0 | 0.0 | 1.1688 | | YituTechConvBert | 16 | 1.0 | 0.9668 | 0.7888 | 0.0 | 0.0 | 0.0 | | AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ | MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | 
pass | pass | pass | fail_to_run | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | T5Small | 1 | pass | pass | pass | pass | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | pass | | PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | 
fail_to_run | | YituTechConvBert | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | DebertaV2ForMaskedLM | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | BlenderbotForCausalLM | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaV2ForMaskedLM | 1 | 8.6553 | 19.7392 | 79.8835 | nan | 179.9417 | 56.5487 | | DebertaV2ForQuestionAnswering | 2 | 8.8545 | 18.8745 | nan | nan | 179.5773 | 55.6554 | | XLNetLMHeadModel | 8 | 4.8778 | 19.3727 | nan | nan | 115.9403 | 118.8197 | | DebertaForQuestionAnswering | 8 | 4.9424 | 11.7741 | 35.5833 | nan | 105.2156 | 39.1031 | | DebertaForMaskedLM | 4 | 4.9422 | 10.8902 | 35.6165 | nan | 99.2784 | 39.054 | | XGLMForCausalLM | 8 | 3.0679 | 13.7434 | 28.3778 | nan | 76.8484 | 74.437 | | MobileBertForQuestionAnswering | 128 | 9.8559 | 32.072 | 58.9893 | nan | 70.6794 | 69.0563 | | MobileBertForMaskedLM | 64 | 9.8661 | 32.0915 | 58.9628 | nan | 69.3819 | 67.6998 | | M2M100ForConditionalGeneration | 16 | 3.7665 | 17.0864 | 28.9788 | 375.6289 | 63.0494 | 56.0462 | | MT5ForConditionalGeneration | 16 | 4.0033 | 13.035 | 21.0118 | 153.9684 | 54.7566 | 53.8956 | | 
PegasusForConditionalGeneration | 32 | 3.6855 | 16.4555 | nan | 374.2039 | 53.0441 | 48.6376 | | BartForConditionalGeneration | 2 | 3.9632 | 18.0675 | nan | nan | 51.818 | 49.7069 | | MBartForConditionalGeneration | 2 | 3.8527 | 17.1332 | nan | 402.3115 | 51.5053 | 49.3614 | | MegatronBertForCausalLM | 4 | 3.901 | 14.3672 | 22.5845 | nan | 40.9585 | 39.6605 | | MegatronBertForQuestionAnswering | 8 | 3.9612 | 14.6625 | 23.3526 | nan | 39.9057 | 39.1821 | | BlenderbotSmallForConditionalGeneration | 64 | 2.5132 | 11.3271 | 18.3666 | nan | 35.2857 | 33.5735 | | T5Small | 4 | 2.7626 | 8.8153 | 13.8472 | 94.4455 | 31.4847 | 30.3852 | | T5ForConditionalGeneration | 4 | 2.7789 | 8.6624 | 13.5905 | 92.6412 | 31.3792 | 30.338 | | PLBartForConditionalGeneration | 4 | 2.089 | 8.9462 | 13.9474 | nan | 29.6772 | 29.2984 | | LayoutLMForSequenceClassification | 16 | 2.2791 | 7.7596 | 11.6182 | nan | 29.0485 | 28.2891 | | ElectraForCausalLM | 32 | 1.9249 | 7.2086 | 11.2864 | nan | 26.539 | 25.5229 | | PegasusForCausalLM | 32 | 1.5122 | 6.5587 | 10.3259 | 105.553 | 25.0318 | 23.7799 | | MBartForCausalLM | 4 | 1.5507 | 6.8422 | 10.0932 | 112.0835 | 24.1721 | 23.1583 | | BartForCausalLM | 4 | 1.5102 | 6.7138 | 9.8508 | nan | 23.4414 | 22.1865 | | LayoutLMForMaskedLM | 16 | 2.2584 | 7.838 | 11.8422 | nan | 23.2166 | 22.2155 | | TrOCRForCausalLM | 32 | 1.4665 | 6.5592 | nan | nan | 22.5314 | 21.9917 | | BertForMaskedLM | 16 | 1.8842 | 7.3149 | 10.5596 | nan | 21.6733 | 20.921 | | OPTForCausalLM | 2 | 1.5665 | 6.7985 | nan | 103.9763 | 21.667 | 20.6048 | | RobertaForCausalLM | 16 | 1.949 | 7.1361 | 11.1384 | nan | 21.3093 | 20.5947 | | ElectraForQuestionAnswering | 64 | 1.9389 | 7.2114 | 10.893 | nan | 21.2764 | 21.2039 | | BertForQuestionAnswering | 16 | 1.9211 | 7.2752 | 10.6879 | nan | 20.9947 | 20.2868 | | CamemBert | 16 | 1.9488 | 7.2127 | 10.8943 | nan | 20.6713 | 20.0697 | | RobertaForQuestionAnswering | 16 | 2.0076 | 7.2309 | 10.6075 | nan | 19.8629 | 19.0999 | | 
GPT2ForSequenceClassification | 4 | 1.8084 | 6.4288 | nan | 85.2785 | 19.0915 | 19.0725 |
| AlbertForMaskedLM | 4 | 1.698 | 6.8668 | nan | nan | 18.687 | 18.0884 |
| AlbertForQuestionAnswering | 4 | 1.6892 | 6.9001 | nan | nan | 18.0945 | 17.5297 |
| Reformer | 16 | 1.5239 | 2.9688 | 5.895 | 17.9955 | 16.7231 | 14.0784 |
| BlenderbotSmallForCausalLM | 64 | 1.0164 | 4.4129 | 6.7499 | nan | 16.4451 | 16.6373 |
| Speech2Text2ForCausalLM | 256 | 0.8902 | 3.4111 | 5.7445 | 50.4782 | 14.4181 | 13.4378 |
| DistillGPT2 | 16 | 0.9335 | 3.4912 | 5.0758 | 42.5321 | 14.0637 | 13.3707 |
| PLBartForCausalLM | 8 | 0.8518 | 3.4359 | 5.2857 | 66.1237 | 13.9257 | 13.8275 |
| DistilBertForMaskedLM | 128 | 0.8688 | 3.5717 | 6.2535 | 57.5602 | 12.3843 | 12.5327 |
| DistilBertForQuestionAnswering | 256 | 0.8614 | 3.5985 | 5.8109 | 56.4613 | 11.7273 | 11.5997 |
| BlenderbotForCausalLM | 4 | 2.8694 | nan | nan | nan | nan | 43.1989 |
| YituTechConvBert | 16 | 2.7148 | 10.6458 | 17.1653 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9997 | 0.9183 | nan | 1.2641 | 1.2906 | 1.345 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0992 | 1.5169 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | nan | 1.2229 | 1.0775 | 1.1712 |
| MBartForCausalLM | 4 | 1.0 | 0.8998 | 0.3747 | 1.3748 | 1.0747 | 1.1342 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0568 | 1.1144 |
| XLNetLMHeadModel | 8 | 0.9999 | 0.9214 | nan | nan | 1.0303 | 1.0303 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 |
| MBartForConditionalGeneration | 2 | 1.0 | 0.9035 | nan | 1.3227 | 1.0148 | 1.2186 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 |
| PegasusForConditionalGeneration | 32 | 0.9979 | 0.9502 | nan | 1.2087 | 1.0039 | 1.1394 |
| RobertaForQuestionAnswering | 16 | 1.004 | 0.9315 | 0.3619 | nan | 1.0036 | 1.0618 |
| BertForQuestionAnswering | 16 | 1.004 | 0.9312 | 0.3618 | nan | 1.0029 | 1.0617 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9976 | 1.1976 |
| DistilBertForQuestionAnswering | 256 | 1.0112 | 0.9568 | 0.3185 | 1.1483 | 0.9806 | 1.0864 |
| DistillGPT2 | 16 | 1.0 | 0.8673 | 0.3596 | 1.1412 | 0.9755 | 1.0618 |
| PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | 1.1321 | 0.9708 | 1.0363 |
| T5Small | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9662 | 1.1856 |
| T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9662 | 1.1856 |
| PLBartForConditionalGeneration | 4 | 0.9997 | 0.9325 | 0.3746 | nan | 0.9651 | 1.0848 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9593 | 1.1105 |
| MegatronBertForQuestionAnswering | 8 | 1.0006 | 0.9101 | 0.3721 | nan | 0.9562 | 1.0239 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9481 | 0.9848 |
| BertForMaskedLM | 16 | 1.0001 | 0.9237 | 0.3656 | nan | 0.9481 | 0.9849 |
| RobertaForCausalLM | 16 | 1.0 | 0.9237 | 0.3654 | nan | 0.9475 | 0.9847 |
| CamemBert | 16 | 1.0 | 0.9212 | 0.3657 | nan | 0.9446 | 0.983 |
| TrOCRForCausalLM | 32 | 0.9998 | 0.8789 | nan | nan | 0.9345 | 1.0129 |
| MT5ForConditionalGeneration | 16 | 1.0015 | 0.864 | 0.4151 | 1.0159 | 0.9203 | 1.0032 |
| PLBartForCausalLM | 8 | 0.9999 | 0.8707 | 0.3624 | 1.0907 | 0.9166 | 0.989 |
| MegatronBertForCausalLM | 4 | 1.0 | 0.8798 | 0.3875 | nan | 0.9121 | 1.0221 |
| DistilBertForMaskedLM | 128 | 1.0 | 0.8497 | 0.3516 | 1.0867 | 0.8716 | 0.9439 |
| Speech2Text2ForCausalLM | 256 | 0.9668 | 0.839 | 0.3505 | 1.0447 | 0.8672 | 0.9793 |
| ElectraForCausalLM | 32 | 0.9977 | 0.848 | 0.3928 | nan | 0.856 | 0.9327 |
| M2M100ForConditionalGeneration | 16 | 1.0098 | 0.9205 | 0.4099 | 1.0614 | 0.8468 | 1.04 |
| BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.846 | 0.9426 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9164 | 0.4336 | nan | 0.8055 | 0.9902 |
| MobileBertForMaskedLM | 64 | 0.9999 | 0.8791 | 0.3355 | nan | 0.6698 | 0.9649 |
| DebertaV2ForMaskedLM | 1 | 0.9982 | 0.941 | 0.4917 | nan | 0.6117 | 0.9912 |
| MobileBertForQuestionAnswering | 128 | 1.0159 | 1.0063 | 0.306 | nan | 0.5988 | 0.8126 |
| Reformer | 16 | 0.9771 | 0.9998 | 0.5635 | 0.9998 | 0.5813 | 1.0027 |
| DebertaV2ForQuestionAnswering | 2 | 0.9796 | 0.9796 | nan | nan | 0.5266 | 0.9885 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9825 | 0.3624 | nan | 0.409 | 1.0674 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3252 | nan | 0.3071 | 1.1614 |
| BlenderbotForCausalLM | 4 | 1.0002 | nan | nan | nan | nan | 0.9343 |
| YituTechConvBert | 16 | 0.9954 | 0.9136 | 0.3774 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| Reformer | 16 | 299.8727 | 299.9389 | 306.4396 | 318.8308 | 254.1805 | 250.5646 |
| AlbertForMaskedLM | 4 | 267.0159 | 301.9618 | nan | nan | 161.9362 | 162.7812 |
| AlbertForQuestionAnswering | 4 | 265.0347 | 299.3257 | nan | nan | 159.5329 | 160.0368 |
| XLNetLMHeadModel | 8 | 275.6097 | 284.0695 | nan | nan | 143.3345 | 143.2304 |
| PegasusForConditionalGeneration | 32 | 140.4838 | 147.0671 | nan | 171.6187 | 111.2645 | 112.8279 |
| TrOCRForCausalLM | 32 | 137.4733 | 142.1694 | nan | nan | 109.6438 | 108.9484 |
| DebertaV2ForQuestionAnswering | 2 | 117.1629 | 149.0775 | nan | nan | 95.1649 | 126.5363 |
| BartForConditionalGeneration | 2 | 135.0124 | 150.2864 | nan | nan | 93.9996 | 96.1722 |
| MBartForConditionalGeneration | 2 | 135.3153 | 140.0084 | nan | 169.0876 | 93.1862 | 95.1063 |
| MegatronBertForQuestionAnswering | 8 | 140.8527 | 145.5226 | 186.7422 | nan | 86.9711 | 89.3803 |
| DebertaV2ForMaskedLM | 1 | 110.4067 | 156.1872 | 129.7239 | nan | 83.1925 | 113.5219 |
| MobileBertForQuestionAnswering | 128 | 171.0109 | 209.8607 | 182.9888 | nan | 83.1463 | 105.0212 |
| BartForCausalLM | 4 | 112.1825 | 116.1964 | 148.305 | nan | 78.9214 | 78.9669 |
| BlenderbotSmallForConditionalGeneration | 64 | 109.465 | 119.1541 | 150.0955 | nan | 78.6345 | 78.475 |
| MBartForCausalLM | 4 | 112.4334 | 116.7246 | 146.6273 | 112.7563 | 78.1935 | 78.2954 |
| CamemBert | 16 | 118.177 | 121.3622 | 154.3758 | nan | 77.1357 | 77.7846 |
| PLBartForCausalLM | 8 | 112.9763 | 116.0123 | 144.5895 | 114.0021 | 71.2553 | 70.415 |
| M2M100ForConditionalGeneration | 16 | 98.6963 | 126.872 | 127.706 | 168.6009 | 70.6795 | 72.5444 |
| PLBartForConditionalGeneration | 4 | 116.1647 | 121.5262 | 157.6636 | nan | 70.004 | 70.2197 |
| OPTForCausalLM | 2 | 165.56 | 174.4672 | nan | 205.502 | 69.2614 | 69.6287 |
| LayoutLMForMaskedLM | 16 | 111.9724 | 115.6075 | 148.137 | nan | 69.1937 | 70.2253 |
| DistilBertForMaskedLM | 128 | 84.142 | 87.6889 | 117.8968 | 128.1406 | 68.7515 | 68.1096 |
| BertForMaskedLM | 16 | 109.4423 | 112.8365 | 145.5109 | nan | 68.5846 | 69.2204 |
| DistilBertForQuestionAnswering | 256 | 102.6963 | 103.3513 | 135.7565 | 154.8999 | 68.1805 | 69.2286 |
| RobertaForCausalLM | 16 | 114.4699 | 117.8248 | 151.6209 | nan | 67.4244 | 68.0711 |
| DebertaForQuestionAnswering | 8 | 77.0169 | 92.5509 | 103.9868 | nan | 65.8594 | 60.9374 |
| MobileBertForMaskedLM | 64 | 172.1292 | 214.6192 | 142.9037 | nan | 64.4479 | 102.7613 |
| T5Small | 4 | 101.4598 | 109.6159 | 134.3841 | 86.9534 | 63.3434 | 60.7182 |
| T5ForConditionalGeneration | 4 | 101.0915 | 112.2554 | 134.8586 | 86.258 | 63.2509 | 60.6937 |
| DistillGPT2 | 16 | 105.6969 | 109.4455 | 138.5156 | 139.7859 | 62.8603 | 61.6828 |
| PegasusForCausalLM | 32 | 69.2383 | 72.1972 | 91.7489 | 79.8819 | 59.7533 | 59.7637 |
| ElectraForQuestionAnswering | 64 | 114.648 | 117.0475 | 149.4921 | nan | 54.3108 | 55.7892 |
| MegatronBertForCausalLM | 4 | 85.3716 | 100.9503 | 114.456 | nan | 54.1931 | 56.0263 |
| DebertaForMaskedLM | 4 | 68.1975 | 84.1462 | 79.0122 | nan | 53.67 | 55.9285 |
| LayoutLMForSequenceClassification | 16 | 97.1567 | 99.4611 | 125.6888 | nan | 53.5551 | 53.8956 |
| XGLMForCausalLM | 8 | 85.8411 | 106.712 | 94.0417 | nan | 52.6797 | 58.5968 |
| BertForQuestionAnswering | 16 | 94.853 | 96.8669 | 124.0438 | nan | 52.1819 | 53.477 |
| RobertaForQuestionAnswering | 16 | 95.3484 | 97.3925 | 124.5103 | nan | 52.028 | 53.9139 |
| BlenderbotSmallForCausalLM | 64 | 58.7711 | 63.4115 | 81.6316 | nan | 48.4291 | 48.331 |
| ElectraForCausalLM | 32 | 87.2009 | 92.7299 | 122.342 | nan | 47.9065 | 47.776 |
| MT5ForConditionalGeneration | 16 | 92.9104 | 106.2693 | 85.7228 | 104.2888 | 40.3269 | 46.8923 |
| Speech2Text2ForCausalLM | 256 | 53.0239 | 56.0441 | 76.9458 | 58.2366 | 39.7083 | 38.7124 |
| GPT2ForSequenceClassification | 4 | 91.0825 | 92.8174 | nan | 179.7512 | 39.2483 | 40.1766 |
| BlenderbotForCausalLM | 4 | 90.9062 | nan | nan | nan | nan | 79.1954 |
| YituTechConvBert | 16 | 133.1978 | 137.7532 | 169.1462 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
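
For anyone who wants to post-process these tables rather than read them by eye, here is a minimal parsing sketch. It is not part of the benchmark scripts: `parse_table` and `_convert` are hypothetical helpers, and the sample is two rows from the absolute-latency tables in this thread, trimmed to four columns with shortened borders. It assumes `|`-separated cells, `+---+` borders, and no empty cells (missing values appear as `nan`), and it accepts the table either one row per line or run together on a single line.

```python
import math
import re

# Matches border segments such as "+-----+-----+".
BORDER = re.compile(r"\+(?:-+\+)+")


def _convert(cell):
    """Return a float where possible ("nan" becomes float nan), else the raw string."""
    try:
        return float(cell)
    except ValueError:
        return cell


def parse_table(text):
    """Parse a bordered, '|'-separated table into a list of dicts keyed by header."""
    body = BORDER.sub(" ", text)
    rows, current = [], []
    for cell in (c.strip() for c in body.split("|")):
        if cell:
            current.append(cell)
        elif current:
            # An empty token marks the gap between two rows.
            rows.append(current)
            current = []
    if current:
        rows.append(current)
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, (_convert(c) for c in row))) for row in data]


# Sample rows, trimmed to four columns and written on one line for illustration.
SAMPLE = (
    "+------+-----+-------+----------+ "
    "| name | bs | eager | inductor | "
    "+------+-----+-------+----------+ "
    "| lcnet_050 | 128 | 33.7573 | 22.0269 | "
    "| eca_halonext26ts | 128 | 116.0606 | nan | "
    "+------+-----+-------+----------+"
)

rows = parse_table(SAMPLE)
for row in rows:
    status = "no result" if math.isnan(row["inductor"]) else f'{row["inductor"]} ms'
    print(row["name"], status)
```

Splitting on empty tokens instead of newlines is what lets the same function handle both layouts; the trade-off is that a genuinely empty cell would be misread as a row break.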

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| xcit_large_24_p8_224 | 5 | 0.9992 | 0.0 | 0.0 | 0.0 | 2.1024 | 1.7365 |
| tnt_s_patch16_224 | 128 | 0.9998 | 0.9982 | 0.0 | 0.0 | 1.9902 | 1.9541 |
| coat_lite_mini | 128 | 0.9998 | 0.9958 | 0.8461 | 1.1514 | 1.8221 | 1.7429 |
| twins_pcpvt_base | 64 | 1.006 | 0.9202 | 0.9194 | 0.0 | 1.8126 | 1.6795 |
| regnety_002 | 128 | 0.9773 | 0.9423 | 1.1078 | 0.8581 | 1.721 | 1.0655 |
| ghostnet_100 | 128 | 1.004 | 0.9736 | 0.8783 | 1.0185 | 1.6978 | 1.4156 |
| volo_d1_224 | 64 | 0.9999 | 0.9924 | 0.8462 | 0.0 | 1.5904 | 1.5572 |
| gmixer_24_224 | 128 | 0.9999 | 0.8801 | 0.7222 | 0.9268 | 1.5533 | 1.5081 |
| gmlp_s16_224 | 128 | 0.9996 | 0.9959 | 0.7866 | 1.0117 | 1.5342 | 1.4975 |
| lcnet_050 | 128 | 0.9654 | 0.9444 | 0.8415 | 1.0294 | 1.5065 | 1.3288 |
| cait_m36_384 | 4 | 1.0 | 0.9887 | 0.0 | 0.0 | 1.4784 | 1.4495 |
| swin_base_patch4_window7_224 | 64 | 0.9995 | 0.9607 | 0.0 | 0.0 | 1.4683 | 1.4151 |
| jx_nest_base | 32 | 0.9997 | 0.9928 | 0.7998 | 0.0 | 1.3894 | 1.3563 |
| crossvit_9_240 | 128 | 0.9999 | 0.995 | 0.8389 | 0.9197 | 1.3266 | 1.3068 |
| convit_base | 64 | 0.9999 | 0.9965 | 0.8338 | 1.2323 | 1.3079 | 1.3825 |
| dm_nfnet_f0 | 128 | 0.9985 | 0.9994 | 0.8795 | 0.9251 | 1.3031 | 1.2388 |
| deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9919 | 0.7979 | 0.976 | 1.2813 | 1.262 |
| pit_b_224 | 64 | 0.9996 | 0.9952 | 0.8222 | 0.9721 | 1.2781 | 1.2714 |
| mixer_b16_224 | 128 | 0.9996 | 0.9976 | 0.8025 | 0.9015 | 1.2729 | 1.2697 |
| nfnet_l0 | 128 | 1.0003 | 0.8099 | 0.7147 | 0.8523 | 1.2423 | 1.1869 |
| convnext_base | 64 | 0.9992 | 0.9951 | 0.8001 | 0.0 | 1.2302 | 1.1412 |
| hrnet_w18 | 128 | 1.003 | 1.0181 | 0.8552 | 0.0 | 1.2206 | 1.1804 |
| beit_base_patch16_224 | 64 | 0.9998 | 0.9789 | 0.0 | 0.0 | 1.214 | 1.1972 |
| adv_inception_v3 | 128 | 1.0 | 0.9963 | 0.8538 | 1.1346 | 1.1536 | 1.113 |
| resmlp_12_224 | 128 | 1.0001 | 0.9993 | 0.781 | 1.4636 | 1.1431 | 1.1474 |
| vit_base_patch16_224 | 64 | 1.0 | 0.9938 | 0.8356 | 0.9122 | 1.1239 | 1.1178 |
| gluon_inception_v3 | 128 | 0.9998 | 0.996 | 0.8539 | 1.1425 | 1.1187 | 1.1015 |
| pnasnet5large | 16 | 1.0057 | 1.0355 | 0.847 | 0.0 | 1.1155 | 1.0609 |
| inception_v3 | 128 | 1.0 | 0.9966 | 0.8533 | 1.1425 | 1.1086 | 1.0948 |
| mnasnet_100 | 128 | 0.9547 | 0.9409 | 0.7886 | 1.2083 | 1.0872 | 1.0816 |
| spnasnet_100 | 128 | 0.9477 | 0.9363 | 0.7764 | 1.1013 | 1.0196 | 1.0689 |
| fbnetc_100 | 128 | 0.9532 | 0.9448 | 0.7928 | 1.1674 | 1.0087 | 1.0805 |
| tf_mixnet_l | 128 | 0.9811 | 0.9098 | 0.7953 | 0.0 | 0.9902 | 0.99 |
| mobilenetv3_large_100 | 128 | 0.9557 | 0.9452 | 0.7835 | 1.0018 | 0.9845 | 1.0908 |
| mixnet_l | 128 | 0.9804 | 0.9059 | 0.7939 | 0.0 | 0.9833 | 0.9822 |
| poolformer_m36 | 64 | 0.9995 | 0.9982 | 0.8072 | 0.0 | 0.9595 | 0.9491 |
| fbnetv3_b | 128 | 0.9529 | 0.9419 | 0.7791 | 0.0 | 0.9318 | 0.9635 |
| selecsls42b | 128 | 0.9997 | 0.9953 | 0.8415 | 1.2699 | 0.8724 | 0.8617 |
| dla102 | 128 | 1.0001 | 0.9961 | 0.8379 | 1.3148 | 0.8363 | 0.8176 |
| cspdarknet53 | 64 | 0.943 | 0.9354 | 0.7559 | 1.142 | 0.8101 | 0.8097 |
| gernet_l | 128 | 0.9477 | 0.9393 | 0.769 | 1.0624 | 0.7873 | 0.7672 |
| tf_efficientnet_b0 | 128 | 0.966 | 0.8069 | 0.6671 | 0.9524 | 0.787 | 0.8935 |
| res2net101_26w_4s | 64 | 1.0031 | 1.0037 | 0.9419 | 0.0 | 0.784 | 0.7327 |
| tinynet_a | 128 | 0.9691 | 0.8021 | 0.6511 | 0.7928 | 0.78 | 0.8259 |
| mobilenetv2_100 | 128 | 0.9506 | 0.9421 | 0.7219 | 1.0835 | 0.7787 | 0.6304 |
| dpn107 | 32 | 0.9307 | 0.9408 | 0.7504 | 0.0 | 0.7719 | 0.7467 |
| resnest101e | 64 | 0.9996 | 0.992 | 0.8148 | 0.0 | 0.7576 | 0.7694 |
| gluon_xception65 | 32 | 0.9998 | 0.9888 | 0.754 | 0.0 | 0.7503 | 0.6438 |
| repvgg_a2 | 128 | 0.9433 | 0.936 | 0.7952 | 1.0708 | 0.7408 | 0.7729 |
| mobilevit_s | 64 | 0.9728 | 0.8153 | 0.6566 | 0.0 | 0.7378 | 0.7497 |
| res2net50_14w_8s | 128 | 1.0012 | 0.9931 | 0.8098 | 1.0102 | 0.7271 | 0.6206 |
| convmixer_768_32 | 32 | 0.999 | 0.998 | 0.9229 | 0.0 | 0.7237 | 0.7143 |
| visformer_small | 128 | 0.9993 | 1.002 | 0.8401 | 0.0 | 0.7106 | 0.698 |
| sebotnet33ts_256 | 64 | 0.9669 | 0.8368 | 0.6796 | 0.9635 | 0.6872 | 0.7284 |
| ese_vovnet19b_dw | 128 | 0.9704 | 0.9652 | 0.7681 | 1.1279 | 0.6602 | 0.6811 |
| swsl_resnext101_32x16d | 32 | 0.9996 | 0.9807 | 0.8086 | 0.0 | 0.6538 | 0.6637 |
| eca_botnext26ts_256 | 128 | 0.981 | 0.8118 | 0.6719 | 1.0709 | 0.6448 | 0.6238 |
| rexnet_100 | 128 | 0.9651 | 0.8507 | 0.69 | 0.0 | 0.6445 | 0.762 |
| res2next50 | 128 | 0.9999 | 0.9958 | 0.8331 | 1.1449 | 0.6241 | 0.565 |
| botnet26t_256 | 128 | 0.9801 | 0.9753 | 0.8124 | 1.2771 | 0.604 | 0.642 |
| eca_halonext26ts | 128 | 0.9815 | 0.816 | 0.6794 | 0.0 | 0.0 | 0.5691 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| cait_m36_384 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | fail_accuracy | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| resnest101e | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 6.8178 | 30.0453 | 56.1847 | nan | 207.4302 | 195.4759 |
| pnasnet5large | 16 | 5.2592 | 22.7544 | 41.7831 | nan | 198.2433 | 192.3953 |
| res2net50_14w_8s | 128 | 3.1713 | 14.4814 | 24.3795 | 341.6674 | 197.0282 | 194.0609 |
| ghostnet_100 | 128 | 3.275 | 9.7479 | 14.3758 | 192.7155 | 194.9782 | 194.4382 |
| twins_pcpvt_base | 64 | 2.9881 | 14.6014 | 26.066 | nan | 141.6654 | 139.2786 |
| res2net101_26w_4s | 64 | 3.4799 | 16.5154 | 27.9364 | nan | 134.77 | 133.0188 |
| dpn107 | 32 | 4.0396 | 14.4689 | 39.1782 | nan | 129.095 | 126.8431 |
| rexnet_100 | 128 | 2.0311 | 7.5013 | 16.9708 | nan | 122.2382 | 121.3356 |
| mobilevit_s | 64 | 2.0799 | 7.4344 | 15.1729 | nan | 117.0541 | 117.0596 |
| fbnetv3_b | 128 | 3.3519 | 11.4631 | 28.3919 | nan | 114.3903 | 112.2361 |
| resnest101e | 64 | 3.5904 | 15.9819 | 27.2543 | nan | 112.3601 | 111.9318 |
| mixnet_l | 128 | 5.5672 | 12.4956 | 26.4047 | nan | 104.4184 | 100.2507 |
| tf_mixnet_l | 128 | 5.8311 | 12.8346 | 27.0176 | nan | 102.9914 | 101.7384 |
| tinynet_a | 128 | 2.2501 | 8.0331 | 19.6328 | 193.1617 | 100.8057 | 99.4974 |
| gluon_inception_v3 | 128 | 1.7951 | 8.4895 | 13.3253 | 182.7971 | 99.3939 | 98.0433 |
| adv_inception_v3 | 128 | 1.8589 | 8.3797 | 13.4071 | 185.6317 | 98.8296 | 97.592 |
| inception_v3 | 128 | 1.7955 | 8.3779 | 13.3461 | 187.2416 | 98.815 | 97.6534 |
| fbnetc_100 | 128 | 2.1484 | 6.6756 | 17.219 | 133.822 | 92.8048 | 90.8669 |
| dla102 | 128 | 2.0732 | 9.5362 | 15.0414 | 245.121 | 91.5245 | 88.2575 |
| spnasnet_100 | 128 | 2.1139 | 6.6979 | 16.8865 | 129.6784 | 86.0289 | 85.0378 |
| cspdarknet53 | 64 | 2.3949 | 7.363 | 18.7838 | 145.4289 | 86.0239 | 83.9691 |
| xcit_large_24_p8_224 | 5 | 3.4402 | nan | nan | nan | 83.7889 | 81.0046 |
| mobilenetv3_large_100 | 128 | 1.6659 | 5.6582 | 13.1196 | 141.4444 | 83.7645 | 83.1199 |
| poolformer_m36 | 64 | 1.7814 | 7.3615 | 11.9919 | nan | 83.1921 | 81.9303 |
| res2next50 | 128 | 1.7642 | 8.1217 | 13.1356 | 197.6893 | 82.5102 | 81.5968 |
| tf_efficientnet_b0 | 128 | 1.9543 | 6.9028 | 15.9229 | 178.2054 | 81.514 | 80.0134 |
| swin_base_patch4_window7_224 | 64 | 3.1593 | 13.8812 | nan | nan | 80.3194 | 78.064 |
| sebotnet33ts_256 | 64 | 1.7375 | 6.1698 | 13.6368 | 147.8543 | 77.6077 | 77.043 |
| convnext_base | 64 | 1.5937 | 7.12 | 11.82 | nan | 75.6718 | 74.322 |
| gluon_xception65 | 32 | 2.2248 | 11.0555 | 17.5883 | nan | 74.5538 | 72.2286 |
| mobilenetv2_100 | 128 | 1.7651 | 5.5378 | 13.1407 | 120.4764 | 74.3061 | 72.9447 |
| coat_lite_mini | 128 | 1.1746 | 5.1268 | 8.5135 | 113.6565 | 74.1871 | 73.2382 |
| swsl_resnext101_32x16d | 32 | 2.0012 | 9.2636 | 14.6414 | nan | 72.3633 | 68.1652 |
| mnasnet_100 | 128 | 1.6627 | 5.482 | 13.0156 | 106.5568 | 72.1706 | 71.5708 |
| cait_m36_384 | 4 | 3.7935 | 19.1647 | nan | nan | 70.3996 | 67.0445 |
| regnety_002 | 128 | 1.7077 | 5.7558 | 12.907 | 116.8696 | 67.0447 | 67.8149 |
| jx_nest_base | 32 | 2.0933 | 9.5847 | 15.0667 | nan | 65.6775 | 65.1533 |
| dm_nfnet_f0 | 128 | 2.1293 | 7.1562 | 10.6189 | 163.3484 | 65.6721 | 64.1199 |
| eca_botnext26ts_256 | 128 | 1.3914 | 4.974 | 10.2677 | 120.9947 | 58.9354 | 59.1506 |
| visformer_small | 128 | 0.9022 | 3.9417 | 6.2183 | nan | 58.2882 | 58.109 |
| botnet26t_256 | 128 | 1.3517 | 4.2582 | 9.0048 | 93.387 | 57.1379 | 55.983 |
| ese_vovnet19b_dw | 128 | 0.9928 | 3.0878 | 6.526 | 66.4863 | 53.019 | 52.9377 |
| selecsls42b | 128 | 0.806 | 3.7677 | 5.8199 | 88.4469 | 52.2866 | 51.7557 |
| gernet_l | 128 | 2.0777 | 6.0035 | 15.1739 | 114.4098 | 50.6631 | 49.5043 |
| lcnet_050 | 128 | 1.0104 | 3.2541 | 7.348 | 82.1269 | 50.4779 | 51.0361 |
| nfnet_l0 | 128 | 1.9159 | 7.0968 | 10.5603 | 146.6034 | 48.0838 | 47.0381 |
| gmlp_s16_224 | 128 | 1.3696 | 7.28 | 11.9494 | 196.2288 | 45.582 | 44.0901 |
| crossvit_9_240 | 128 | 1.8131 | 8.4372 | 13.3206 | 203.0935 | 41.7153 | 39.7321 |
| volo_d1_224 | 64 | 1.4626 | 7.5768 | 11.9767 | nan | 41.4465 | 40.9462 |
| tnt_s_patch16_224 | 128 | 2.0027 | 10.5366 | nan | nan | 40.8253 | 38.5342 |
| gmixer_24_224 | 128 | 1.5719 | 8.0976 | 13.3271 | 188.2144 | 37.8892 | 36.5984 |
| repvgg_a2 | 128 | 2.0803 | 6.1175 | 15.7211 | 194.3249 | 37.1752 | 35.6346 |
| convmixer_768_32 | 32 | 1.3704 | 6.2962 | 9.9109 | nan | 31.4347 | 30.7111 |
| convit_base | 64 | 1.3213 | 5.8057 | 9.1766 | 142.6538 | 30.3977 | 29.1532 |
| mixer_b16_224 | 128 | 0.7608 | 3.6725 | 5.7648 | 84.6185 | 25.7875 | 25.3672 |
| resmlp_12_224 | 128 | 0.6493 | 3.0178 | 4.9424 | 53.9323 | 24.4716 | 23.7235 |
| deit_base_distilled_patch16_224 | 64 | 0.9486 | 4.5059 | 7.1449 | 85.6812 | 24.2213 | 22.9072 |
| pit_b_224 | 64 | 1.1061 | 5.1285 | 8.1282 | 111.214 | 23.8631 | 23.0173 |
| beit_base_patch16_224 | 64 | 1.2777 | 5.4577 | nan | nan | 22.8854 | 20.9514 |
| vit_base_patch16_224 | 64 | 0.9119 | 4.4513 | 7.0014 | 85.4846 | 22.7967 | 22.7966 |
| eca_halonext26ts | 128 | 1.4413 | 5.026 | 10.5828 | nan | nan | 76.4004 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 |
| gmlp_s16_224 | 128 | 0.9938 | 0.9715 | 0.3561 | 1.3557 | 1.284 | 1.2997 |
| tinynet_a | 128 | 0.9889 | 0.7884 | 0.2764 | 0.4726 | 1.1634 | 1.4912 |
| tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0842 | 1.1492 |
| pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3632 | nan | 1.0576 | 1.2943 |
| convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0528 | 1.1534 |
| mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.045 | 1.3028 |
| volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.038 | 1.1389 |
| rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.0012 | 1.2582 |
| beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 |
| pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9907 | 1.2271 |
| poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 0.9796 | 0.9842 |
| twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9745 | 1.0806 |
| resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 0.9567 | 1.1357 |
| tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2875 | nan | 0.9484 | 1.057 |
| convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9464 | 0.9678 |
| dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 0.9296 | 1.0969 |
| xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9291 | 0.9904 |
| cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.9289 | 0.9803 |
| tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.548 | 0.9185 | 1.2283 |
| nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2681 | 0.3766 | 0.9137 | 1.123 |
| mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.9088 | 0.9818 |
| visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9066 | 0.9846 |
| mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 0.8962 | 1.1046 |
| vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 1.222 | 0.8916 | 0.8968 |
| deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8911 | 0.8962 |
| mixnet_l | 128 | 0.9902 | 0.8441 | 0.2717 | nan | 0.8815 | 0.98 |
| eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2672 | 0.476 | 0.8765 | 1.1944 |
| dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9309 | 0.8723 | 1.0162 |
| fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.8648 | 1.0056 |
| adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8599 | 0.9862 |
| inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8599 | 0.9862 |
| gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8599 | 0.9862 |
| swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | nan | 0.852 | 0.9728 |
| dpn107 | 32 | 0.997 | 0.9097 | 0.353 | nan | 0.8455 | 0.944 |
| gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8442 | 0.965 |
| cspdarknet53 | 64 | 0.9913 | 0.8405 | 0.3241 | 0.8382 | 0.8368 | 0.9122 |
| crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.8174 | 1.0976 |
| res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8146 | 0.9442 |
| resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.8092 | 0.8239 |
| ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.8041 | 1.0135 |
| convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.8022 | 1.0085 |
| selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8765 | 0.7927 | 0.9534 |
| spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.787 | 0.9293 |
| coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.7861 | 1.0072 |
| mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.7727 | 0.9234 |
| res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.7713 | 0.9528 |
| ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.7707 | 1.0052 |
| res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.7697 | 0.9414 |
| hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.7605 | 0.942 |
| swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 |
| mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.75 | 0.9635 |
| sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5513 | 0.7318 | 0.8133 |
| gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8161 | 0.7239 | 0.9334 |
| fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.7101 | 0.9306 |
| lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | 0.8188 | 0.6955 | 0.8352 |
| jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6706 | 0.8617 |
| botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.6615 | 0.9434 |
| regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.5858 | 0.8993 |
| repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5572 | 0.8383 |
| eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.267 | nan | nan | 1.206 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| convmixer_768_32 | 32 | 297.6786 | 297.1247 | 321.5896 | nan | 411.388 | 416.6184 |
| hrnet_w18 | 128 | 293.7138 | 290.8156 | 345.1675 | nan | 243.0218 | 252.1875 |
| res2next50 | 128 | 138.1969 | 138.6713 | 166.5626 | 120.636 | 222.7227 | 244.4295 |
| resnest101e | 64 | 165.0213 | 165.9986 | 200.2233 | nan | 215.6281 | 211.8944 |
| dla102 | 128 | 178.8123 | 179.4018 | 213.4352 | 135.8911 | 213.681 | 218.4072 |
| pnasnet5large | 16 | 215.5632 | 212.0166 | 261.2169 | nan | 201.361 | 208.9457 |
| res2net50_14w_8s | 128 | 147.9108 | 147.0037 | 180.4162 | 144.5468 | 200.7088 | 237.8128 |
| tf_mixnet_l | 128 | 195.1312 | 210.5092 | 240.9328 | nan | 193.2335 | 193.5492 |
| mixnet_l | 128 | 186.9826 | 202.4749 | 230.5662 | nan | 186.5202 | 186.5591 |
| tnt_s_patch16_224 | 128 | 363.7822 | 364.7163 | nan | nan | 182.9175 | 186.1623 |
| swsl_resnext101_32x16d | 32 | 118.1523 | 120.1956 | 146.0694 | nan | 180.6271 | 176.8422 |
| botnet26t_256 | 128 | 105.8237 | 106.6121 | 127.8999 | 81.2988 | 171.8622 | 161.7561 |
| eca_botnext26ts_256 | 128 | 111.9999 | 135.4594 | 163.7506 | 102.5029 | 170.2004 | 175.8252 |
| res2net101_26w_4s | 64 | 119.1299 | 121.0986 | 126.8222 | nan | 158.8545 | 166.5919 |
| poolformer_m36 | 64 | 149.1898 | 148.9523 | 184.345 | nan | 154.9622 | 156.962 |
| inception_v3 | 128 | 161.28 | 161.6572 | 188.6131 | 140.9131 | 145.2223 | 146.7475 |
| gluon_inception_v3 | 128 | 161.2117 | 162.1784 | 188.9236 | 141.0664 | 144.2298 | 146.6 |
| adv_inception_v3 | 128 | 161.3687 | 162.1392 | 189.2003 | 142.1511 | 139.8511 | 144.6258 |
| convit_base | 64 | 181.6704 | 182.1198 | 217.7066 | 147.1992 | 138.9522 | 131.2065 |
| visformer_small | 128 | 98.529 | 98.0738 | 116.7765 | nan | 138.679 | 140.6386 |
| dpn107 | 32 | 113.7308 | 122.7044 | 142.7124 | nan | 138.5244 | 143.818 |
| rexnet_100 | 128 | 90.8837 | 103.1835 | 127.3733 | nan | 136.3638 | 115.0069 |
| gluon_xception65 | 32 | 97.7762 | 99.0349 | 129.9316 | nan | 130.1851 | 151.9211 |
| fbnetv3_b | 128 | 120.7445 | 122.3308 | 150.6925 | nan | 124.1983 | 123.446 |
| pit_b_224 | 64 | 155.3028 | 155.917 | 188.5588 | 159.4252 | 121.2459 | 121.9528 |
| mobilevit_s | 64 | 90.0604 | 107.3758 | 133.4566 | nan | 118.6645 | 116.6575 |
| sebotnet33ts_256 | 64 | 83.2748 | 96.1974 | 118.3349 | 83.4783 | 116.9329 | 110.2774 |
| cait_m36_384 | 4 | 166.7928 | 168.0279 | nan | nan | 112.5832 | 116.7902 |
| cspdarknet53 | 64 | 95.8745 | 96.8228 | 119.9603 | 79.1707 | 111.7947 | 111.7862 |
| beit_base_patch16_224 | 64 | 135.1948 | 138.2114 | nan | nan | 111.6373 | 112.9823 |
| tf_efficientnet_b0 | 128 | 90.6695 | 108.3883 | 131.1207 | 91.9003 | 111.1408 | 97.884 |
| vit_base_patch16_224 | 64 | 120.573 | 121.1358 | 144.5074 | 132.2505 | 107.288 | 108.0248 |
| repvgg_a2 | 128 | 79.7132 | 80.4361 | 94.7494 | 70.2405 | 101.5223 | 97.4329 |
| dm_nfnet_f0 | 128 | 131.9691 | 131.8275 | 148.8774 | 142.5718 | 100.9365 | 105.676 |
| swin_base_patch4_window7_224 | 64 | 147.53 | 153.3811 | nan | nan | 100.4645 | 104.1555 |
| ese_vovnet19b_dw | 128 | 67.8461 | 68.3214 | 85.9349 | 58.5167 | 99.7279 | 96.6383 |
| convnext_base | 64 | 121.7321 | 122.1988 | 152.1357 | nan | 98.7325 | 106.5115 |
| gernet_l | 128 | 79.7524 | 80.4642 | 98.4007 | 71.1638 | 96.1353 | 98.5027 |
| tinynet_a | 128 | 77.7593 | 90.8223 | 110.8777 | 92.4613 | 93.3648 | 89.9392 |
| mixer_b16_224 | 128 | 118.4076 | 118.6001 | 147.7192 | 131.5715 | 93.1642 | 93.7434 |
| gmlp_s16_224 | 128 | 136.2347 | 136.931 | 173.4596 | 134.5735 | 88.871 | 90.9262 |
| jx_nest_base | 32 | 119.504 | 119.8261 | 148.6855 | nan | 85.715 | 87.5284 |
| volo_d1_224 | 64 | 134.6483 | 135.6171 | 159.3932 | nan | 84.6412 | 86.3507 |
| nfnet_l0 | 128 | 106.5024 | 130.6041 | 148.7443 | 124.6919 | 84.1567 | 89.0559 |
| fbnetc_100 | 128 | 87.8651 | 88.8078 | 105.7832 | 71.7708 | 83.0216 | 77.447 |
| crossvit_9_240 | 128 | 109.3502 | 109.8822 | 130.5115 | 118.8288 | 82.6433 | 83.8584 |
| mobilenetv2_100 | 128 | 67.7092 | 68.3853 | 89.1442 | 59.3675 | 82.592 | 102.0014 |
| gmixer_24_224 | 128 | 119.9546 | 136.23 | 166.6041 | 129.7639 | 77.4108 | 79.643 |
| deit_base_distilled_patch16_224 | 64 | 94.3767 | 94.8785 | 118.0622 | 96.5312 | 73.5881 | 74.7992 |
| selecsls42b | 128 | 62.7584 | 63.1726 | 74.5929 | 49.4566 | 72.0475 | 72.8163 |
| spnasnet_100 | 128 | 76.5481 | 77.4204 | 93.3242 | 65.9149 | 71.0324 | 67.8155 |
| twins_pcpvt_base | 64 | 124.5445 | 136.2945 | 138.0843 | nan | 70.4764 | 77.1764 |
| mobilenetv3_large_100 | 128 | 65.8365 | 66.7243 | 80.5542 | 62.9951 | 64.0445 | 57.8293 |
| coat_lite_mini | 128 | 115.8788 | 116.4864 | 137.3317 | 100.7138 | 63.7774 | 66.5401 |
| xcit_large_24_p8_224 | 5 | 124.0746 | nan | nan | nan | 61.7183 | 79.442 |
| mnasnet_100 | 128 | 70.0777 | 71.1035 | 84.8214 | 55.414 | 61.5292 | 61.8233 |
| resmlp_12_224 | 128 | 68.3598 | 68.3667 | 87.5193 | 46.6616 | 59.9018 | 59.5169 |
| ghostnet_100 | 128 | 95.2343 | 101.786 | 108.0335 | 94.6987 | 57.0377 | 68.1641 |
| regnety_002 | 128 | 53.3292 | 56.0704 | 46.9305 | 61.7106 | 33.1263 | 50.9581 |
| lcnet_050 | 128 | 33.7573 | 34.3999 | 38.789 | 32.2818 | 22.0269 | 25.1165 |
| eca_halonext26ts | 128 | 116.0606 | 139.6354 | 167.7633 | nan | nan | 199.7453 |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~

Performance graphs

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/MdsRWif.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/0iV2ggP.png)

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/baNisEu.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 91%, 42/46  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 91%, 42/46  | 97%, 59/61  |
|     aot_cudagraphs     | 80%, 43/54 | 70%, 32/46  | 90%, 55/61  |
|    nvprims_nvfuser     | 56%, 30/54 |  7%, 3/46   | 52%, 32/61  |
|        inductor        | 81%, 44/54 | 83%, 38/46  | 89%, 54/61  |
| inductor_no_cudagraphs | 87%, 47/54 | 85%, 39/46  | 89%, 54/61  |
+------------------------+------------+-------------+-------------+
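
Each passrate cell above pairs a rounded percentage with a passed/total count. A minimal sketch of that formatting follows; `format_passrate` is a hypothetical helper written for illustration, not the dashboard's actual code, and the rounding rule is an assumption that happens to reproduce the cells shown.

```python
def format_passrate(passed: int, total: int) -> str:
    """Format a pass count as 'NN%, passed/total' (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: the percentage is rounded to the nearest integer.
    percent = int(100 * passed / total + 0.5)
    return f"{percent}%, {passed}/{total}"


# Counts taken from the table above.
print(format_passrate(52, 54))
print(format_passrate(3, 46))
```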

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.24x    |    1.01x    |    1.00x    |
|    nvprims_nvfuser     |   1.01x    |    1.10x    |    1.09x    |
|        inductor        |   1.64x    |    1.64x    |    1.17x    |
| inductor_no_cudagraphs |   1.28x    |    1.56x    |    1.15x    |
+------------------------+------------+-------------+-------------+
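
For readers unfamiliar with the metric, a geometric mean of per-model speedups can be sketched as below. This is an illustration only: the function name is hypothetical, and skipping non-positive or NaN entries is an assumption based on the note that a 0.0 speedup typically signifies a failed run.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive speedups in `speedups`.

    Assumption: entries <= 0 (or NaN) mark failed runs and are skipped.
    """
    valid = [s for s in speedups if s > 0]  # NaN > 0 is False, so NaN is dropped too
    if not valid:
        return float("nan")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))
```

Averaging in log space means a 2x speedup and a 0.5x slowdown cancel to 1.0x, which an arithmetic mean would not do.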

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.02    |    2.94     |    2.27     |
|       aot_eager        |    6.61    |    10.53    |    8.70     |
|     aot_cudagraphs     |    9.82    |    17.39    |    16.17    |
|    nvprims_nvfuser     |   63.40    |   117.61    |   147.41    |
|        inductor        |   73.45    |    37.78    |    76.47    |
| inductor_no_cudagraphs |   70.11    |    34.36    |    74.63    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.83x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.41x    |    0.37x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.07x    |    0.86x    |
|        inductor        |   0.78x    |    0.92x    |    0.88x    |
| inductor_no_cudagraphs |   0.92x    |    1.07x    |    1.03x    |
+------------------------+------------+-------------+-------------+
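
The report does not spell out how the compression ratio is computed. One reading consistent with "higher is better" is native peak memory divided by the backend's peak memory; the sketch below assumes that definition, and the function and parameter names are hypothetical.

```python
def compression_ratio(eager_peak_bytes: float, backend_peak_bytes: float) -> float:
    """Peak-memory compression ratio under the assumed definition eager / backend.

    A value above 1.0 means the backend used less peak memory than native pytorch.
    """
    if backend_peak_bytes <= 0:
        return float("nan")  # no valid measurement for the backend
    return eager_peak_bytes / backend_peak_bytes
```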

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the two most recent reports that actually run the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691
Previous report name: /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_amp_195

Passrate diff

~~~
+------------------------+-------------+------------+------------+
| compiler               | suite       | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
| inductor               | torchbench  | 83%, 45/54 | 81%, 44/54 |
| inductor               | huggingface | 90%, 38/42 | 83%, 38/46 |
| inductor               | timm_models | 92%, 56/61 | 87%, 53/61 |
| inductor_no_cudagraphs | torchbench  | 87%, 47/54 | 87%, 47/54 |
| inductor_no_cudagraphs | huggingface | 90%, 38/42 | 85%, 39/46 |
| inductor_no_cudagraphs | timm_models | 92%, 56/61 | 89%, 54/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff

~~~
+------------------------+-------------+------------+-----------+
| compiler               | suite       | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
| inductor               | torchbench  | 1.89x      | 1.65x     |
| inductor               | huggingface | 1.81x      | 1.64x     |
| inductor               | timm_models | 1.43x      | 1.18x     |
| inductor_no_cudagraphs | torchbench  | 1.38x      | 1.28x     |
| inductor_no_cudagraphs | huggingface | 1.57x      | 1.57x     |
| inductor_no_cudagraphs | timm_models | 1.37x      | 1.15x     |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:

- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings

~~~
+-------------+--------------------------------+---------------+------------------------+
| suite       | name                           | inductor      | inductor_no_cudagraphs |
+-------------+--------------------------------+---------------+------------------------+
| torchbench  | tacotron2                      | fail_to_run   | pass                   |
| torchbench  | functorch_dp_cifar10           | fail_to_run   | fail_to_run            |
| torchbench  | hf_BigBird                     | fail_to_run   | fail_to_run            |
| torchbench  | hf_Longformer                  | fail_to_run   | fail_to_run            |
| torchbench  | moco                           | fail_to_run   | fail_to_run            |
| torchbench  | vision_maskrcnn                | fail_to_run   | fail_to_run            |
| torchbench  | timm_efficientdet              | fail_accuracy | fail_accuracy          |
| torchbench  | mobilenet_v3_large             | fail_accuracy | fail_accuracy          |
| torchbench  | tts_angular                    | 0.0000        | 0.0000                 |
| huggingface | DebertaV2ForQuestionAnswering  | fail_to_run   | pass                   |
| huggingface | PLBartForConditionalGeneration | fail_to_run   | fail_to_run            |
| huggingface | YituTechConvBert               | fail_to_run   | fail_to_run            |
| huggingface | MBartForConditionalGeneration  | fail_to_run   | fail_to_run            |
| huggingface | AllenaiLongformerBase          | fail_to_run   | fail_to_run            |
| huggingface | DebertaV2ForMaskedLM           | fail_to_run   | fail_to_run            |
| huggingface | BlenderbotForCausalLM          | 0.0000        | 0.0000                 |
| timm_models | eca_halonext26ts               | fail_to_run   | fail_to_run            |
| timm_models | convit_base                    | fail_to_run   | fail_to_run            |
| timm_models | ghostnet_100                   | fail_accuracy | fail_accuracy          |
| timm_models | gluon_xception65               | fail_accuracy | fail_accuracy          |
| timm_models | resnest101e                    | fail_accuracy | fail_accuracy          |
| timm_models | hrnet_w18                      | fail_accuracy | fail_accuracy          |
| timm_models | spnasnet_100                   | fail_accuracy | fail_accuracy          |
+-------------+--------------------------------+---------------+------------------------+
~~~

Performance speedup
warnings ~~~ +-------------+-------------------------------+----------+------------------------+ | suite | name | inductor | inductor_no_cudagraphs | +-------------+-------------------------------+----------+------------------------+ | torchbench | timm_vovnet | 1.0486 | 0.9471 | | torchbench | mobilenet_v2 | 0.8964 | 0.7974 | | torchbench | timm_regnet | 0.8961 | 0.871 | | torchbench | resnet50 | 0.8766 | 0.7954 | | torchbench | yolov3 | 0.8545 | 0.8592 | | torchbench | functorch_dp_cifar10 | 0.0 | 0.0 | | torchbench | hf_Longformer | 0.0 | 0.0 | | torchbench | hf_BigBird | 0.0 | 0.0 | | torchbench | tacotron2 | 0.0 | 0.8751 | | torchbench | hf_GPT2_large | 0.0 | 1.8592 | | torchbench | dlrm | 0.0 | 1.0793 | | torchbench | moco | 0.0 | 0.0 | | huggingface | DebertaV2ForMaskedLM | 1.2472 | 0.9077 | | huggingface | DebertaV2ForQuestionAnswering | 1.1144 | 0.931 | | huggingface | BlenderbotForCausalLM | 0.0 | 1.173 | | huggingface | YituTechConvBert | 0.0 | 0.0 | | huggingface | AllenaiLongformerBase | 0.0 | 0.0 | | timm_models | fbnetc_100 | 1.0358 | 0.9115 | | timm_models | fbnetv3_b | 0.9734 | 0.9494 | | timm_models | tf_efficientnet_b0 | 0.9674 | 0.8134 | | timm_models | res2net101_26w_4s | 0.8593 | 0.7682 | | timm_models | selecsls42b | 0.8536 | 0.8584 | | timm_models | dla102 | 0.8376 | 0.8189 | | timm_models | rexnet_100 | 0.8169 | 0.6897 | | timm_models | tinynet_a | 0.811 | 0.7832 | | timm_models | gernet_l | 0.8069 | 0.8277 | | timm_models | cspdarknet53 | 0.8035 | 0.8117 | | timm_models | dpn107 | 0.7969 | 0.7537 | | timm_models | resnest101e | 0.783 | 0.8038 | | timm_models | swsl_resnext101_32x16d | 0.7688 | 0.6575 | | timm_models | mobilevit_s | 0.7518 | 0.8171 | | timm_models | gluon_xception65 | 0.7283 | 0.7193 | | timm_models | visformer_small | 0.7277 | 0.7417 | | timm_models | convmixer_768_32 | 0.7255 | 0.7432 | | timm_models | repvgg_a2 | 0.6994 | 0.7209 | | timm_models | sebotnet33ts_256 | 0.6861 | 0.6741 | | timm_models | ese_vovnet19b_dw | 
0.649 | 0.7146 | | timm_models | res2net50_14w_8s | 0.6367 | 0.6068 | | timm_models | mobilenetv2_100 | 0.6111 | 0.6437 | | timm_models | eca_botnext26ts_256 | 0.6039 | 0.6495 | | timm_models | botnet26t_256 | 0.5782 | 0.599 | | timm_models | res2next50 | 0.5625 | 0.6006 | | timm_models | eca_halonext26ts | 0.0 | 0.6228 | +-------------+-------------------------------+----------+------------------------+ ~~~ Compilation latency (sec) warnings ~~~ +-------------+-------------------------------+-----------+------------------------+ | suite | name | inductor | inductor_no_cudagraphs | +-------------+-------------------------------+-----------+------------------------+ | torchbench | yolov3 | 1095.2147 | 1081.3675 | | torchbench | densenet121 | 399.6087 | 400.9448 | | torchbench | timm_efficientdet | 290.5294 | 293.7364 | | torchbench | mobilenet_v3_large | 143.7937 | 141.8553 | | torchbench | timm_efficientnet | 120.3219 | 119.9608 | | huggingface | DebertaV2ForQuestionAnswering | 198.2625 | 56.4764 | | huggingface | DebertaV2ForMaskedLM | 181.6771 | 57.1136 | | huggingface | XLNetLMHeadModel | 134.726 | 138.8378 | | timm_models | hrnet_w18 | 207.5764 | 200.8124 | | timm_models | pnasnet5large | 199.2617 | 194.6135 | | timm_models | res2net50_14w_8s | 196.7245 | 194.4486 | | timm_models | ghostnet_100 | 194.5552 | 194.4957 | | timm_models | twins_pcpvt_base | 141.1447 | 139.377 | | timm_models | res2net101_26w_4s | 137.0176 | 132.124 | | timm_models | dpn107 | 130.6184 | 129.4981 | | timm_models | rexnet_100 | 124.0767 | 121.1943 | | timm_models | mobilevit_s | 120.6869 | 115.077 | +-------------+-------------------------------+-----------+------------------------+ ~~~ Peak Memory Compression Ratio warnings ~~~ +-------------+---------------------------------+----------+------------------------+ | suite | name | inductor | inductor_no_cudagraphs | +-------------+---------------------------------+----------+------------------------+ | torchbench | mobilenet_v2 | 0.885 
| 1.0922 | | torchbench | timm_vision_transformer_large | 0.879 | 0.9542 | | torchbench | hf_Bert | 0.8735 | 0.942 | | torchbench | Background_Matting | 0.8561 | 1.0424 | | torchbench | hf_T5_large | 0.8541 | 0.8541 | | torchbench | fastNLP_Bert | 0.8521 | 1.0681 | | torchbench | hf_DistilBert | 0.8384 | 0.9048 | | torchbench | timm_regnet | 0.836 | 1.0095 | | torchbench | yolov3 | 0.8316 | 0.9829 | | torchbench | hf_Bart | 0.8224 | 1.0097 | | torchbench | shufflenet_v2_x1_0 | 0.8124 | 0.9457 | | torchbench | resnet152 | 0.8058 | 0.9398 | | torchbench | alexnet | 0.7973 | 1.0079 | | torchbench | pytorch_unet | 0.7877 | 0.7907 | | torchbench | pytorch_stargan | 0.7783 | 0.8847 | | torchbench | vgg16 | 0.7633 | 1.0588 | | torchbench | drq | 0.752 | 0.9256 | | torchbench | soft_actor_critic | 0.7295 | 1.0368 | | torchbench | timm_resnest | 0.7215 | 0.9567 | | torchbench | timm_vision_transformer | 0.7151 | 0.7249 | | torchbench | timm_vovnet | 0.6882 | 0.8809 | | torchbench | resnet50 | 0.6745 | 0.8696 | | torchbench | mnasnet1_0 | 0.659 | 0.7666 | | torchbench | mobilenet_v3_large | 0.6582 | 0.811 | | torchbench | resnext50_32x4d | 0.651 | 0.7706 | | torchbench | squeezenet1_1 | 0.6336 | 0.7041 | | torchbench | hf_Reformer | 0.5851 | 1.0017 | | torchbench | lennard_jones | 0.564 | 0.9991 | | torchbench | nvidia_deeprecommender | 0.5596 | 0.5596 | | torchbench | resnet18 | 0.5498 | 0.6181 | | torchbench | densenet121 | 0.5389 | 0.6133 | | torchbench | LearningToPaint | 0.4882 | 0.6195 | | torchbench | pytorch_struct | 0.4235 | 0.4353 | | torchbench | dcgan | 0.2123 | 0.2137 | | torchbench | dlrm | nan | 0.7306 | | torchbench | tacotron2 | nan | 0.4112 | | huggingface | DistilBertForMaskedLM | 0.8716 | 0.9439 | | huggingface | Speech2Text2ForCausalLM | 0.8672 | 0.9793 | | huggingface | ElectraForCausalLM | 0.856 | 0.9327 | | huggingface | M2M100ForConditionalGeneration | 0.8468 | 1.023 | | huggingface | BlenderbotSmallForCausalLM | 0.846 | 0.9426 | | huggingface | 
XGLMForCausalLM | 0.8055 | 0.9902 | | huggingface | MobileBertForMaskedLM | 0.6698 | 0.9649 | | huggingface | DebertaV2ForMaskedLM | 0.6117 | 0.9912 | | huggingface | MobileBertForQuestionAnswering | 0.5988 | 0.8126 | | huggingface | Reformer | 0.5813 | 1.0027 | | huggingface | DebertaV2ForQuestionAnswering | 0.5266 | 0.9885 | | huggingface | DebertaForMaskedLM | 0.409 | 1.0674 | | huggingface | DebertaForQuestionAnswering | 0.3071 | 1.1614 | | timm_models | mobilenetv2_100 | 0.8962 | 1.1046 | | timm_models | vit_base_patch16_224 | 0.8916 | 0.8968 | | timm_models | deit_base_distilled_patch16_224 | 0.8911 | 0.8962 | | timm_models | mixnet_l | 0.8815 | 0.98 | | timm_models | eca_botnext26ts_256 | 0.8765 | 1.1944 | | timm_models | dla102 | 0.8723 | 1.0162 | | timm_models | fbnetv3_b | 0.8648 | 1.0056 | | timm_models | adv_inception_v3 | 0.8599 | 0.9862 | | timm_models | gluon_inception_v3 | 0.8599 | 0.9862 | | timm_models | inception_v3 | 0.8599 | 0.9862 | | timm_models | swsl_resnext101_32x16d | 0.852 | 0.9728 | | timm_models | dpn107 | 0.8455 | 0.944 | | timm_models | gluon_xception65 | 0.8442 | 0.965 | | timm_models | cspdarknet53 | 0.8368 | 0.9122 | | timm_models | crossvit_9_240 | 0.8174 | 1.0976 | | timm_models | res2net101_26w_4s | 0.8146 | 0.9442 | | timm_models | resmlp_12_224 | 0.8092 | 0.8239 | | timm_models | ese_vovnet19b_dw | 0.8041 | 1.0135 | | timm_models | convnext_base | 0.8022 | 1.0085 | | timm_models | selecsls42b | 0.7927 | 0.9534 | | timm_models | spnasnet_100 | 0.787 | 0.9294 | | timm_models | coat_lite_mini | 0.7861 | 1.0072 | | timm_models | mnasnet_100 | 0.7727 | 0.9234 | | timm_models | res2net50_14w_8s | 0.7713 | 0.9528 | | timm_models | ghostnet_100 | 0.7707 | 1.0052 | | timm_models | res2next50 | 0.7697 | 0.9414 | | timm_models | hrnet_w18 | 0.7605 | 0.942 | | timm_models | swin_base_patch4_window7_224 | 0.7566 | 0.9257 | | timm_models | mobilenetv3_large_100 | 0.7499 | 0.9634 | | timm_models | sebotnet33ts_256 | 0.7318 | 0.8133 | | 
timm_models | gernet_l | 0.7239 | 0.9334 | | timm_models | fbnetc_100 | 0.7101 | 0.9306 | | timm_models | lcnet_050 | 0.6955 | 0.8352 | | timm_models | jx_nest_base | 0.6706 | 0.8617 | | timm_models | botnet26t_256 | 0.6615 | 0.9434 | | timm_models | regnety_002 | 0.5858 | 0.8993 | | timm_models | repvgg_a2 | 0.5572 | 0.8383 | +-------------+---------------------------------+----------+------------------------+ ~~~
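
The four flagging criteria listed at the start of the Warnings section can be sketched as a single check per model and compiler. This is a minimal illustration, not the dashboard's code: `flag_model`, the flag names, and the treatment of `pass_due_to_skip` as passing are assumptions. NaN measurements are not flagged here, because comparisons against NaN evaluate to false.

```python
SPEEDUP_MIN = 0.95               # flag when speedup < 0.95x
COMPILE_LATENCY_MAX_SEC = 120.0  # flag when compilation latency > 120 sec
COMPRESSION_MIN = 0.9            # flag when compression ratio < 0.9

# Assumption: skipped accuracy checks count as passing.
PASSING_STATUSES = ("pass", "pass_due_to_skip")


def flag_model(accuracy, speedup, compile_sec, compression):
    """Return the list of warning categories triggered by one result row."""
    flags = []
    if accuracy not in PASSING_STATUSES:
        flags.append("accuracy")
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_sec > COMPILE_LATENCY_MAX_SEC:
        flags.append("compilation_latency")
    if compression < COMPRESSION_MIN:
        flags.append("compression_ratio")
    return flags
```

For example, a row with speedup 0.7518 and compilation latency 120.6869 sec (the inductor numbers reported for mobilevit_s) triggers both the speedup and the compilation latency warnings, while a latency of 119.9608 sec stays just under the threshold.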

Recent Regressions

For each relevant compiler, we compare the two most recent reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

No regressions found.

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

Compilation latency (sec) regressions

~~~
+------------------------+------------------+-------------+------------+
| compiler               | name             | prev_status | cur_status |
+------------------------+------------------+-------------+------------+
| inductor               | XLNetLMHeadModel | 115.9403    | 134.726    |
| inductor_no_cudagraphs | XLNetLMHeadModel | 118.8197    | 138.8378   |
+------------------------+------------------+-------------+------------+
~~~

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

Performance speedup regressions

~~~
+------------------------+------------+-------------+------------+
| compiler               | name       | prev_status | cur_status |
+------------------------+------------+-------------+------------+
| inductor_no_cudagraphs | fbnetv3_b  | 0.9635      | 0.9494     |
| inductor_no_cudagraphs | fbnetc_100 | 1.0805      | 0.9115     |
+------------------------+------------+-------------+------------+
~~~

Compilation latency (sec) regressions

~~~
+----------+-------------+-------------+------------+
| compiler | name        | prev_status | cur_status |
+----------+-------------+-------------+------------+
| inductor | mobilevit_s | 117.0541    | 120.6869   |
+----------+-------------+-------------+------------+
~~~

No regressions found.
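
The regression check described in this section (previously unflagged models that are now flagged) can be sketched as a comparison of two per-model result mappings. The function name, the dict-based inputs and the `is_flagged` callback are hypothetical; the actual report generator may differ.

```python
def find_regressions(prev, cur, is_flagged):
    """Return {name: (prev_value, cur_value)} for models flagged now but not before.

    `prev` and `cur` map model names to a metric value from the previous and
    current report. Models missing from the previous report are skipped
    (an assumption made for this sketch).
    """
    regressions = {}
    for name, cur_value in cur.items():
        if name not in prev:
            continue
        prev_value = prev[name]
        if is_flagged(cur_value) and not is_flagged(prev_value):
            regressions[name] = (prev_value, cur_value)
    return regressions


# Compilation latency values reported above for the inductor compiler.
prev_latency = {"XLNetLMHeadModel": 115.9403, "mobilevit_s": 117.0541}
cur_latency = {"XLNetLMHeadModel": 134.726, "mobilevit_s": 120.6869}
print(find_regressions(prev_latency, cur_latency, lambda sec: sec > 120.0))
```

Both models cross the 120 sec threshold between the two reports, so both are returned. A model that was already over the threshold in the previous report would not be, since it was flagged before.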

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0021 | 0.9235 | 2.5012 | 0.7272 | 4.9772 | 1.1386 | | timm_efficientdet | 1 | 0.9851 | 0.8086 | 2.1191 | 0.0 | 3.9742 | 1.3867 | | BERT_pytorch | 16 | 1.0128 | 0.8327 | 1.7003 | 0.7688 | 3.4424 | 2.3455 | | timm_vision_transformer | 8 | 1.0031 | 0.8539 | 1.8912 | 0.5921 | 3.1169 | 1.5792 | | dcgan | 32 | 0.9857 | 0.9162 | 1.6667 | 0.717 | 2.9151 | 1.0531 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9957 | 0.9846 | 1.7225 | 0.0 | 2.7348 | 1.5256 | | resnet18 | 16 | 0.9991 | 0.9958 | 1.6285 | 0.7985 | 2.6458 | 1.1827 | | hf_T5_large | 2 | 1.0213 | 0.8755 | 0.0 | 0.0 | 2.5504 | 2.1994 | | hf_Albert | 8 | 1.0019 | 0.9611 | 0.7748 | 0.0 | 2.3796 | 2.3197 | | mobilenet_v3_large | 32 | 1.0032 | 1.0005 | 1.4817 | 0.7625 | 2.2072 | 1.1414 | | resnext50_32x4d | 8 | 1.0032 | 0.9563 | 1.8887 | 0.7408 | 2.1165 | 1.0276 | | pytorch_struct | 200 | 0.9998 | 0.7441 | 1.0843 | 0.6248 | 2.113 | 1.2709 | | squeezenet1_1 | 32 | 0.9975 | 0.9617 | 1.4451 | 0.7175 | 2.1077 | 1.1654 | | lennard_jones | 1000 | 0.9659 | 0.7677 | 1.2685 | 0.5334 | 2.0885 | 1.0674 | | hf_Bert | 4 | 1.0335 | 0.8439 | 0.947 | 0.0 | 2.075 | 1.8619 | | drq | 1 | 0.9964 | 0.7964 | 1.969 | 0.593 | 2.0604 | 1.1474 | | hf_GPT2 | 4 | 1.0204 | 0.9785 | 0.8234 | 0.2754 | 1.9178 | 1.9134 | | hf_T5 | 8 | 0.9999 | 0.9239 | 0.0 | 1.3513 | 1.8819 | 1.892 | | mnasnet1_0 | 32 | 0.9977 | 1.0187 | 1.2739 | 0.7651 | 1.8795 | 1.1112 | | hf_Bart | 4 | 1.0135 | 0.8416 | 0.8798 | 0.0 | 1.7483 | 1.7468 | | soft_actor_critic | 256 | 0.9892 | 0.7492 | 1.306 | 0.5337 | 1.7138 | 1.0328 | | LearningToPaint | 96 | 0.9996 | 1.012 
| 1.2519 | 0.8349 | 1.6866 | 1.2339 | | shufflenet_v2_x1_0 | 128 | 1.0002 | 1.0158 | 0.981 | 0.8506 | 1.6507 | 1.2293 | | attention_is_all_you_need_pytorch | 256 | 1.0092 | 0.9155 | 0.899 | 0.0 | 1.5886 | 1.4838 | | speech_transformer | 32 | 0.9984 | 0.8571 | 1.7537 | 0.6362 | 1.5652 | 1.5648 | | fastNLP_Bert | 6 | 0.9978 | 0.9037 | 0.8123 | 0.0 | 1.5242 | 1.4707 | | hf_DistilBert | 8 | 1.0006 | 0.9738 | 0.7423 | 0.3417 | 1.5124 | 1.4801 | | timm_efficientnet | 32 | 0.96 | 0.811 | 1.0929 | 0.6824 | 1.5119 | 1.0025 | | pytorch_stargan | 16 | 0.995 | 1.1028 | 1.0338 | 0.0 | 1.4647 | 1.401 | | pytorch_unet | 1 | 0.9993 | 0.2121 | 0.0 | 0.0 | 1.3526 | 1.3241 | | timm_nfnet | 128 | 0.999 | 0.9995 | 0.8821 | 0.9206 | 1.2911 | 1.2393 | | vgg16 | 64 | 0.9993 | 0.9969 | 0.8571 | 0.9733 | 1.2781 | 1.2693 | | resnet152 | 32 | 0.9997 | 1.0024 | 1.2722 | 0.0 | 1.247 | 1.0037 | | Super_SloMo | 6 | 0.9996 | 0.177 | 0.0 | 0.0 | 1.2281 | 1.1945 | | alexnet | 128 | 0.999 | 0.998 | 0.8153 | 0.9224 | 1.2082 | 1.207 | | timm_resnest | 32 | 1.007 | 1.0201 | 0.8683 | 0.9636 | 1.1883 | 1.0091 | | hf_Reformer | 4 | 0.9976 | 1.0008 | 0.9925 | 0.6515 | 1.1775 | 1.1734 | | Background_Matting | 4 | 1.0001 | 0.1454 | 0.0 | 0.0 | 1.1256 | 1.1027 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9903 | 0.0 | 0.0 | 1.1103 | 1.0919 | | timm_vovnet | 32 | 0.919 | 0.8931 | 0.9187 | 0.8041 | 1.0486 | 0.9471 | | tts_angular | 64 | 0.9759 | 0.9438 | 0.971 | 0.952 | 1.0069 | 1.0197 | | demucs | 4 | 1.0006 | 0.9985 | 1.0018 | 0.9983 | 0.9987 | 1.0006 | | nvidia_deeprecommender | 256 | 0.9989 | 0.9969 | 0.6966 | 1.0072 | 0.989 | 1.0306 | | mobilenet_v2 | 96 | 0.9996 | 0.9891 | 0.7604 | 1.0281 | 0.8964 | 0.7974 | | timm_regnet | 32 | 0.9776 | 0.9398 | 0.9955 | 0.7801 | 0.8961 | 0.871 | | resnet50 | 32 | 1.0017 | 1.0117 | 1.1259 | 0.8113 | 0.8766 | 0.7954 | | yolov3 | 16 | 0.9996 | 0.99 | 0.8051 | 0.0 | 0.8545 | 0.8592 | | functorch_dp_cifar10 | 64 | 1.0005 | 0.9513 | 2.2485 | 0.0 | 0.0 | 0.0 | | 
hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_BigBird | 2 | 0.9601 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | tacotron2 | 64 | 0.968 | 0.7604 | 0.9754 | 0.5994 | 0.0 | 0.8751 | | hf_GPT2_large | 4 | 1.0002 | 0.9905 | 0.0 | 0.0 | 0.0 | 1.8592 | | dlrm | 2048 | 1.0831 | 1.0906 | 0.0 | 1.1023 | 0.0 | 1.0793 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass 
| pass | fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | 0.0000 | fail_to_run | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | 
pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | fail_to_run | fail_to_run |
| timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+
| yolov3 | 16 | 3.134 | 8.3485 | 11.8312 | nan | 1095.2147 | 1081.3675 |
| densenet121 | 4 | 2.4551 | 11.9254 | 19.3745 | 238.249 | 399.6087 | 400.9448 |
| timm_efficientdet | 1 | 20.0748 | 37.2082 | 77.3012 | nan | 290.5294 | 293.7364 |
| mobilenet_v3_large | 32 | 1.0517 | 4.8262 | 7.2363 | 120.475 | 143.7937 | 141.8553 |
| timm_efficientnet | 32 | 1.9233 | 6.7602 | 15.7626 | 149.7354 | 120.3219 | 119.9608 |
| mnasnet1_0 | 32 | 0.9953 | 4.3585 | 6.5507 | 87.7722 | 118.7061 | 116.5561 |
| hf_T5_large | 2 | 14.7545 | 40.6947 | nan | nan | 118.7004 | 117.5833 |
| resnext50_32x4d | 8 | 1.0313 | 4.6014 | 7.1613 | 81.0849 | 105.0649 | 104.0637 |
| resnet152 | 32 | 2.7302 | 13.4314 | 22.2487 | nan | 104.8761 | 102.0057 |
| timm_vovnet | 32 | 1.6202 | 4.5438 | 10.4175 | 70.4366 | 92.5113 | 92.5444 |
| mobilenet_v2 | 96 | 0.9374 | 4.5817 | 7.1092 | 111.6778 | 78.1857 | 77.6467 |
| shufflenet_v2_x1_0 | 128 | 1.1157 | 5.0476 | 7.6952 | 104.009 | 76.445 | 76.1455 |
| resnet50 | 32 | 0.988 | 4.5599 | 7.207 | 98.438 | 72.7939 | 72.4359 |
| timm_vision_transformer_large | 8 | 3.1135 | 14.9532 | nan | nan | 70.2686 | 68.9305 |
| timm_nfnet | 128 | 2.1613 | 7.1471 | 10.5832 | 163.6933 | 64.757 | 64.2603 |
| timm_regnet | 32 | 2.4441 | 8.0633 | 20.0674 | 141.4078 | 63.5851 | 63.0572 |
| timm_resnest | 32 | 0.6708 | 2.5118 | 4.0924 | 67.0756 | 59.7944 | 60.5418 |
| squeezenet1_1 | 32 | 0.2731 | 0.9335 | 1.4041 | 6.6783 | 53.9548 | 53.511 |
| resnet18 | 16 | 0.4894 | 1.8093 | 2.6452 | 37.8414 | 41.338 | 41.119 |
| LearningToPaint | 96 | 0.4955 | 1.9291 | 3.0728 | 46.5876 | 34.3305 | 34.354 |
| timm_vision_transformer | 8 | 1.033 | 4.5971 | 6.9 | 88.697 | 34.1412 | 34.111 |
| hf_Bart | 4 | 2.0363 | 8.8906 | 13.8831 | nan | 32.7831 | 32.4179 |
| BERT_pytorch | 16 | 1.8241 | 7.6201 | 11.4507 | 105.4626 | 31.4894 | 31.5765 |
| attention_is_all_you_need_pytorch | 256 | 1.4401 | 7.0322 | 11.7053 | nan | 31.3277 | 30.6045 |
| hf_T5 | 8 | 2.7396 | 8.9164 | nan | 93.8054 | 30.1908 | 28.9888 |
| Background_Matting | 4 | 1.0492 | 8.7926 | nan | nan | 29.3679 | 29.1396 |
| fastNLP_Bert | 6 | 1.8579 | 7.2501 | 11.9817 | nan | 28.9541 | 27.2169 |
| pytorch_stargan | 16 | 0.4375 | 1.9872 | 2.8822 | nan | 27.1058 | 26.6409 |
| speech_transformer | 32 | 2.0096 | 9.0558 | 34.0506 | 182.7663 | 27.1006 | 26.0147 |
| pytorch_struct | 200 | 0.2985 | 0.8552 | 1.6431 | 7.4518 | 22.3521 | 22.7492 |
| hf_Bert | 4 | 1.8776 | 7.1929 | 10.2216 | nan | 21.0045 | 20.1359 |
| Super_SloMo | 6 | 1.1667 | 7.2032 | nan | nan | 20.8976 | 20.2818 |
| hf_Albert | 8 | 1.6614 | 6.6574 | 10.2219 | nan | 20.0628 | 19.3648 |
| hf_GPT2 | 4 | 1.7385 | 6.668 | 9.389 | 88.9934 | 20.0197 | 19.5092 |
| hf_Reformer | 4 | 1.6872 | 2.8935 | 5.5898 | 16.7964 | 18.2663 | 15.9925 |
| hf_DistilBert | 8 | 0.858 | 3.5287 | 5.9638 | 57.5172 | 13.533 | 13.5271 |
| pytorch_unet | 1 | 0.5084 | 3.0753 | nan | nan | 11.9356 | 11.5376 |
| dcgan | 32 | 0.182 | 0.4283 | 0.669 | 5.0323 | 10.4217 | 10.0519 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.4677 | 1.9927 | 2.8247 | nan | 8.8841 | 9.0972 |
| drq | 1 | 0.34 | 0.6624 | 1.0652 | 6.141 | 4.3275 | 3.6795 |
| vgg16 | 64 | 0.2154 | 0.6729 | 1.021 | 5.4391 | 4.0071 | 3.7176 |
| nvidia_deeprecommender | 256 | 0.2204 | 0.5291 | 0.8263 | 5.6768 | 3.8066 | 3.4753 |
| soft_actor_critic | 256 | 0.2269 | 0.3759 | 0.607 | 3.2756 | 3.5464 | 3.0456 |
| alexnet | 128 | 0.1763 | 0.449 | 0.7219 | 4.8412 | 3.3142 | 3.0785 |
| lennard_jones | 1000 | 0.1705 | 0.3522 | 0.5293 | 3.0422 | 2.1956 | 1.9932 |
| tts_angular | 64 | 0.1973 | 0.2484 | 0.3779 | 1.4794 | 1.95 | 1.7714 |
| demucs | 4 | 0.3595 | 0.354 | 0.3548 | 0.3545 | 0.2637 | 0.2847 |
| hf_GPT2_large | 4 | 5.8806 | 20.258 | nan | nan | nan | 52.5003 |
| tacotron2 | 64 | 5.0828 | 17.9354 | 33.3274 | 87.6773 | nan | 43.908 |
| dlrm | 2048 | 0.4737 | 0.8348 | nan | 4.8473 | nan | 3.4167 |
| functorch_dp_cifar10 | 64 | 0.3484 | 1.4056 | 2.1954 | nan | nan | nan |
| hf_BigBird | 2 | 4.0068 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| hf_Albert | 8 | 1.0001 | 0.936 | 0.3267 | nan | 1.1576 | 1.4693 |
| speech_transformer | 32 | 0.9991 | 0.9812 | 0.3343 | 1.1938 | 1.0923 | 1.0983 |
| attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3514 | nan | 1.024 | 1.176 |
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 |
| timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3084 | nan | 0.9837 | 1.1225 |
| BERT_pytorch | 16 | 1.0003 | 0.8825 | 0.4001 | 1.1128 | 0.9743 | 1.1216 |
| timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2716 | 0.4638 | 0.9696 | 1.2228 |
| hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.3799 | 1.1206 | 0.9649 | 1.1241 |
| Super_SloMo | 6 | 1.0024 | 0.8284 | nan | nan | 0.9647 | 1.2945 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8754 | 0.4232 | nan | 0.9506 | 1.0499 |
| hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.0304 | 0.9309 | 1.252 |
| timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3556 | 0.4816 | 0.9298 | 1.0969 |
| mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 0.885 | 1.0922 |
| timm_vision_transformer_large | 8 | 0.9974 | 0.8357 | nan | nan | 0.879 | 0.9542 |
| hf_Bert | 4 | 1.0 | 0.8759 | 0.3902 | nan | 0.8735 | 0.942 |
| Background_Matting | 4 | 1.0138 | 0.6522 | nan | nan | 0.8561 | 1.0424 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 |
| fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3701 | nan | 0.8521 | 1.0681 |
| hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3413 | 1.0708 | 0.8384 | 0.9048 |
| timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.8027 | 0.836 | 1.0095 |
| yolov3 | 16 | 0.9908 | 0.8381 | 0.3537 | nan | 0.8316 | 0.9829 |
| hf_Bart | 4 | 1.0002 | 0.8307 | 0.3635 | nan | 0.8224 | 1.0097 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8124 | 0.9457 |
| resnet152 | 32 | 0.9937 | 0.8956 | 0.3632 | nan | 0.8058 | 0.9398 |
| alexnet | 128 | 0.951 | 0.7753 | 0.4793 | 0.775 | 0.7973 | 1.0079 |
| pytorch_unet | 1 | 0.9968 | 0.7229 | nan | nan | 0.7877 | 0.7907 |
| pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4252 | nan | 0.7783 | 0.8847 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.734 | 0.7633 | 1.0588 |
| drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 |
| soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 |
| timm_resnest | 32 | 0.9868 | 0.8711 | 0.3482 | 0.8451 | 0.7215 | 0.9567 |
| timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3917 | 1.0881 | 0.7151 | 0.7249 |
| timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3409 | 0.7755 | 0.6882 | 0.8809 |
| resnet50 | 32 | 0.9907 | 0.8629 | 0.3559 | 0.7806 | 0.6745 | 0.8696 |
| mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3408 | 0.8226 | 0.659 | 0.7666 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3445 | 0.7921 | 0.6582 | 0.811 |
| resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3884 | 0.81 | 0.651 | 0.7706 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.6336 | 0.7041 |
| hf_Reformer | 4 | 0.9996 | 0.9996 | 0.6037 | 0.9999 | 0.5851 | 1.0017 |
| lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5124 | 0.5596 | 0.5596 | 0.5596 |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.3941 | 0.7314 | 0.5498 | 0.6181 |
| densenet121 | 4 | 0.9857 | 0.8678 | 0.3673 | 0.8452 | 0.5389 | 0.6133 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3826 | 0.6701 | 0.4882 | 0.6195 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.7838 | 0.2123 | 0.2137 |
| hf_GPT2_large | 4 | 0.9956 | 0.8732 | nan | nan | nan | 1.1499 |
| dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | nan | 0.7306 |
| tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3993 | nan | 0.4112 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | nan | nan |
| hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Absolute latency (ms)
~~~
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| timm_vision_transformer_large | 8 | 183.8959 | 186.1366 | nan | nan | 165.9434 | 168.7193 |
| Background_Matting | 4 | 134.009 | 917.3248 | nan | nan | 118.6561 | 120.9338 |
| timm_nfnet | 128 | 131.4888 | 131.3932 | 149.0993 | 142.8237 | 101.4271 | 106.0537 |
| hf_T5 | 8 | 174.4437 | 188.4253 | nan | 129.2268 | 92.682 | 92.2092 |
| hf_T5_large | 2 | 222.7954 | 288.8347 | nan | nan | 88.8981 | 104.4206 |
| timm_regnet | 32 | 73.1219 | 75.6414 | 81.7811 | 92.7217 | 81.0701 | 83.7383 |
| yolov3 | 16 | 68.4303 | 69.3157 | 85.3786 | nan | 80.4882 | 79.6424 |
| resnet152 | 32 | 91.3029 | 93.1947 | 73.1804 | nan | 76.1231 | 95.0104 |
| hf_Reformer | 4 | 82.3475 | 82.1115 | 83.1249 | 125.9114 | 69.75 | 70.2174 |
| Super_SloMo | 6 | 79.2116 | 447.9775 | nan | nan | 64.4475 | 66.2032 |
| demucs | 4 | 58.4853 | 57.296 | 56.7208 | 57.1674 | 56.8319 | 56.8994 |
| mobilenet_v2 | 96 | 48.9189 | 49.3577 | 64.3643 | 47.5012 | 54.6298 | 61.2715 |
| vgg16 | 64 | 65.946 | 66.2932 | 77.0334 | 67.9781 | 52.0285 | 52.4107 |
| timm_efficientdet | 1 | 160.4251 | 195.4788 | 78.2072 | nan | 42.1974 | 121.8818 |
| speech_transformer | 32 | 59.9293 | 78.1035 | 35.2842 | 93.3656 | 40.113 | 39.5408 |
| resnet50 | 32 | 33.0562 | 32.8566 | 32.4627 | 41.7138 | 39.0699 | 43.7437 |
| fastNLP_Bert | 6 | 55.6501 | 61.7822 | 75.1471 | nan | 36.4738 | 38.0243 |
| attention_is_all_you_need_pytorch | 256 | 52.8771 | 57.2243 | 64.4196 | nan | 34.0385 | 35.4652 |
| hf_Bart | 4 | 54.7445 | 67.8413 | 65.1056 | nan | 33.1366 | 34.5117 |
| timm_efficientnet | 32 | 48.595 | 57.1715 | 43.386 | 68.6567 | 31.967 | 47.2654 |
| timm_vovnet | 32 | 34.6849 | 36.0937 | 38.5358 | 39.6984 | 30.8784 | 34.4547 |
| pytorch_unet | 1 | 39.9382 | 188.4861 | nan | nan | 29.5133 | 30.1178 |
| hf_Albert | 8 | 68.388 | 72.345 | 88.408 | nan | 28.7093 | 29.3861 |
| shufflenet_v2_x1_0 | 128 | 41.0993 | 39.3995 | 41.5458 | 47.954 | 25.8975 | 33.1085 |
| hf_GPT2 | 4 | 48.615 | 50.6486 | 60.2339 | 180.5103 | 25.4266 | 25.9595 |
| hf_Bert | 4 | 39.7901 | 58.5156 | 44.1921 | nan | 21.0094 | 23.0711 |
| timm_resnest | 32 | 24.5935 | 23.9317 | 30.0417 | 25.7094 | 20.9578 | 24.7455 |
| hf_DistilBert | 8 | 31.1314 | 32.0089 | 41.8905 | 91.0749 | 20.5531 | 20.931 |
| mobilenet_v3_large | 32 | 35.3507 | 35.2462 | 23.9251 | 47.6584 | 16.7037 | 31.8025 |
| mnasnet1_0 | 32 | 33.7249 | 28.491 | 23.1534 | 38.4342 | 16.4544 | 26.8832 |
| BERT_pytorch | 16 | 53.6579 | 64.5292 | 35.3809 | 69.3401 | 16.1317 | 28.7421 |
| densenet121 | 4 | 73.7988 | 79.5697 | 30.561 | 102.0255 | 15.278 | 68.0473 |
| resnext50_32x4d | 8 | 28.8347 | 30.1384 | 17.1158 | 39.6003 | 14.1625 | 30.2683 |
| pytorch_stargan | 16 | 16.1836 | 14.4003 | 15.4747 | nan | 10.9264 | 11.4087 |
| nvidia_deeprecommender | 256 | 10.3887 | 10.4041 | 14.8755 | 10.303 | 10.4748 | 10.0553 |
| timm_vision_transformer | 8 | 28.9968 | 34.8567 | 17.2255 | 50.6399 | 9.8848 | 19.9614 |
| LearningToPaint | 96 | 14.7931 | 14.7842 | 12.799 | 18.1226 | 8.8943 | 12.1158 |
| alexnet | 128 | 9.7922 | 9.8257 | 12.0231 | 10.6383 | 8.1041 | 8.1293 |
| squeezenet1_1 | 32 | 14.4163 | 15.2055 | 10.1423 | 21.0032 | 7.2066 | 14.2304 |
| pytorch_CycleGAN_and_pix2pix | 1 | 18.1708 | 17.9925 | 10.2159 | nan | 6.7301 | 14.2976 |
| tts_angular | 64 | 7.0213 | 7.0314 | 6.6096 | 6.486 | 6.4254 | 6.3264 |
| resnet18 | 16 | 13.5749 | 12.905 | 8.0824 | 16.7476 | 4.9972 | 12.2808 |
| pytorch_struct | 200 | 4.62 | 6.0456 | 5.0616 | 8.4798 | 2.2713 | 3.6972 |
| drq | 1 | 3.9204 | 5.4871 | 1.9909 | 6.4704 | 2.2341 | 3.4608 |
| dcgan | 32 | 3.1456 | 3.3866 | 1.9267 | 4.4245 | 1.1048 | 2.9899 |
| soft_actor_critic | 256 | 1.4216 | 1.9604 | 1.1264 | 2.7523 | 0.864 | 1.443 |
| lennard_jones | 1000 | 1.6753 | 1.9497 | 1.1779 | 3.9294 | 0.7261 | 1.4441 |
| tacotron2 | 64 | 3127.7052 | 3933.2329 | 3096.5187 | 5003.0861 | nan | 3485.6686 |
| dlrm | 2048 | 473.2604 | 470.163 | nan | 526.2215 | nan | 512.0995 |
| hf_GPT2_large | 4 | 209.5302 | 211.6916 | nan | nan | nan | 112.6548 |
| functorch_dp_cifar10 | 64 | 14.0609 | 14.6503 | 6.9536 | nan | nan | nan |
| hf_BigBird | 2 | 197.8019 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
~~~
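Reading the torchbench compilation-latency and absolute-latency tables together gives a rough feel for how many iterations a backend needs before its compile time pays for itself. The sketch below is only an illustration: `break_even_iterations` is a hypothetical helper, the three rows are copied from the `eager` and `inductor` columns of those two tables, and it assumes that the `eager` column is a fair per-iteration baseline and that latency stays constant across iterations. The dashboard states neither assumption.

```python
import math

# (model, inductor compile time in s, eager-column latency in ms, inductor latency in ms)
# Values copied from the torchbench "Compilation latency (sec)" and
# "Absolute latency (ms)" tables above.
ROWS = [
    ("densenet121", 399.6087, 73.7988, 15.278),
    ("resnet18", 41.338, 13.5749, 4.9972),
    ("yolov3", 1095.2147, 68.4303, 80.4882),
]


def break_even_iterations(compile_s, baseline_ms, compiled_ms):
    """Iterations needed for per-iteration savings to repay compile time.

    Hypothetical helper: returns math.inf when the compiled run is not
    faster than the baseline, since the compile cost is then never repaid.
    """
    saved_ms = baseline_ms - compiled_ms
    if saved_ms <= 0:
        return math.inf
    return math.ceil(compile_s * 1000.0 / saved_ms)


for name, compile_s, baseline_ms, compiled_ms in ROWS:
    n = break_even_iterations(compile_s, baseline_ms, compiled_ms)
    print(f"{name}: break-even after {n} iterations")
```

For rows such as `yolov3`, where the `inductor` latency is higher than the `eager` one, the helper reports infinity, which matches the intuition that compile time cannot be amortized without a per-iteration gain.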

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| MobileBertForMaskedLM | 64 | 1.0201 | 0.8329 | 1.2588 | 0.0 | 2.9332 | 1.8062 |
| MT5ForConditionalGeneration | 16 | 1.0207 | 0.8776 | 1.102 | 0.8728 | 2.4109 | 2.0414 |
| OPTForCausalLM | 2 | 0.9995 | 0.9444 | 0.0 | 0.7935 | 2.3949 | 2.3801 |
| GPT2ForSequenceClassification | 4 | 1.0018 | 0.9692 | 0.0 | 0.4579 | 2.3097 | 2.2707 |
| MobileBertForQuestionAnswering | 128 | 1.0217 | 0.8345 | 0.9915 | 0.0 | 2.2453 | 1.7966 |
| ElectraForQuestionAnswering | 64 | 1.0004 | 0.9802 | 0.7695 | 0.0 | 2.1113 | 2.0532 |
| XLNetLMHeadModel | 8 | 0.9995 | 0.9711 | 0.0 | 0.0 | 1.9217 | 1.9205 |
| LayoutLMForSequenceClassification | 16 | 0.9998 | 0.9792 | 0.7765 | 0.0 | 1.8486 | 1.8007 |
| ElectraForCausalLM | 32 | 0.9994 | 0.9439 | 0.7164 | 0.0 | 1.8211 | 1.824 |
| BertForQuestionAnswering | 16 | 0.9996 | 0.9809 | 0.7716 | 0.0 | 1.8156 | 1.7727 |
| RobertaForQuestionAnswering | 16 | 0.9995 | 0.9802 | 0.7736 | 0.0 | 1.8142 | 1.764 |
| XGLMForCausalLM | 8 | 1.0101 | 0.8382 | 0.9391 | 0.0 | 1.7717 | 1.5909 |
| RobertaForCausalLM | 16 | 1.0004 | 0.9715 | 0.7615 | 0.0 | 1.7016 | 1.6839 |
| DistillGPT2 | 16 | 0.9995 | 0.9686 | 0.7635 | 0.7441 | 1.6873 | 1.7156 |
| AlbertForQuestionAnswering | 4 | 0.9999 | 0.8854 | 0.0 | 0.0 | 1.6719 | 1.6613 |
| PLBartForConditionalGeneration | 4 | 0.9998 | 0.9551 | 0.7371 | 0.0 | 1.6599 | 1.6562 |
| AlbertForMaskedLM | 4 | 1.0002 | 0.8854 | 0.0 | 0.0 | 1.6579 | 1.6498 |
| MegatronBertForQuestionAnswering | 8 | 0.9996 | 0.9874 | 0.7721 | 0.0 | 1.6208 | 1.5806 |
| M2M100ForConditionalGeneration | 16 | 1.0081 | 0.8529 | 0.8756 | 0.7026 | 1.6177 | 1.4833 |
| LayoutLMForMaskedLM | 16 | 1.0002 | 0.9679 | 0.7571 | 0.0 | 1.6176 | 1.599 |
| MegatronBertForCausalLM | 4 | 1.0182 | 0.9161 | 0.8292 | 0.0 | 1.6074 | 1.5624 |
| T5Small | 4 | 1.0053 | 0.9205 | 0.7555 | 1.1626 | 1.6048 | 1.6625 |
| T5ForConditionalGeneration | 4 | 1.0005 | 0.922 | 0.7577 | 1.1535 | 1.6027 | 1.6661 |
| BertForMaskedLM | 16 | 1.0001 | 0.9714 | 0.7529 | 0.0 | 1.6001 | 1.5846 |
| PLBartForCausalLM | 8 | 1.0003 | 0.9675 | 0.7639 | 0.9816 | 1.5901 | 1.6058 |
| CamemBert | 16 | 1.0 | 0.9738 | 0.7658 | 0.0 | 1.5305 | 1.5219 |
| BartForConditionalGeneration | 2 | 1.0046 | 0.9701 | 0.0 | 0.0 | 1.5273 | 1.4236 |
| DistilBertForQuestionAnswering | 256 | 0.9997 | 0.9949 | 0.7576 | 0.6657 | 1.51 | 1.4947 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.006 | 0.9258 | 0.7372 | 0.0 | 1.4953 | 1.4182 |
| MBartForConditionalGeneration | 2 | 1.0037 | 0.9684 | 0.0 | 0.7788 | 1.4695 | 1.4335 |
| MBartForCausalLM | 4 | 0.9999 | 0.9686 | 0.7584 | 0.9938 | 1.4326 | 1.4179 |
| BartForCausalLM | 4 | 1.0011 | 0.9695 | 0.7576 | 0.0 | 1.4222 | 1.4236 |
| Speech2Text2ForCausalLM | 256 | 0.9979 | 0.9466 | 0.6879 | 0.9035 | 1.3396 | 1.3621 |
| DebertaForMaskedLM | 4 | 0.8909 | 0.7162 | 0.7872 | 0.0 | 1.2699 | 1.1377 |
| PegasusForConditionalGeneration | 32 | 1.0072 | 0.9588 | 0.0 | 0.7982 | 1.2686 | 1.2676 |
| TrOCRForCausalLM | 32 | 0.9997 | 0.9637 | 0.0 | 0.0 | 1.2509 | 1.2591 |
| DebertaV2ForMaskedLM | 1 | 0.8791 | 0.6911 | 0.7813 | 0.0 | 1.2472 | 0.9077 |
| DistilBertForMaskedLM | 128 | 0.9994 | 0.96 | 0.7145 | 0.6514 | 1.2224 | 1.237 |
| BlenderbotSmallForCausalLM | 64 | 1.0005 | 0.9283 | 0.7217 | 0.0 | 1.2131 | 1.2271 |
| Reformer | 16 | 0.9998 | 1.0003 | 0.9798 | 0.9397 | 1.1814 | 1.1967 |
| PegasusForCausalLM | 32 | 1.002 | 0.9512 | 0.7504 | 0.8571 | 1.164 | 1.1684 |
| DebertaForQuestionAnswering | 8 | 0.9738 | 0.8577 | 0.724 | 0.0 | 1.1435 | 1.2202 |
| DebertaV2ForQuestionAnswering | 2 | 0.8958 | 0.6974 | 0.0 | 0.0 | 1.1144 | 0.931 |
| BlenderbotForCausalLM | 4 | 1.0115 | 0.0 | 0.0 | 0.0 | 0.0 | 1.173 |
| YituTechConvBert | 16 | 1.0003 | 0.9649 | 0.7905 | 0.0 | 0.0 | 0.0 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | pass |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| YituTechConvBert | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| DebertaV2ForMaskedLM | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| BlenderbotForCausalLM | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| DebertaV2ForQuestionAnswering | 2 | 8.885 | 18.716 | nan | nan | 198.2625 | 56.4764 |
| DebertaV2ForMaskedLM | 1 | 8.7782 | 18.859 | 79.508 | nan | 181.6771 | 57.1136 |
| XLNetLMHeadModel | 8 | 4.926 | 20.2914 | nan | nan | 134.726 | 138.8378 |
| DebertaForQuestionAnswering | 8 | 5.0045 | 11.114 | 35.6909 | nan | 103.1909 | 39.5539 |
| DebertaForMaskedLM | 4 | 4.9021 | 11.0475 | 35.7023 | nan | 98.8866 | 39.0646 |
| XGLMForCausalLM | 8 | 3.1048 | 13.7758 | 28.3347 | nan | 78.1332 | 75.6486 |
| MobileBertForQuestionAnswering | 128 | 10.0228 | 33.228 | 58.9471 | nan | 72.7888 | 71.0537 |
| MobileBertForMaskedLM | 64 | 9.8455 | 32.6935 | 58.4768 | nan | 71.1132 | 71.6519 |
| M2M100ForConditionalGeneration | 16 | 4.1696 | 17.0736 | 29.2473 | 325.3531 | 63.5739 | 61.524 |
| MT5ForConditionalGeneration | 16 | 3.906 | 13.0414 | 21.4257 | 156.2895 | 55.6215 | 54.7232 |
| PegasusForConditionalGeneration | 32 | 3.838 | 17.2384 | nan | 393.1795 | 53.605 | 50.2826 |
| BartForConditionalGeneration | 2 | 3.9472 | 16.9732 | nan | nan | 52.9803 | 50.8612 |
| MBartForConditionalGeneration | 2 | 3.9593 | 17.1209 | nan | 417.587 | 51.1931 | 51.5109 |
| MegatronBertForCausalLM | 4 | 3.9469 | 14.4662 | 23.7885 | nan | 41.051 | 40.3343 |
| MegatronBertForQuestionAnswering | 8 | 3.9305 | 14.8601 | 22.9434 | nan | 40.7529 | 39.7247 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.4927 | 11.0522 | 18.8386 | nan | 35.3295 | 34.2441 |
| T5ForConditionalGeneration | 4 | 2.7092 | 8.8144 | 13.6105 | 98.7235 | 31.9196 | 31.2607 |
| T5Small | 4 | 2.7349 | 9.1022 | 13.7395 | 97.8311 | 31.306 | 30.4928 |
| PLBartForConditionalGeneration | 4 | 2.0337 | 9.0631 | 13.8251 | nan | 30.3782 | 30.8577 |
| LayoutLMForSequenceClassification | 16 | 2.2507 | 7.5371 | 11.5123 | nan | 28.8052 | 27.6798 |
| ElectraForCausalLM | 32 | 2.0215 | 7.2268 | 11.3179 | nan | 27.1951 | 25.7692 |
| PegasusForCausalLM | 32 | 1.5386 | 6.5432 | 10.3693 | 106.3195 | 26.258 | 24.4877 |
| MBartForCausalLM | 4 | 1.5435 | 6.7166 | 10.0768 | 114.2072 | 24.318 | 23.6535 |
| BartForCausalLM | 4 | 1.536 | 6.6389 | 10.1135 | nan | 23.6349 | 23.0314 |
| LayoutLMForMaskedLM | 16 | 2.2942 | 7.8784 | 11.6916 | nan | 23.3865 | 22.4107 |
| TrOCRForCausalLM | 32 | 1.4302 | 6.5231 | nan | nan | 23.0112 | 22.5272 |
| BertForMaskedLM | 16 | 1.9167 | 7.1559 | 10.82 | nan | 22.1858 | 21.4338 |
| ElectraForQuestionAnswering | 64 | 1.9153 | 7.1505 | 10.9369 | nan | 22.1044 | 21.0668 |
| RobertaForCausalLM | 16 | 1.9131 | 7.4152 | 10.9707 | nan | 21.9734 | 21.287 |
| OPTForCausalLM | 2 | 1.5703 | 7.0168 | nan | 109.4645 | 21.9101 | 21.3855 |
| BertForQuestionAnswering | 16 | 1.9257 | 7.1161 | 10.6121 | nan | 21.6977 | 20.7658 |
| CamemBert | 16 | 1.9612 | 7.2699 | 10.8549 | nan | 21.0575 | 20.4185 |
| RobertaForQuestionAnswering | 16 | 1.9396 | 7.1994 | 10.7177 | nan | 20.4876 | 19.6343 |
| GPT2ForSequenceClassification | 4 | 1.7262 | 6.801 | nan | 87.1217 | 19.199 | 18.8963 |
| AlbertForMaskedLM | 4 | 1.7106 | 6.9327 | nan | nan | 19.1015 | 18.5486 |
| AlbertForQuestionAnswering | 4 | 1.6901 | 6.704 | nan | nan | 19.0669 | 18.6257 |
| Reformer | 16 | 1.5387 | 2.9983 | 5.9435 | 18.248 | 18.4933 | 14.8571 |
| BlenderbotSmallForCausalLM | 64 | 1.0349 | 4.3933 | 6.6139 | nan | 16.9134 | 16.4032 |
| Speech2Text2ForCausalLM | 256 | 0.8709 | 3.5381 | 5.496 | 52.5225 | 15.522 | 14.2512 |
| PLBartForCausalLM | 8 | 0.8748 | 3.543 | 5.2023 | 66.9334 | 14.5921 | 14.164 |
| DistillGPT2 | 16 | 0.939 | 3.4709 | 5.0795 | 43.5821 | 13.6855 | 13.4899 |
| DistilBertForMaskedLM | 128 | 0.8433 | 3.5536 | 5.9683 | 57.1822 | 12.5922 | 12.4181 |
| DistilBertForQuestionAnswering | 256 | 0.8802 | 3.4834 | 6.0322 | 56.2573 | 11.9566 | 11.7782 |
| BlenderbotForCausalLM | 4 | 2.8524 | nan | nan | nan | nan | 42.5771 |
| YituTechConvBert | 16 | 2.7004 | 10.7629 | 17.3688 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9997 | 0.9183 | nan | 1.2641 | 1.2906 | 1.345 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0992 | 1.5169 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9161 | nan | 1.2229 | 1.0775 | 1.1712 |
| MBartForCausalLM | 4 | 1.0 | 0.8998 | 0.3747 | 1.3748 | 1.0747 | 1.1342 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0568 | 1.1144 |
| XLNetLMHeadModel | 8 | 0.9999 | 0.9214 | nan | nan | 1.0303 | 1.0303 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 |
| MBartForConditionalGeneration | 2 | 1.0 | 0.9035 | nan | 1.3227 | 1.0148 | 1.2186 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 |
| PegasusForConditionalGeneration | 32 | 0.9979 | 0.9502 | nan | 1.2087 | 1.0039 | 1.1394 |
| RobertaForQuestionAnswering | 16 | 1.004 | 0.9315 | 0.3619 | nan | 1.0036 | 1.0618 |
| BertForQuestionAnswering | 16 | 1.004 | 0.9312 | 0.3618 | nan | 1.0029 | 1.0617 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9976 | 1.1976 |
| DistilBertForQuestionAnswering | 256 | 1.0112 | 0.9568 | 0.3185 | 1.1483 | 0.9806 | 1.0864 |
| DistillGPT2 | 16 | 1.0 | 0.8673 | 0.3596 | 1.1412 | 0.9755 | 1.0618 |
| PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | 1.1321 | 0.9708 | 1.0363 |
| T5Small | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9662 | 1.1856 |
| T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9662 | 1.1856 |
| PLBartForConditionalGeneration | 4 | 0.9997 | 0.9325 | 0.3746 | nan | 0.9651 | 1.0848 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9593 | 1.1105 |
| MegatronBertForQuestionAnswering | 8 | 1.0006 | 0.9101 | 0.3721 | nan | 0.9562 | 1.0239 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9481 | 0.9848 |
| BertForMaskedLM | 16 | 1.0001 | 0.9237 | 0.3656 | nan | 0.9481 | 0.9849 |
| RobertaForCausalLM | 16 | 1.0 | 0.9237 | 0.3654 | nan | 0.9475 | 0.9847 |
| CamemBert | 16 | 1.0 | 0.9212 | 0.3657 | nan | 0.9446 | 0.983 |
| TrOCRForCausalLM | 32 | 0.9998 | 0.8789 | nan | nan | 0.9345 | 1.0129 |
| MT5ForConditionalGeneration | 16 | 1.0015 | 0.864 | 0.415 | 1.0159 | 0.9203 | 1.0032 |
| PLBartForCausalLM | 8 | 0.9999 | 0.8707 | 0.3624 | 1.0907 | 0.9166 | 0.989 |
| MegatronBertForCausalLM | 4 | 1.0 | 0.8798 | 0.3875 | nan | 0.9121 | 1.0221 |
| DistilBertForMaskedLM | 128 | 1.0 | 0.8497 | 0.3516 | 1.0867 | 0.8716 | 0.9439 |
| Speech2Text2ForCausalLM | 256 | 0.9668 | 0.839 | 0.3505 | 1.0447 | 0.8672 | 0.9793 |
| ElectraForCausalLM | 32 | 0.9977 | 0.848 | 0.3928 | nan | 0.856 | 0.9327 |
| M2M100ForConditionalGeneration | 16 | 0.9744 | 0.9205 | 0.406 | 1.0697 | 0.8468 | 1.023 |
| BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.846 | 0.9426 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9164 | 0.4336 | nan | 0.8055 | 0.9902 |
| MobileBertForMaskedLM | 64 | 0.9999 | 0.8791 | 0.3355 | nan | 0.6698 | 0.9649 |
| DebertaV2ForMaskedLM | 1 | 0.9982 | 0.941 | 0.4918 | nan | 0.6117 | 0.9912 |
| MobileBertForQuestionAnswering | 128 | 1.0159 | 1.0063 | 0.306 | nan | 0.5988 | 0.8126 |
| Reformer | 16 | 0.9771 | 0.9998 | 0.5635 | 0.9998 | 0.5813 | 1.0027 |
| DebertaV2ForQuestionAnswering | 2 | 0.9796 | 0.9796 | nan | nan | 0.5266 | 0.9885 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9825 | 0.3623 | nan | 0.409 | 1.0674 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3252 | nan | 0.3071 | 1.1614 |
| BlenderbotForCausalLM | 4 | 1.0002 | nan | nan | nan | nan | 0.9343 |
| YituTechConvBert | 16 | 0.9954 | 0.9136 | 0.3774 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| Reformer | 16 | 299.8613 | 299.7951 | 306.4569 | 319.263 | 254.3895 | 250.5634 |
| AlbertForMaskedLM | 4 | 266.5285 | 301.5118 | nan | nan | 161.1729 | 161.7473 |
| AlbertForQuestionAnswering | 4 | 264.2786 | 298.9283 | nan | nan | 158.5859 | 159.4294 |
| XLNetLMHeadModel | 8 | 276.0228 | 284.1911 | nan | nan | 143.2551 | 143.4202 |
| PegasusForConditionalGeneration | 32 | 140.2368 | 153.9418 | nan | 175.1191 | 111.5204 | 112.782 |
| TrOCRForCausalLM | 32 | 137.0812 | 142.033 | nan | nan | 109.7038 | 109.1742 |
| DebertaV2ForQuestionAnswering | 2 | 132.727 | 146.5966 | nan | nan | 94.877 | 115.8318 |
| BartForConditionalGeneration | 2 | 135.5849 | 139.9916 | nan | nan | 93.9736 | 96.1639 |
| MBartForConditionalGeneration | 2 | 135.8062 | 140.6014 | nan | 184.5573 | 93.0837 | 95.2109 |
| MegatronBertForQuestionAnswering | 8 | 141.0275 | 145.5381 | 182.8021 | nan | 86.9311 | 89.3016 |
| DebertaV2ForMaskedLM | 1 | 112.8597 | 144.524 | 130.277 | nan | 83.7752 | 115.7271 |
| MobileBertForQuestionAnswering | 128 | 173.0162 | 220.3584 | 181.3025 | nan | 83.2197 | 103.9109 |
| BartForCausalLM | 4 | 112.1053 | 114.4559 | 148.0829 | nan | 78.8678 | 78.8551 |
| BlenderbotSmallForConditionalGeneration | 64 | 108.7538 | 118.6411 | 152.331 | nan | 78.6637 | 78.3038 |
| MBartForCausalLM | 4 | 112.1498 | 116.021 | 148.3091 | 113.0151 | 78.314 | 78.3912 |
| CamemBert | 16 | 118.1072 | 121.1047 | 154.2823 | nan | 77.1005 | 77.4868 |
| PLBartForCausalLM | 8 | 113.4132 | 114.5064 | 145.4805 | 115.4783 | 71.2118 | 70.4302 |
| M2M100ForConditionalGeneration | 16 | 107.4146 | 127.3742 | 128.199 | 154.5852 | 70.5998 | 77.5964 |
| PLBartForConditionalGeneration | 4 | 116.1474 | 121.4344 | 157.8974 | nan | 69.9748 | 70.5328 |
| LayoutLMForMaskedLM | 16 | 112.0375 | 116.6458 | 148.0754 | nan | 69.3671 | 70.1569 |
| OPTForCausalLM | 2 | 165.9279 | 175.8457 | nan | 209.6011 | 69.1973 | 69.7771 |
| DistilBertForMaskedLM | 128 | 84.1293 | 87.5601 | 117.7071 | 128.944 | 68.7639 | 67.9756 |
| BertForMaskedLM | 16 | 109.419 | 112.6992 | 145.4846 | nan | 68.4975 | 69.183 |
| DistilBertForQuestionAnswering | 256 | 102.8143 | 103.2765 | 135.7433 | 154.4329 | 68.2365 | 68.877 |
| RobertaForCausalLM | 16 | 114.4986 | 117.9048 | 150.4773 | nan | 67.4387 | 68.0322 |
| DebertaForQuestionAnswering | 8 | 76.9695 | 87.4408 | 104.1397 | nan | 65.8752 | 61.4085 |
| MobileBertForMaskedLM | 64 | 172.0871 | 215.2121 | 143.6068 | nan | 64.5809 | 118.8505 |
| T5Small | 4 | 101.2299 | 110.3475 | 134.4387 | 87.323 | 63.1565 | 60.6853 |
| T5ForConditionalGeneration | 4 | 100.9408 | 110.216 | 134.6336 | 88.0824 | 63.1345 | 60.7318 |
| DistillGPT2 | 16 | 105.7416 | 109.1425 | 138.5764 | 142.0706 | 62.6451 | 61.6278 |
| PegasusForCausalLM | 32 | 68.9871 | 72.7768 | 91.8643 | 80.8775 | 59.7418 | 59.7822 |
| ElectraForQuestionAnswering | 64 | 114.4393 | 116.81 | 149.1052 | nan | 54.2357 | 55.8524 |
| MegatronBertForCausalLM | 4 | 85.0806 | 94.9081 | 115.4499 | nan | 54.0767 | 56.1922 |
| LayoutLMForSequenceClassification | 16 | 97.1292 | 99.225 | 125.3475 | nan | 52.7113 | 54.0883 |
| XGLMForCausalLM | 8 | 84.7106 | 105.7642 | 94.3541 | nan | 52.6541 | 63.6312 |
| RobertaForQuestionAnswering | 16 | 95.2592 | 97.123 | 123.1651 | nan | 52.5279 | 53.9629 |
| BertForQuestionAnswering | 16 | 94.8107 | 96.5993 | 122.7138 | nan | 52.1699 | 53.4549 |
| DebertaForMaskedLM | 4 | 67.2282 | 85.4869 | 79.4174 | nan | 49.9405 | 56.8679 |
| BlenderbotSmallForCausalLM | 64 | 58.7411 | 63.149 | 81.683 | nan | 48.4465 | 48.0424 |
| ElectraForCausalLM | 32 | 87.6598 | 92.4058 | 122.0501 | nan | 47.8711 | 47.8742 |
| MT5ForConditionalGeneration | 16 | 91.6536 | 108.7988 | 86.1785 | 107.9132 | 40.3196 | 47.1825 |
| Speech2Text2ForCausalLM | 256 | 53.0891 | 56.185 | 77.1915 | 58.7852 | 39.8711 | 38.9164 |
| GPT2ForSequenceClassification | 4 | 91.9426 | 93.7716 | nan | 200.0079 | 39.1925 | 39.8588 |
| BlenderbotForCausalLM | 4 | 90.6276 | nan | nan | nan | nan | 79.4009 |
| YituTechConvBert | 16 | 133.0078 | 137.7873 | 168.7043 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | xcit_large_24_p8_224 | 5 | 1.0001 | 0.0 | 0.0 | 0.0 | 2.1798 | 1.7014 | | tnt_s_patch16_224 | 128 | 0.9999 | 0.9982 | 0.0 | 0.0 | 1.9873 | 1.9618 | | twins_pcpvt_base | 64 | 1.0059 | 0.9217 | 0.9095 | 0.0 | 1.8861 | 1.6772 | | coat_lite_mini | 128 | 1.0001 | 0.9958 | 0.847 | 1.1501 | 1.7685 | 1.7826 | | ghostnet_100 | 128 | 0.999 | 0.9771 | 0.8956 | 1.0099 | 1.6198 | 1.4595 | | volo_d1_224 | 64 | 0.9996 | 0.9937 | 0.8455 | 0.0 | 1.5928 | 1.5573 | | lcnet_050 | 128 | 0.968 | 0.9459 | 0.8561 | 1.0426 | 1.581 | 1.2808 | | cait_m36_384 | 4 | 1.0002 | 0.9876 | 0.0 | 0.0 | 1.5502 | 1.4814 | | gmlp_s16_224 | 128 | 0.9998 | 0.9956 | 0.7868 | 1.0096 | 1.5282 | 1.4999 | | gmixer_24_224 | 128 | 1.0 | 0.8796 | 0.7223 | 0.9239 | 1.525 | 1.5057 | | swin_base_patch4_window7_224 | 64 | 0.9995 | 0.9603 | 0.0 | 0.0 | 1.4697 | 1.4713 | | regnety_002 | 128 | 0.9768 | 0.9401 | 1.109 | 0.8596 | 1.4104 | 1.1869 | | jx_nest_base | 32 | 0.9998 | 0.993 | 0.7952 | 0.0 | 1.3903 | 1.3534 | | crossvit_9_240 | 128 | 0.9997 | 0.9949 | 0.8383 | 0.9183 | 1.3308 | 1.3056 | | convit_base | 64 | 0.9998 | 0.9968 | 0.8337 | 1.2317 | 1.2969 | 1.3723 | | deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9916 | 0.7969 | 0.9757 | 1.2823 | 1.2585 | | pit_b_224 | 64 | 0.9998 | 0.9952 | 0.8221 | 0.9726 | 1.2787 | 1.2743 | | dm_nfnet_f0 | 128 | 0.998 | 0.9983 | 0.8804 | 0.9223 | 1.2777 | 1.2295 | | mixer_b16_224 | 128 | 0.9999 | 0.9975 | 0.8025 | 0.9009 | 1.2725 | 1.27 | | nfnet_l0 | 128 | 0.9951 | 0.8098 | 0.7107 | 0.8494 | 1.2495 | 1.1825 | | hrnet_w18 | 128 | 1.0042 | 1.0228 | 0.8917 | 0.0 | 1.2141 | 1.1238 | | 
beit_base_patch16_224 | 64 | 0.9998 | 0.979 | 0.0 | 0.0 | 1.2086 | 1.2033 | | convnext_base | 64 | 0.9992 | 0.9949 | 0.7965 | 0.0 | 1.1685 | 1.1528 | | mnasnet_100 | 128 | 0.9531 | 0.9448 | 0.79 | 1.2102 | 1.1564 | 1.1004 | | resmlp_12_224 | 128 | 1.0 | 0.9988 | 0.7831 | 1.4867 | 1.1529 | 1.1166 | | vit_base_patch16_224 | 64 | 0.9997 | 0.9937 | 0.8349 | 0.9097 | 1.129 | 1.1161 | | gluon_inception_v3 | 128 | 1.0 | 0.9962 | 0.8547 | 1.1433 | 1.123 | 1.0665 | | adv_inception_v3 | 128 | 0.9997 | 0.9964 | 0.8547 | 1.1425 | 1.1136 | 1.0911 | | inception_v3 | 128 | 1.0002 | 0.9961 | 0.8542 | 1.1436 | 1.1021 | 1.0714 | | pnasnet5large | 16 | 1.0058 | 1.0302 | 0.8438 | 0.0 | 1.0944 | 1.0652 | | spnasnet_100 | 128 | 0.9452 | 0.9393 | 0.7772 | 1.0887 | 1.0643 | 1.0031 | | mixnet_l | 128 | 0.9796 | 0.9053 | 0.7947 | 0.0 | 1.0359 | 1.0032 | | fbnetc_100 | 128 | 0.9535 | 0.9447 | 0.794 | 1.1574 | 1.0358 | 0.9115 | | tf_mixnet_l | 128 | 0.9806 | 0.9073 | 0.7964 | 0.0 | 1.0304 | 0.987 | | mobilenetv3_large_100 | 128 | 0.9554 | 0.9449 | 0.7837 | 0.9881 | 1.0034 | 0.9801 | | fbnetv3_b | 128 | 0.9524 | 0.9421 | 0.7802 | 0.0 | 0.9734 | 0.9494 | | tf_efficientnet_b0 | 128 | 0.9657 | 0.8083 | 0.6674 | 0.9506 | 0.9674 | 0.8134 | | poolformer_m36 | 64 | 0.9996 | 0.9979 | 0.8081 | 0.0 | 0.9539 | 0.9516 | | res2net101_26w_4s | 64 | 1.005 | 1.0011 | 1.0068 | 0.0 | 0.8593 | 0.7682 | | selecsls42b | 128 | 0.9999 | 0.9955 | 0.8426 | 1.2817 | 0.8536 | 0.8584 | | dla102 | 128 | 1.0001 | 0.9963 | 0.8391 | 1.3143 | 0.8376 | 0.8189 | | rexnet_100 | 128 | 0.9649 | 0.8505 | 0.6914 | 0.0 | 0.8169 | 0.6897 | | tinynet_a | 128 | 0.9718 | 0.8054 | 0.6541 | 0.7843 | 0.811 | 0.7832 | | gernet_l | 128 | 0.9472 | 0.9387 | 0.7689 | 1.0521 | 0.8069 | 0.8277 | | cspdarknet53 | 64 | 0.9415 | 0.9355 | 0.757 | 1.1298 | 0.8035 | 0.8117 | | dpn107 | 32 | 0.9387 | 0.9334 | 0.7562 | 0.0 | 0.7969 | 0.7537 | | resnest101e | 64 | 1.0024 | 0.9911 | 0.8127 | 0.0 | 0.783 | 0.8038 | | swsl_resnext101_32x16d | 32 | 0.9991 | 
0.9824 | 0.8107 | 0.0 | 0.7688 | 0.6575 | | mobilevit_s | 64 | 0.9729 | 0.8148 | 0.6485 | 0.0 | 0.7518 | 0.8171 | | gluon_xception65 | 32 | 1.0004 | 0.989 | 0.7554 | 0.0 | 0.7283 | 0.7193 | | visformer_small | 128 | 0.9994 | 1.0012 | 0.8424 | 0.0 | 0.7277 | 0.7417 | | convmixer_768_32 | 32 | 0.9998 | 0.9983 | 0.9233 | 0.0 | 0.7255 | 0.7432 | | repvgg_a2 | 128 | 0.9417 | 0.9351 | 0.7935 | 1.0682 | 0.6994 | 0.7209 | | sebotnet33ts_256 | 64 | 0.9654 | 0.8367 | 0.6807 | 0.9696 | 0.6861 | 0.6741 | | ese_vovnet19b_dw | 128 | 0.9705 | 0.9651 | 0.7688 | 1.12 | 0.649 | 0.7146 | | res2net50_14w_8s | 128 | 0.9999 | 0.9931 | 0.812 | 0.9923 | 0.6367 | 0.6068 | | mobilenetv2_100 | 128 | 0.9503 | 0.9415 | 0.7232 | 1.1257 | 0.6111 | 0.6437 | | eca_botnext26ts_256 | 128 | 0.9803 | 0.8119 | 0.6719 | 1.0718 | 0.6039 | 0.6495 | | botnet26t_256 | 128 | 0.9794 | 0.9735 | 0.8123 | 1.2777 | 0.5782 | 0.599 | | res2next50 | 128 | 0.9995 | 0.9953 | 0.8336 | 1.1455 | 0.5625 | 0.6006 | | eca_halonext26ts | 128 | 0.9808 | 0.8168 | 0.6796 | 0.0 | 0.0 | 0.6228 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | 
tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | cait_m36_384 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass 
| pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | ghostnet_100 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | 
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 7.1382 | 30.6307 | 60.7049 | nan | 207.5764 | 200.8124 | | pnasnet5large | 16 | 5.7933 | 23.2309 | 41.4979 | nan | 199.2617 | 194.6135 | | res2net50_14w_8s | 128 | 3.134 | 14.7254 | 25.0077 | 334.6095 | 196.7245 | 194.4486 | | ghostnet_100 | 128 | 3.3059 | 9.8021 | 14.7198 | 190.097 | 194.5552 | 194.4957 | | twins_pcpvt_base | 64 | 2.9491 | 15.3613 | 25.7912 | nan | 141.1447 | 139.377 | | res2net101_26w_4s | 64 | 3.5691 | 16.4722 | 29.6274 | nan | 137.0176 | 132.124 | | dpn107 | 32 | 4.1495 | 13.6584 | 39.566 | nan | 130.6184 | 129.4981 | | rexnet_100 | 128 | 2.0424 | 7.5076 | 16.9585 | nan | 124.0767 | 121.1943 | | mobilevit_s | 64 | 2.1205 | 7.4838 | 15.9728 | nan | 120.6869 | 115.077 | | fbnetv3_b | 128 | 3.4263 | 11.7534 | 28.3009 | nan | 114.8787 | 112.0847 | | resnest101e | 64 | 3.9547 | 16.2077 | 27.3929 | nan | 111.852 | 110.8699 | | tf_mixnet_l | 128 | 5.8733 | 13.2223 | 27.0305 | nan | 104.8769 | 102.7426 | | mixnet_l | 128 | 5.6523 | 12.8772 | 27.367 | nan | 103.9139 | 101.197 | | tinynet_a | 128 | 2.2991 | 8.0408 | 19.8386 | 196.3944 | 101.7799 | 99.625 | | adv_inception_v3 | 128 | 1.923 | 8.588 | 13.4662 | 181.0693 | 100.825 | 97.9333 | | inception_v3 | 128 | 1.8086 | 8.6244 | 13.5287 | 186.1773 | 100.1217 | 98.428 | | gluon_inception_v3 | 128 | 1.8092 | 8.7861 | 13.2291 | 183.9305 | 100.0097 | 97.5906 | | fbnetc_100 | 128 | 2.188 | 6.7344 | 17.0788 | 135.8096 | 93.1561 | 91.6558 | | dla102 
| 128 | 2.0467 | 9.5655 | 15.3722 | 242.0405 | 92.1808 | 89.5413 | | spnasnet_100 | 128 | 2.2324 | 6.6108 | 16.8729 | 133.0404 | 86.8534 | 84.624 | | cspdarknet53 | 64 | 2.468 | 7.4381 | 18.826 | 148.5819 | 86.3409 | 84.1533 | | poolformer_m36 | 64 | 1.8417 | 7.2885 | 12.1295 | nan | 84.9081 | 81.2099 | | mobilenetv3_large_100 | 128 | 1.7421 | 5.7068 | 13.7883 | 141.3526 | 84.6351 | 84.2664 | | xcit_large_24_p8_224 | 5 | 3.5878 | nan | nan | nan | 84.3114 | 80.9055 | | res2next50 | 128 | 1.9105 | 8.1849 | 12.8626 | 200.126 | 84.2448 | 82.4441 | | tf_efficientnet_b0 | 128 | 1.9957 | 6.9116 | 16.1898 | 179.269 | 81.8606 | 79.7389 | | swin_base_patch4_window7_224 | 64 | 3.1555 | 12.8772 | nan | nan | 79.8223 | 77.6069 | | sebotnet33ts_256 | 64 | 1.8342 | 6.1161 | 13.2854 | 150.9501 | 79.011 | 77.6793 | | gluon_xception65 | 32 | 2.5206 | 11.1232 | 17.9677 | nan | 75.1248 | 72.5451 | | coat_lite_mini | 128 | 1.1844 | 5.1934 | 8.1782 | 113.4021 | 74.8003 | 73.719 | | convnext_base | 64 | 1.5443 | 7.0903 | 11.9054 | nan | 73.7241 | 72.9432 | | mobilenetv2_100 | 128 | 1.775 | 5.6879 | 13.0603 | 119.0669 | 73.5037 | 73.8715 | | mnasnet_100 | 128 | 1.6965 | 5.3417 | 13.4361 | 105.5462 | 72.8093 | 71.6167 | | swsl_resnext101_32x16d | 32 | 2.049 | 9.2423 | 14.771 | nan | 72.3473 | 69.357 | | cait_m36_384 | 4 | 3.7244 | 19.865 | nan | nan | 72.0915 | 69.2889 | | regnety_002 | 128 | 1.7467 | 5.6593 | 13.0926 | 116.5973 | 68.0563 | 67.253 | | jx_nest_base | 32 | 2.1427 | 9.1354 | 16.2218 | nan | 66.3362 | 65.2566 | | dm_nfnet_f0 | 128 | 2.1575 | 7.3995 | 10.7492 | 160.5403 | 66.0004 | 64.708 | | eca_botnext26ts_256 | 128 | 1.4711 | 4.8286 | 10.2609 | 120.4131 | 60.6206 | 59.5059 | | visformer_small | 128 | 0.9433 | 4.1223 | 6.1299 | nan | 59.6406 | 58.406 | | botnet26t_256 | 128 | 1.3436 | 4.3524 | 9.1449 | 91.787 | 58.4383 | 56.6115 | | ese_vovnet19b_dw | 128 | 1.012 | 3.1747 | 6.6287 | 67.0142 | 54.0795 | 53.3578 | | selecsls42b | 128 | 0.8198 | 3.7491 | 5.7882 | 87.4336 | 
53.1991 | 51.8345 | | lcnet_050 | 128 | 1.0113 | 3.3144 | 7.4532 | 81.3733 | 52.4704 | 50.9969 | | gernet_l | 128 | 2.1482 | 6.1097 | 15.3379 | 115.4506 | 50.8977 | 49.8336 | | nfnet_l0 | 128 | 1.929 | 7.0856 | 10.7905 | 146.8857 | 48.5829 | 47.4466 | | gmlp_s16_224 | 128 | 1.3815 | 7.4177 | 12.0737 | 198.0962 | 46.5627 | 44.6099 | | volo_d1_224 | 64 | 1.4641 | 7.4879 | 12.1342 | nan | 43.0896 | 41.1358 | | crossvit_9_240 | 128 | 1.9068 | 8.497 | 13.5646 | 202.781 | 41.0781 | 39.7987 | | tnt_s_patch16_224 | 128 | 1.968 | 10.7497 | nan | nan | 40.7303 | 38.7771 | | gmixer_24_224 | 128 | 1.6837 | 8.3341 | 13.4185 | 186.6649 | 38.4944 | 37.0519 | | repvgg_a2 | 128 | 2.0816 | 6.0563 | 16.773 | 190.0103 | 37.0236 | 35.7174 | | convmixer_768_32 | 32 | 1.3894 | 6.3341 | 9.9572 | nan | 32.3611 | 31.0632 | | convit_base | 64 | 1.2323 | 5.9017 | 9.3349 | 145.9939 | 31.2775 | 29.7701 | | mixer_b16_224 | 128 | 0.8009 | 3.8587 | 5.8174 | 83.2447 | 26.8426 | 25.7415 | | deit_base_distilled_patch16_224 | 64 | 0.9833 | 4.5726 | 7.6097 | 86.6364 | 24.9807 | 23.623 | | pit_b_224 | 64 | 1.1339 | 5.1912 | 8.2672 | 109.2778 | 24.5647 | 23.5466 | | resmlp_12_224 | 128 | 0.6715 | 3.009 | 4.8201 | 53.6285 | 24.0111 | 23.8247 | | vit_base_patch16_224 | 64 | 0.9829 | 4.4439 | 7.1602 | 90.5348 | 23.3795 | 22.4265 | | beit_base_patch16_224 | 64 | 1.315 | 5.4397 | nan | nan | 22.5126 | 21.4147 | | eca_halonext26ts | 128 | 1.4707 | 5.0609 | 10.6052 | nan | nan | 76.1581 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 | | gmlp_s16_224 | 128 | 0.9938 | 0.9715 | 0.3561 | 1.3557 | 1.284 | 1.2997 | | tinynet_a | 128 | 0.9889 | 0.7884 | 0.2766 | 0.4726 | 1.1635 | 1.4912 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0842 | 1.1492 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3633 | nan | 1.0576 | 1.2923 | | convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0528 | 1.1534 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.045 | 1.3028 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.038 | 1.1389 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.0011 | 1.2582 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9907 | 1.2271 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 0.9796 | 0.9842 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3402 | nan | 0.9745 | 1.0806 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 0.9567 | 1.1358 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2875 | nan | 0.9484 | 1.057 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9464 | 0.9678 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 0.9296 | 1.0969 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9291 | 0.9903 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.9289 | 0.9803 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.548 | 0.9185 | 1.2283 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2681 | 0.3766 | 0.9137 | 1.123 | | mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.9088 | 0.9818 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9066 | 0.9846 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 0.8962 | 1.1046 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3594 | 1.222 | 0.8916 | 
0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8911 | 0.8962 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2718 | nan | 0.8815 | 0.98 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2672 | 0.476 | 0.8765 | 1.1944 | | dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9309 | 0.8723 | 1.0162 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.315 | nan | 0.8648 | 1.0056 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8599 | 0.9862 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8599 | 0.9862 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8599 | 0.9862 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | nan | 0.852 | 0.9728 | | dpn107 | 32 | 0.997 | 0.9097 | 0.3531 | nan | 0.8455 | 0.944 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8442 | 0.965 | | cspdarknet53 | 64 | 0.9913 | 0.8405 | 0.3241 | 0.8382 | 0.8368 | 0.9122 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.8174 | 1.0976 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8146 | 0.9442 | | resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.8092 | 0.8239 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.8041 | 1.0135 | | convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.8022 | 1.0085 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3529 | 0.8765 | 0.7927 | 0.9534 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.787 | 0.9294 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.7861 | 1.0072 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.7727 | 0.9234 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.7713 | 0.9528 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.7707 | 1.0052 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.7697 | 0.9414 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.7605 | 0.942 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 
0.9257 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.7499 | 0.9634 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5513 | 0.7318 | 0.8133 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8161 | 0.7239 | 0.9334 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.7101 | 0.9306 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | 0.8188 | 0.6955 | 0.8352 | | jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6706 | 0.8617 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.6615 | 0.9434 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.5858 | 0.8993 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5572 | 0.8383 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.2673 | nan | nan | 1.206 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 296.6123 | 296.9164 | 321.1545 | nan | 408.7444 | 398.9714 | | hrnet_w18 | 128 | 301.4289 | 290.0878 | 352.6004 | nan | 248.0358 | 270.1309 | | res2next50 | 128 | 138.2161 | 138.5627 | 166.2852 | 120.4655 | 246.3666 | 230.1575 | | res2net50_14w_8s | 128 | 146.0825 | 146.9282 | 179.7486 | 147.8972 | 230.0818 | 243.0809 | | dla102 | 128 | 178.6447 | 179.1081 | 212.9949 | 135.7274 | 213.2333 | 218.0643 | | resnest101e | 64 | 168.1158 | 164.5634 | 199.4316 | nan | 208.0718 | 204.1968 | | pnasnet5large | 16 | 231.3828 | 214.2952 | 258.8296 | nan | 204.5995 | 211.0721 | | tf_mixnet_l | 128 | 195.2281 | 210.7869 | 240.4766 | nan | 185.5949 | 193.8769 | | 
tnt_s_patch16_224 | 128 | 363.4312 | 364.1331 | nan | nan | 182.862 | 185.194 | | eca_botnext26ts_256 | 128 | 112.1983 | 135.4982 | 163.5949 | 102.4426 | 182.1183 | 169.1464 | | botnet26t_256 | 128 | 105.9555 | 106.6882 | 127.8074 | 81.2445 | 179.5448 | 173.4188 | | mixnet_l | 128 | 186.6282 | 202.3695 | 230.3064 | nan | 176.8863 | 182.339 | | poolformer_m36 | 64 | 149.0658 | 149.1357 | 184.3745 | nan | 156.3298 | 156.1852 | | swsl_resnext101_32x16d | 32 | 118.3021 | 119.9805 | 145.7654 | nan | 153.8851 | 179.5024 | | res2net101_26w_4s | 64 | 120.6093 | 121.7832 | 130.2964 | nan | 147.4957 | 163.2704 | | inception_v3 | 128 | 160.7852 | 161.3262 | 188.2592 | 140.6375 | 146.1733 | 149.9892 | | adv_inception_v3 | 128 | 161.2774 | 161.7936 | 188.9603 | 141.1285 | 144.6758 | 147.7535 | | gluon_inception_v3 | 128 | 161.2292 | 161.9164 | 188.9537 | 140.9712 | 143.6622 | 151.0161 | | convit_base | 64 | 181.5069 | 181.8341 | 217.3605 | 147.0691 | 139.7921 | 132.1471 | | dpn107 | 32 | 114.0089 | 115.7477 | 142.4159 | nan | 137.4786 | 143.5997 | | visformer_small | 128 | 98.3535 | 98.163 | 116.687 | nan | 134.7867 | 132.1578 | | gluon_xception65 | 32 | 98.029 | 98.9966 | 129.3808 | nan | 134.3485 | 136.1547 | | fbnetv3_b | 128 | 120.8415 | 122.333 | 148.9436 | nan | 124.5483 | 124.1783 | | pit_b_224 | 64 | 155.0952 | 155.618 | 188.3762 | 159.2575 | 121.0025 | 121.3689 | | sebotnet33ts_256 | 64 | 83.4163 | 96.1346 | 118.1704 | 83.0302 | 117.1748 | 119.2208 | | mobilevit_s | 64 | 90.2864 | 107.338 | 135.3195 | nan | 116.3972 | 107.0715 | | cspdarknet53 | 64 | 96.0841 | 96.7606 | 119.6186 | 80.0411 | 112.8645 | 111.3442 | | cait_m36_384 | 4 | 166.4608 | 168.063 | nan | nan | 112.6382 | 116.8608 | | beit_base_patch16_224 | 64 | 135.0296 | 138.0831 | nan | nan | 111.6334 | 112.0508 | | repvgg_a2 | 128 | 79.9609 | 80.3555 | 94.8011 | 70.3476 | 107.5871 | 104.3131 | | rexnet_100 | 128 | 91.0688 | 103.1907 | 127.1689 | nan | 107.4483 | 127.2614 | | vit_base_patch16_224 | 64 | 
120.6352 | 121.1857 | 144.2308 | 132.3415 | 106.7908 | 107.8345 | | mobilenetv2_100 | 128 | 67.758 | 68.3913 | 89.1057 | 57.1085 | 105.3731 | 99.9443 | | convnext_base | 64 | 121.5976 | 122.1142 | 152.7608 | nan | 103.9037 | 105.2548 | | dm_nfnet_f0 | 128 | 131.2331 | 131.056 | 148.9614 | 142.0141 | 102.2464 | 106.6476 | | ese_vovnet19b_dw | 128 | 67.8516 | 68.2377 | 85.8423 | 58.8443 | 101.6139 | 92.2555 | | swin_base_patch4_window7_224 | 64 | 147.6268 | 153.3407 | nan | nan | 100.3346 | 100.0115 | | gernet_l | 128 | 79.8375 | 80.5166 | 98.4599 | 71.7827 | 93.8833 | 91.156 | | mixer_b16_224 | 128 | 118.289 | 118.6575 | 147.4802 | 131.415 | 93.0924 | 93.121 | | tinynet_a | 128 | 75.4122 | 90.8647 | 110.8036 | 93.0161 | 92.2093 | 94.9887 | | tf_efficientnet_b0 | 128 | 90.6326 | 108.3032 | 131.0475 | 92.0341 | 90.5296 | 107.6522 | | gmlp_s16_224 | 128 | 136.2596 | 136.9138 | 173.1997 | 134.9127 | 89.2966 | 90.829 | | jx_nest_base | 32 | 119.2737 | 119.6494 | 149.6213 | nan | 85.4775 | 87.9317 | | volo_d1_224 | 64 | 134.795 | 135.3969 | 159.0003 | nan | 84.4322 | 86.172 | | nfnet_l0 | 128 | 106.0859 | 130.4147 | 148.8087 | 124.8812 | 84.397 | 89.6487 | | crossvit_9_240 | 128 | 109.2155 | 109.723 | 130.5546 | 119.025 | 82.1759 | 83.7825 | | fbnetc_100 | 128 | 87.8082 | 88.6466 | 105.6237 | 72.4597 | 80.9447 | 91.9397 | | gmixer_24_224 | 128 | 120.0092 | 136.2182 | 166.4488 | 129.8398 | 78.848 | 79.7822 | | selecsls42b | 128 | 62.8861 | 63.1339 | 74.6655 | 48.9435 | 73.7234 | 73.2374 | | deit_base_distilled_patch16_224 | 64 | 94.187 | 94.7764 | 118.0644 | 96.3248 | 73.4859 | 74.8199 | | twins_pcpvt_base | 64 | 123.5456 | 138.6285 | 137.9018 | nan | 70.1793 | 78.6765 | | spnasnet_100 | 128 | 76.8527 | 77.1507 | 93.3409 | 66.6638 | 68.3286 | 72.3391 | | coat_lite_mini | 128 | 116.021 | 116.4099 | 136.8176 | 100.7737 | 65.6644 | 65.0154 | | mobilenetv3_large_100 | 128 | 65.9539 | 66.6388 | 80.5173 | 63.7926 | 62.7224 | 64.2583 | | xcit_large_24_p8_224 | 5 | 155.8777 | nan 
| nan | nan | 61.69 | 94.009 | | ghostnet_100 | 128 | 100.9266 | 98.2438 | 107.8686 | 95.0292 | 60.4602 | 66.8888 | | resmlp_12_224 | 128 | 68.3099 | 68.3709 | 87.3291 | 45.9397 | 59.2259 | 61.1655 | | mnasnet_100 | 128 | 70.0888 | 70.8353 | 84.6752 | 55.1914 | 57.9244 | 60.8037 | | regnety_002 | 128 | 53.8756 | 56.3136 | 46.9591 | 61.123 | 38.3819 | 46.2874 | | lcnet_050 | 128 | 33.6606 | 34.4721 | 38.6407 | 31.5309 | 21.9173 | 26.2496 | | eca_halonext26ts | 128 | 115.9619 | 139.3538 | 167.6424 | nan | nan | 182.53 | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/3uq63Lc.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/WHEw9HO.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/LcTvJ8N.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.
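A rough sketch of how the accuracy status for a single model/backend pair could be derived. This is not the actual benchmark harness: the function name, the tolerances, and the use of flat lists of floats in place of tensors are all assumptions made for illustration. The three return values mirror the labels that appear in the accuracy tables (`pass`, `fail_accuracy`, `fail_to_run`).

```python
import math


def check_accuracy(reference, run_backend, rtol=1e-3, atol=1e-3):
    """Classify one backend run against reference values from native PyTorch.

    reference:   dict with "outputs" and "grads", each a flat list of floats
                 (a stand-in for real tensors; hypothetical layout).
    run_backend: zero-argument callable returning a dict of the same shape.
    rtol, atol:  illustrative tolerances, not the ones used by the dashboard.
    """
    try:
        result = run_backend()
    except Exception:
        # Any crash during compilation or execution counts as fail_to_run.
        return "fail_to_run"

    for key in ("outputs", "grads"):
        expected, actual = reference[key], result[key]
        if len(expected) != len(actual):
            return "fail_accuracy"
        if not all(
            math.isclose(a, e, rel_tol=rtol, abs_tol=atol)
            for a, e in zip(actual, expected)
        ):
            return "fail_accuracy"
    return "pass"


# Hypothetical data for illustration only.
reference = {"outputs": [0.5, 1.25], "grads": [0.1, -0.2]}


def good_backend():
    return {"outputs": [0.5001, 1.2499], "grads": [0.1, -0.2]}


def drifted_backend():
    return {"outputs": [0.5, 1.25], "grads": [0.1, -0.3]}


def crashing_backend():
    raise RuntimeError("compile error")


statuses = [
    check_accuracy(reference, fn)
    for fn in (good_backend, drifted_backend, crashing_backend)
]
```

Note that both the forward outputs and the gradients have to match for a model to count as passing, which is why `drifted_backend` is classified as `fail_accuracy` even though its outputs are exact.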

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
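As a minimal sketch of that filtering step, the snippet below drops models that fail accuracy for a given backend and then takes the geometric mean of the remaining speedups, where speedup is native PyTorch latency divided by backend latency. The row layout, field names, and numbers are hypothetical and are not taken from the dashboard scripts.

```python
import math


def geomean_speedup(rows, backend):
    """Geometric mean speedup over models that pass accuracy for `backend`."""
    speedups = [
        row["native_ms"] / row["latency_ms"][backend]
        for row in rows
        if row["accuracy"][backend] == "pass"
    ]
    if not speedups:
        return float("nan")
    # Geometric mean computed in log space.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical rows for illustration only.
rows = [
    {"name": "model_a", "native_ms": 100.0,
     "latency_ms": {"inductor": 50.0}, "accuracy": {"inductor": "pass"}},
    {"name": "model_b", "native_ms": 90.0,
     "latency_ms": {"inductor": 180.0}, "accuracy": {"inductor": "pass"}},
    # Excluded from the mean: it fails the accuracy check.
    {"name": "model_c", "native_ms": 100.0,
     "latency_ms": {"inductor": 10.0}, "accuracy": {"inductor": "fail_accuracy"}},
]

result = geomean_speedup(rows, "inductor")
```

Here `model_a` contributes a 2.0x speedup and `model_b` a 0.5x slowdown, so the geometric mean is 1.0x; `model_c` would have inflated the number had it not been filtered out.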

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 54/56 | 98%, 45/46  | 98%, 60/61  |
|       aot_eager        | 93%, 52/56 | 98%, 45/46  | 98%, 60/61  |
|     aot_cudagraphs     | 73%, 41/56 | 35%, 16/46  | 46%, 28/61  |
|    nvprims_nvfuser     | 77%, 43/56 | 61%, 28/46  | 67%, 41/61  |
|        inductor        | 84%, 47/56 | 87%, 40/46  | 93%, 57/61  |
| inductor_no_cudagraphs | 89%, 50/56 | 93%, 43/46  | 93%, 57/61  |
+------------------------+------------+-------------+-------------+
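For reference, each passrate cell is consistent with "passing models divided by all models in the suite, rounded to a whole percent". A small sketch of that formatting, with a hypothetical helper name and a made-up status list:

```python
def format_passrate(statuses):
    """Format a list of accuracy statuses as e.g. '96%, 54/56'."""
    total = len(statuses)
    if total == 0:
        return "n/a"
    # Anything other than "pass" (fail_to_run, fail_accuracy, ...) counts as a failure.
    passed = sum(1 for status in statuses if status == "pass")
    return f"{passed / total:.0%}, {passed}/{total}"


# Hypothetical status list: 54 passing models and 2 failures.
example = ["pass"] * 54 + ["fail_to_run", "fail_accuracy"]
cell = format_passrate(example)
```

With 54 of 56 models passing, this yields the same string as the eager/torchbench cell in the table above.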

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.01x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.02x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.12x    |    1.00x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.04x    |    1.14x    |
|        inductor        |   1.47x    |    1.23x    |    1.23x    |
| inductor_no_cudagraphs |   1.23x    |    1.21x    |    1.23x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.81    |    2.58     |    2.03     |
|       aot_eager        |    6.03    |    8.93     |    8.01     |
|     aot_cudagraphs     |    7.83    |    16.30    |    13.76    |
|    nvprims_nvfuser     |   61.87    |    89.81    |   140.94    |
|        inductor        |   32.84    |    36.18    |    37.32    |
| inductor_no_cudagraphs |   32.77    |    30.82    |    36.22    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.92x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.90x    |    1.01x    |    0.95x    |
|        inductor        |   0.83x    |    0.74x    |    0.97x    |
| inductor_no_cudagraphs |   0.99x    |    1.00x    |    1.09x    |
+------------------------+------------+-------------+-------------+
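To make "higher is better" concrete: the sketch below assumes the compression ratio is the peak memory of native PyTorch divided by the peak memory of the backend. That definition is an assumption on my part (it is not spelled out in this report), and the function name and numbers are illustrative.

```python
def compression_ratio(native_peak_gb, backend_peak_gb):
    """Peak-memory compression ratio under the assumed definition.

    A value above 1.0 means the backend used less peak memory than
    native PyTorch; a value below 1.0 means it used more.
    """
    return native_peak_gb / backend_peak_gb


# Hypothetical measurements for illustration only.
uses_more_memory = compression_ratio(4.0, 10.0)  # backend needs 2.5x the memory
uses_less_memory = compression_ratio(4.0, 3.2)   # backend needs 20% less memory
```

Under this reading, a ratio such as 0.39x means the backend's peak footprint is roughly 2.5x that of native PyTorch.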

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the two most recent reports that actually ran the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205
Previous report name: /data/home/anijain/cluster/cron_logs/day_322_18_11_22_performance_float32_288

Passrate diff
~~~
+------------------------+-------------+------------+------------+
| compiler               | suite       | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
| inductor               | torchbench  | 84%, 47/56 | 82%, 46/56 |
| inductor               | huggingface | 83%, 39/47 | 83%, 39/47 |
| inductor               | timm_models | 93%, 57/61 | 93%, 57/61 |
| inductor_no_cudagraphs | torchbench  | 89%, 50/56 | 89%, 50/56 |
| inductor_no_cudagraphs | huggingface | 87%, 41/47 | 87%, 41/47 |
| inductor_no_cudagraphs | timm_models | 93%, 57/61 | 93%, 57/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff
~~~
+------------------------+-------------+------------+-----------+
| compiler               | suite       | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
| inductor               | torchbench  | 1.47x      | 1.47x     |
| inductor               | huggingface | 1.25x      | 1.23x     |
| inductor               | timm_models | 1.23x      | 1.23x     |
| inductor_no_cudagraphs | torchbench  | 1.23x      | 1.23x     |
| inductor_no_cudagraphs | huggingface | 1.22x      | 1.22x     |
| inductor_no_cudagraphs | timm_models | 1.23x      | 1.23x     |
+------------------------+-------------+------------+-----------+
~~~
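A diff like this can be produced by joining the two reports on (compiler, suite) and flagging the entries whose value changed. The sketch below is only an illustration of that idea: the function name and the dict layout are hypothetical, and the sample values are the inductor rows from the geometric mean speedup diff.

```python
def diff_summaries(prev, cur):
    """Join two summaries keyed by (compiler, suite).

    Returns rows of (compiler, suite, prev_value, cur_value, changed),
    keeping only keys present in both reports.
    """
    rows = []
    for key, prev_value in prev.items():
        if key not in cur:
            continue
        cur_value = cur[key]
        rows.append((*key, prev_value, cur_value, prev_value != cur_value))
    return rows


prev_speedup = {
    ("inductor", "torchbench"): "1.47x",
    ("inductor", "huggingface"): "1.25x",
    ("inductor", "timm_models"): "1.23x",
}
cur_speedup = {
    ("inductor", "torchbench"): "1.47x",
    ("inductor", "huggingface"): "1.23x",
    ("inductor", "timm_models"): "1.23x",
}

changed = [
    (compiler, suite)
    for compiler, suite, _, _, is_changed in diff_summaries(prev_speedup, cur_speedup)
    if is_changed
]
```

For these three rows, only the inductor/huggingface entry is reported as changed.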

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+---------------------------------+---------------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+---------------------------------+---------------+------------------------+
| torchbench | tacotron2 | fail_to_run | pass |
| torchbench | functorch_dp_cifar10 | fail_to_run | fail_to_run |
| torchbench | hf_BigBird | fail_to_run | fail_to_run |
| torchbench | hf_Longformer | fail_to_run | fail_to_run |
| torchbench | moco | fail_to_run | fail_to_run |
| torchbench | resnet50_quantized_qat | fail_accuracy | fail_accuracy |
| torchbench | mobilenet_v2_quantized_qat | fail_accuracy | fail_accuracy |
| torchbench | vision_maskrcnn | 0.0000 | 0.0000 |
| huggingface | DebertaV2ForQuestionAnswering | fail_to_run | pass |
| huggingface | PLBartForConditionalGeneration | fail_to_run | fail_to_run |
| huggingface | MBartForConditionalGeneration | fail_to_run | fail_to_run |
| huggingface | AllenaiLongformerBase | fail_to_run | fail_to_run |
| timm_models | deit_base_distilled_patch16_224 | pass | fail_accuracy |
| timm_models | convit_base | fail_to_run | fail_to_run |
| timm_models | fbnetv3_b | fail_accuracy | fail_accuracy |
| timm_models | resnest101e | fail_accuracy | fail_accuracy |
+-------------+---------------------------------+---------------+------------------------+
~~~

Performance speedup warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench | soft_actor_critic | 1.4732 | 0.9317 |
| torchbench | nvidia_deeprecommender | 0.8351 | 0.8842 |
| torchbench | hf_GPT2_large | 0.0 | 1.4716 |
| torchbench | hf_T5 | 0.0 | 1.5483 |
| torchbench | tacotron2 | 0.0 | 0.9198 |
| torchbench | functorch_dp_cifar10 | 0.0 | 0.0 |
| torchbench | hf_BigBird | 0.0 | 0.0 |
| torchbench | hf_Longformer | 0.0 | 0.0 |
| torchbench | moco | 0.0 | 0.0 |
| huggingface | DebertaV2ForMaskedLM | 1.0071 | 0.8461 |
| huggingface | DebertaV2ForQuestionAnswering | 0.9268 | 0.9129 |
| huggingface | TrOCRForCausalLM | 0.0 | 1.0236 |
| huggingface | BlenderbotForCausalLM | 0.0 | 1.015 |
| huggingface | AllenaiLongformerBase | 0.0 | 0.0 |
| timm_models | tnt_s_patch16_224 | 0.0 | 1.5123 |
+-------------+-------------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench | yolov3 | 363.4076 | 364.7413 |
| torchbench | timm_efficientdet | 128.4604 | 128.7663 |
| huggingface | XLNetLMHeadModel | 138.2006 | 130.5974 |
| huggingface | DebertaV2ForQuestionAnswering | 132.9739 | 48.9509 |
| huggingface | DebertaV2ForMaskedLM | 130.6191 | 50.5206 |
+-------------+-------------------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench | timm_resnest | 0.8982 | 1.0023 |
| torchbench | hf_Albert | 0.8836 | 1.2215 |
| torchbench | mobilenet_v3_large | 0.8675 | 0.896 |
| torchbench | hf_T5_large | 0.8643 | 0.922 |
| torchbench | timm_vision_transformer_large | 0.8621 | 1.031 |
| torchbench | densenet121 | 0.857 | 1.0006 |
| torchbench | resnet50 | 0.8566 | 0.9343 |
| torchbench | mnasnet1_0 | 0.8531 | 0.8659 |
| torchbench | pytorch_unet | 0.8484 | 1.0138 |
| torchbench | fastNLP_Bert | 0.8354 | 1.1229 |
| torchbench | hf_Bart | 0.8325 | 1.1284 |
| torchbench | resnext50_32x4d | 0.8303 | 0.8352 |
| torchbench | BERT_pytorch | 0.826 | 1.0815 |
| torchbench | dlrm | 0.7932 | 0.8152 |
| torchbench | drq | 0.7632 | 0.8778 |
| torchbench | timm_vovnet | 0.7609 | 0.9526 |
| torchbench | timm_vision_transformer | 0.7507 | 0.8214 |
| torchbench | soft_actor_critic | 0.7501 | 0.9991 |
| torchbench | alexnet | 0.743 | 0.8335 |
| torchbench | hf_Bert | 0.7061 | 1.0275 |
| torchbench | resnet18 | 0.6902 | 0.7049 |
| torchbench | LearningToPaint | 0.6882 | 0.913 |
| torchbench | vgg16 | 0.6637 | 0.9553 |
| torchbench | hf_DistilBert | 0.6595 | 0.9466 |
| torchbench | hf_Reformer | 0.577 | 1.0026 |
| torchbench | lennard_jones | 0.5647 | 0.9991 |
| torchbench | nvidia_deeprecommender | 0.5598 | 0.5598 |
| torchbench | attention_is_all_you_need_pytorch | 0.4867 | 0.6781 |
| torchbench | pytorch_struct | 0.4213 | 0.4334 |
| torchbench | dcgan | 0.2564 | 0.2576 |
| huggingface | YituTechConvBert | 0.894 | 0.9822 |
| huggingface | DistillGPT2 | 0.8939 | 1.0108 |
| huggingface | M2M100ForConditionalGeneration | 0.8659 | 1.0404 |
| huggingface | AlbertForQuestionAnswering | 0.8646 | 1.4039 |
| huggingface | PegasusForConditionalGeneration | 0.8637 | 1.0262 |
| huggingface | AlbertForMaskedLM | 0.842 | 1.3737 |
| huggingface | PLBartForCausalLM | 0.8367 | 1.0581 |
| huggingface | XGLMForCausalLM | 0.8157 | 0.9642 |
| huggingface | T5ForConditionalGeneration | 0.8129 | 1.1049 |
| huggingface | T5Small | 0.8129 | 1.1049 |
| huggingface | ElectraForCausalLM | 0.7929 | 0.9036 |
| huggingface | MBartForConditionalGeneration | 0.7896 | 0.9837 |
| huggingface | PegasusForCausalLM | 0.7774 | 0.9692 |
| huggingface | MT5ForConditionalGeneration | 0.7748 | 0.9324 |
| huggingface | BartForConditionalGeneration | 0.7734 | 0.958 |
| huggingface | MegatronBertForQuestionAnswering | 0.7709 | 1.0379 |
| huggingface | MegatronBertForCausalLM | 0.7673 | 1.0153 |
| huggingface | MBartForCausalLM | 0.7326 | 0.9478 |
| huggingface | BertForQuestionAnswering | 0.7273 | 1.0273 |
| huggingface | RobertaForQuestionAnswering | 0.7273 | 1.0274 |
| huggingface | LayoutLMForSequenceClassification | 0.7189 | 1.0294 |
| huggingface | BartForCausalLM | 0.7149 | 0.9466 |
| huggingface | BlenderbotSmallForCausalLM | 0.7147 | 0.8647 |
| huggingface | ElectraForQuestionAnswering | 0.7054 | 1.0297 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977 | 0.946 |
| huggingface | LayoutLMForMaskedLM | 0.695 | 0.9772 |
| huggingface | BertForMaskedLM | 0.6945 | 0.9772 |
| huggingface | CamemBert | 0.6942 | 0.9746 |
| huggingface | RobertaForCausalLM | 0.6942 | 0.9771 |
| huggingface | Speech2Text2ForCausalLM | 0.675 | 0.9168 |
| huggingface | DistilBertForQuestionAnswering | 0.6589 | 0.9118 |
| huggingface | DistilBertForMaskedLM | 0.6509 | 0.9194 |
| huggingface | DebertaV2ForMaskedLM | 0.5682 | 0.9491 |
| huggingface | MobileBertForMaskedLM | 0.4951 | 0.6649 |
| huggingface | DebertaV2ForQuestionAnswering | 0.4735 | 0.984 |
| huggingface | MobileBertForQuestionAnswering | 0.4145 | 0.535 |
| huggingface | DebertaForMaskedLM | 0.3862 | 1.0347 |
| huggingface | DebertaForQuestionAnswering | 0.2902 | 1.1339 |
| huggingface | BlenderbotForCausalLM | nan | 0.8509 |
| timm_models | selecsls42b | 0.899 | 1.0046 |
| timm_models | swsl_resnext101_32x16d | 0.8931 | 0.9946 |
| timm_models | res2net50_14w_8s | 0.8821 | 1.0206 |
| timm_models | regnety_002 | 0.8617 | 1.0396 |
| timm_models | botnet26t_256 | 0.8605 | 0.9622 |
| timm_models | pit_b_224 | 0.8525 | 1.0752 |
| timm_models | convnext_base | 0.8485 | 1.0335 |
| timm_models | sebotnet33ts_256 | 0.8189 | 0.9416 |
| timm_models | resmlp_12_224 | 0.8169 | 0.8253 |
| timm_models | coat_lite_mini | 0.8154 | 1.0235 |
| timm_models | gernet_l | 0.7928 | 0.9926 |
| timm_models | repvgg_a2 | 0.7684 | 0.9902 |
| timm_models | convit_base | 0.7449 | 0.9008 |
| timm_models | crossvit_9_240 | 0.6742 | 0.9001 |
| timm_models | tnt_s_patch16_224 | nan | 0.8633 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
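The four flagging rules listed at the top of this section can be sketched as a single Python function. This is only an illustration under stated assumptions: `warnings_for` and the constant names are hypothetical, and the choice to treat `pass_due_to_skip` as passing and to leave `nan` measurements unflagged is inferred from which rows appear in the tables, not from the dashboard code.

```python
# Thresholds quoted from the list of flagging rules in the report.
SPEEDUP_MIN = 0.95
COMPILE_LATENCY_MAX_SEC = 120.0
COMPRESSION_RATIO_MIN = 0.9

# Assumption: skipped accuracy checks count as passing.
PASSING_ACCURACY = {"pass", "pass_due_to_skip"}


def warnings_for(accuracy, speedup, compile_latency_sec, compression_ratio):
    """Return the list of warning categories triggered by one (model, compiler) result.

    Comparisons with float('nan') evaluate to False in Python, so a missing
    measurement is simply not flagged by this sketch.
    """
    flags = []
    if accuracy not in PASSING_ACCURACY:
        flags.append("accuracy")
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_latency_sec > COMPILE_LATENCY_MAX_SEC:
        flags.append("compilation_latency")
    if compression_ratio < COMPRESSION_RATIO_MIN:
        flags.append("compression_ratio")
    return flags


# torchbench / inductor values for yolov3 from this report:
# only the compilation latency exceeds its threshold.
print(warnings_for("pass", 1.0857, 363.4076, 0.9229))
```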

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

Performance speedup regressions
~~~
+------------------------+------------------------+-------------+------------+
|        compiler        |          name          | prev_status | cur_status |
+------------------------+------------------------+-------------+------------+
| inductor_no_cudagraphs | nvidia_deeprecommender |   0.9645    |   0.8842   |
+------------------------+------------------------+-------------+------------+
~~~

No regressions found.

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

No regressions found.
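The regression rule described in this section (a model that was not flagged in the previous report but is flagged in the current one) can be sketched in Python as follows. The function name `find_regressions` and the dict-based report layout are assumptions made for illustration; they are not the dashboard's actual data structures.

```python
def find_regressions(prev, cur, is_flagged):
    """Return {model: (prev_value, cur_value)} for newly flagged models.

    prev, cur: dicts mapping model name to a metric value for one compiler.
    is_flagged: predicate deciding whether a metric value triggers a warning.
    Models missing from the previous report are skipped (an assumption).
    """
    return {
        name: (prev[name], cur[name])
        for name in cur
        if name in prev and is_flagged(cur[name]) and not is_flagged(prev[name])
    }


def speedup_is_flagged(speedup):
    # Speedup warning threshold quoted in the Warnings section.
    return speedup < 0.95


# Values from the torchbench regression table (inductor_no_cudagraphs).
prev_report = {"nvidia_deeprecommender": 0.9645}
cur_report = {"nvidia_deeprecommender": 0.8842}
print(find_regressions(prev_report, cur_report, speedup_is_flagged))
```

A model that was already below the threshold in the previous report is not reported again, which keeps the regression list limited to new problems.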

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0013 | 1.0267 | 2.2958 | 0.8005 | 5.2596 | 1.2855 | | timm_efficientdet | 1 | 0.9826 | 0.892 | 1.8138 | 0.7536 | 4.5251 | 1.5651 | | timm_vision_transformer | 8 | 1.0056 | 0.9467 | 1.5358 | 0.6489 | 2.6408 | 1.4021 | | drq | 1 | 1.0268 | 0.876 | 1.6696 | 0.7266 | 2.5182 | 1.1341 | | BERT_pytorch | 16 | 1.0108 | 0.8904 | 1.1212 | 0.9408 | 2.1095 | 2.063 | | resnext50_32x4d | 8 | 1.0005 | 1.1066 | 1.2004 | 0.8073 | 2.0089 | 1.2036 | | resnet18 | 16 | 0.9951 | 1.1194 | 1.1656 | 0.874 | 2.0014 | 1.2143 | | mobilenet_v3_large | 32 | 1.0031 | 1.1258 | 1.0402 | 0.8619 | 1.9923 | 1.3629 | | squeezenet1_1 | 32 | 0.9897 | 1.0215 | 1.0349 | 0.8885 | 1.9602 | 1.3015 | | dcgan | 32 | 0.9984 | 1.0347 | 1.2705 | 0.8039 | 1.929 | 1.0162 | | pytorch_struct | 200 | 0.9963 | 0.7494 | 0.8871 | 0.8119 | 1.8081 | 1.1469 | | lennard_jones | 1000 | 0.9641 | 0.8718 | 1.0247 | 0.6826 | 1.8045 | 0.9683 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9967 | 1.0187 | 1.2076 | 0.85 | 1.7014 | 1.5401 | | hf_T5_large | 2 | 1.0272 | 0.9111 | 0.0 | 0.0 | 1.6498 | 1.6105 | | hf_Albert | 8 | 1.001 | 0.9961 | 0.7529 | 1.5578 | 1.6443 | 1.6409 | | shufflenet_v2_x1_0 | 128 | 1.0013 | 1.05 | 0.8086 | 0.8988 | 1.5881 | 1.4301 | | timm_resnest | 32 | 0.9993 | 1.0024 | 0.8047 | 1.1507 | 1.5231 | 1.4532 | | hf_GPT2 | 4 | 1.0114 | 0.9814 | 0.7693 | 0.3986 | 1.501 | 1.4981 | | mnasnet1_0 | 32 | 0.9993 | 1.0978 | 0.8631 | 0.92 | 1.4773 | 1.2705 | | soft_actor_critic | 256 | 0.9714 | 0.7856 | 1.0822 | 0.6981 | 1.4732 | 0.9317 | | mobilenet_v2 | 96 | 0.9997 | 0.9988 | 0.7301 | 1.3378 | 1.4299 | 1.4023 | 
| speech_transformer | 32 | 0.9923 | 0.8794 | 1.3964 | 0.7495 | 1.4287 | 1.4267 | | timm_efficientnet | 32 | 0.9556 | 0.8085 | 0.7008 | 0.8189 | 1.4259 | 1.1974 | | fastNLP_Bert | 6 | 0.9967 | 0.9758 | 0.752 | 1.1377 | 1.4171 | 1.3871 | | LearningToPaint | 96 | 1.0014 | 1.0562 | 0.8677 | 0.9807 | 1.2704 | 1.2192 | | pytorch_stargan | 16 | 0.9985 | 1.0756 | 0.9321 | 0.0 | 1.264 | 1.2281 | | resnet152 | 32 | 1.0008 | 1.0531 | 0.796 | 0.9086 | 1.232 | 1.2032 | | hf_Bert | 4 | 1.023 | 0.9919 | 0.7351 | 0.853 | 1.2084 | 1.1799 | | resnet50 | 32 | 0.999 | 0.9932 | 0.7605 | 0.9898 | 1.2071 | 1.1825 | | hf_Bart | 4 | 1.0159 | 0.9777 | 0.7393 | 0.8547 | 1.1985 | 1.182 | | timm_nfnet | 128 | 0.9994 | 1.0 | 0.0 | 1.1342 | 1.1963 | 1.1623 | | pytorch_unet | 1 | 0.9992 | 0.2814 | 0.0 | 0.0 | 1.193 | 1.178 | | hf_DistilBert | 8 | 1.0001 | 0.9571 | 0.687 | 0.5248 | 1.1756 | 1.1805 | | vgg16 | 64 | 0.9998 | 0.9987 | 0.8585 | 0.9977 | 1.1738 | 1.1662 | | alexnet | 128 | 0.9992 | 0.9962 | 0.8051 | 1.0044 | 1.1578 | 1.1579 | | Super_SloMo | 6 | 0.9999 | 0.2422 | 0.0 | 0.2479 | 1.154 | 1.1393 | | hf_Reformer | 4 | 0.9977 | 1.0016 | 0.9891 | 0.7341 | 1.1314 | 1.1443 | | timm_regnet | 32 | 0.9661 | 0.9632 | 0.7831 | 1.0939 | 1.1294 | 1.0922 | | Background_Matting | 4 | 1.0003 | 0.1918 | 0.0 | 0.0 | 1.0865 | 1.0783 | | yolov3 | 16 | 0.9999 | 0.9948 | 0.7899 | 1.1525 | 1.0857 | 1.0732 | | attention_is_all_you_need_pytorch | 256 | 0.9998 | 0.9698 | 0.7591 | 0.9544 | 1.0541 | 1.0386 | | timm_vision_transformer_large | 8 | 1.0 | 0.9951 | 0.0 | 0.0 | 1.0488 | 1.0342 | | mobilenet_v2_quantized_qat | 96 | 1.0014 | 0.98 | 0.0 | 1.4569 | 1.0483 | 1.1078 | | timm_vovnet | 32 | 0.9103 | 0.9039 | 0.7138 | 0.8981 | 1.0038 | 1.0209 | | demucs | 4 | 0.9998 | 0.9998 | 1.0004 | 1.0004 | 1.0004 | 1.0003 | | tts_angular | 64 | 0.9796 | 0.97 | 0.9806 | 0.9684 | 0.9999 | 1.0089 | | resnet50_quantized_qat | 32 | 1.0011 | 0.9712 | 0.0 | 1.1481 | 0.9998 | 0.9868 | | dlrm | 1024 | 1.4515 | 0.6016 | 0.0 | 0.628 | 
0.9839 | 1.3308 | | nvidia_deeprecommender | 256 | 0.9993 | 0.9635 | 0.585 | 0.9764 | 0.8351 | 0.8842 | | hf_GPT2_large | 4 | 0.9998 | 0.9806 | 0.0 | 0.0 | 0.0 | 1.4716 | | hf_T5 | 8 | 1.0007 | 0.951 | 0.0 | 1.1713 | 0.0 | 1.5483 | | tacotron2 | 64 | 0.9687 | 0.8655 | 0.0 | 0.7677 | 0.0 | 0.9198 | | functorch_dp_cifar10 | 64 | 1.0011 | 1.0238 | 2.1576 | 0.0 | 0.0 | 0.0 | | hf_BigBird | 2 | 0.9584 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | 
pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | pass | pass | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | fail_to_run | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | resnet152 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | hf_Albert | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | 
pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | pass | pass | 0.0000 | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | 0.0000 | fail_accuracy | fail_accuracy | fail_accuracy | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.8997 | 7.5824 | 10.6916 | 113.2917 | 363.4076 | 364.7413 | | timm_efficientdet 
| 1 | 19.6144 | 35.5198 | 67.3364 | 447.2823 | 128.4604 | 128.7663 | | hf_T5_large | 2 | 13.9116 | 38.3275 | nan | nan | 111.5199 | 110.1792 | | mobilenet_v2_quantized_qat | 96 | 1.292 | 8.891 | nan | 180.7885 | 86.2617 | 85.1447 | | resnet50_quantized_qat | 32 | 1.2007 | 8.5395 | nan | 172.2238 | 70.3024 | 70.9238 | | timm_nfnet | 128 | 2.0923 | 6.6404 | nan | 143.0401 | 60.23 | 59.6406 | | timm_resnest | 32 | 0.5873 | 2.271 | 3.2841 | 62.1344 | 57.1806 | 50.9202 | | timm_vision_transformer_large | 8 | 2.5074 | 12.7478 | nan | nan | 55.9496 | 52.9135 | | timm_efficientnet | 32 | 1.7712 | 6.0909 | 14.0973 | 108.7716 | 54.6285 | 60.3561 | | mobilenet_v3_large | 32 | 0.8987 | 4.3559 | 6.3498 | 98.5395 | 50.0016 | 50.0559 | | densenet121 | 4 | 2.1855 | 11.2911 | 17.6022 | 153.6778 | 43.7795 | 43.0777 | | timm_regnet | 32 | 2.2146 | 7.2579 | 17.4069 | 110.8243 | 43.7536 | 42.3201 | | attention_is_all_you_need_pytorch | 256 | 1.25 | 6.3989 | 9.9928 | 97.3589 | 43.0386 | 41.8656 | | resnet152 | 32 | 2.4728 | 11.9847 | 18.9362 | 169.9451 | 42.8807 | 41.9777 | | timm_vision_transformer | 8 | 0.8409 | 3.8483 | 5.3591 | 74.5575 | 31.4138 | 31.8642 | | hf_Bart | 4 | 1.6748 | 7.439 | 10.9471 | 108.7871 | 30.1422 | 29.7987 | | fastNLP_Bert | 6 | 1.5506 | 6.2673 | 9.2276 | 75.9537 | 27.5413 | 25.4919 | | BERT_pytorch | 16 | 1.5393 | 6.7164 | 9.6318 | 76.5472 | 27.3456 | 27.2226 | | pytorch_stargan | 16 | 0.4266 | 1.9764 | 2.6842 | nan | 26.3206 | 23.2142 | | speech_transformer | 32 | 1.6168 | 7.2492 | 26.6365 | 115.3434 | 24.8163 | 23.5458 | | Super_SloMo | 6 | 1.0816 | 7.0064 | nan | 56.6523 | 19.8813 | 19.516 | | hf_Bert | 4 | 1.618 | 6.251 | 8.4264 | 80.7479 | 19.7248 | 19.5429 | | mnasnet1_0 | 32 | 0.8269 | 3.9346 | 5.7235 | 72.2072 | 18.8995 | 18.1813 | | pytorch_struct | 200 | 0.2478 | 0.6954 | 1.2111 | 5.502 | 18.4378 | 21.9078 | | timm_vovnet | 32 | 1.5129 | 4.1728 | 9.0177 | 58.358 | 18.3768 | 17.8546 | | resnet50 | 32 | 0.8957 | 4.1343 | 6.0512 | 78.0484 | 18.3223 | 
17.5468 | | shufflenet_v2_x1_0 | 128 | 0.9681 | 4.6296 | 6.8235 | 89.739 | 18.2208 | 17.7663 | | Background_Matting | 4 | 0.7527 | 8.399 | nan | nan | 18.107 | 17.1636 | | hf_GPT2 | 4 | 1.4825 | 5.587 | 8.4205 | 64.9843 | 18.0904 | 17.1515 | | resnext50_32x4d | 8 | 0.9198 | 4.1628 | 5.8195 | 66.2397 | 17.9152 | 17.5952 | | hf_Reformer | 4 | 1.6137 | 2.8433 | 4.9241 | 15.9735 | 17.8028 | 15.5749 | | hf_Albert | 8 | 1.2828 | 5.3111 | 7.9066 | 105.4593 | 17.7891 | 17.4421 | | mobilenet_v2 | 96 | 0.8666 | 4.1843 | 6.3244 | 96.376 | 16.9952 | 16.6683 | | hf_DistilBert | 8 | 0.674 | 2.7962 | 5.0602 | 44.5101 | 12.034 | 11.9215 | | resnet18 | 16 | 0.4209 | 1.643 | 2.2993 | 31.1081 | 10.7373 | 10.4682 | | pytorch_unet | 1 | 0.4626 | 2.9789 | nan | nan | 9.0094 | 8.9456 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4076 | 1.7623 | 2.4271 | 32.6039 | 8.3865 | 8.0738 | | LearningToPaint | 96 | 0.4413 | 1.7442 | 2.5529 | 39.9559 | 7.0941 | 6.8661 | | dcgan | 32 | 0.1699 | 0.3936 | 0.5972 | 4.5852 | 5.8772 | 5.6552 | | squeezenet1_1 | 32 | 0.2277 | 0.7507 | 1.1565 | 4.9352 | 4.9442 | 4.6576 | | drq | 1 | 0.3171 | 0.5537 | 0.9006 | 4.8473 | 3.7415 | 3.1367 | | vgg16 | 64 | 0.1798 | 0.5247 | 0.8171 | 3.5834 | 3.6677 | 3.5196 | | soft_actor_critic | 256 | 0.2142 | 0.3218 | 0.5181 | 1.8413 | 3.3537 | 2.7274 | | alexnet | 128 | 0.1542 | 0.3788 | 0.5814 | 3.1527 | 3.3458 | 3.2117 | | nvidia_deeprecommender | 256 | 0.1951 | 0.3804 | 0.6131 | 5.13 | 3.2883 | 2.9606 | | dlrm | 1024 | 0.4419 | 0.771 | nan | 3.548 | 3.1432 | 2.7883 | | lennard_jones | 1000 | 0.1412 | 0.2627 | 0.4069 | 1.594 | 1.9188 | 1.7528 | | tts_angular | 64 | 0.1796 | 0.2231 | 0.3504 | 1.1938 | 1.8065 | 1.6009 | | demucs | 4 | 0.3016 | 0.3032 | 0.3055 | 0.3 | 0.211 | 0.2091 | | tacotron2 | 64 | 5.0041 | 16.5363 | nan | 50.9371 | nan | 45.3063 | | hf_GPT2_large | 4 | 5.1875 | 17.4989 | nan | nan | nan | 44.8628 | | hf_T5 | 8 | 2.538 | 8.473 | nan | 70.725 | nan | 27.9611 | | functorch_dp_cifar10 | 64 | 0.3049 | 1.218 | 
1.8147 | nan | nan | nan | | hf_BigBird | 2 | 3.3012 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.5274 | 1.5274 | | resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.226 | 1.4604 | 1.4604 | | timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2635 | 0.988 | 1.3048 | 1.3922 | | mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1744 | 1.2832 | | timm_efficientdet | 1 | 1.0111 | 0.823 | 0.2891 | 1.133 | 1.1153 | 1.1438 | | Super_SloMo | 6 | 1.0024 | 0.9018 | nan | 0.9455 | 1.1137 | 1.3409 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 1.0823 | 1.1864 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 1.0433 | 1.1066 | | speech_transformer | 32 | 0.9982 | 0.9772 | 0.2739 | 1.1209 | 1.0376 | 1.0444 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | hf_GPT2 | 4 | 1.0 | 0.906 | 0.3702 | 1.1242 | 0.9703 | 1.1698 | | timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.7594 | 0.9436 | 1.0969 | | timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9405 | 1.0771 | | yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8549 | 0.9229 | 1.1042 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9162 | 0.392 | 0.8945 | 0.9173 | 0.9986 | | Background_Matting | 
4 | 0.9998 | 0.8154 | nan | nan | 0.9107 | 1.0395 | | resnet152 | 32 | 0.9975 | 0.9153 | 0.3424 | 0.8736 | 0.9066 | 0.9672 | | pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 | | timm_resnest | 32 | 0.9935 | 0.88 | 0.3236 | 0.7926 | 0.8982 | 1.0023 | | hf_Albert | 8 | 1.0 | 0.949 | 0.2846 | 1.062 | 0.8836 | 1.2215 | | mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3278 | 0.8098 | 0.8675 | 0.896 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8643 | 0.922 | | timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 | | densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.857 | 1.0006 | | resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8566 | 0.9343 | | mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8531 | 0.8659 | | pytorch_unet | 1 | 0.9985 | 0.8222 | nan | nan | 0.8484 | 1.0138 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 1.2124 | 0.8354 | 1.1229 | | hf_Bart | 4 | 1.0 | 0.8779 | 0.3389 | 1.0865 | 0.8325 | 1.1284 | | resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8196 | 0.8303 | 0.8352 | | BERT_pytorch | 16 | 1.0 | 0.8995 | 0.3502 | 1.1284 | 0.826 | 1.0815 | | dlrm | 1024 | 0.8149 | 0.8149 | nan | 0.8147 | 0.7932 | 0.8152 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 | | timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7609 | 0.9526 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3309 | 1.0652 | 0.7507 | 0.8214 | | soft_actor_critic | 256 | 0.9998 | 0.9638 | 0.4356 | 0.9637 | 0.7501 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8335 | | hf_Bert | 4 | 1.0 | 0.9011 | 0.3525 | 1.0004 | 0.7061 | 1.0275 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6902 | 0.7049 | | LearningToPaint | 96 | 0.947 | 0.716 | 0.3392 | 0.6268 | 0.6882 | 0.913 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9553 | | hf_DistilBert | 8 | 1.0 | 0.9042 | 0.3212 | 1.0228 | 0.6595 | 0.9466 | | hf_Reformer | 4 | 0.9999 | 0.9996 | 0.5934 
| 0.9996 | 0.577 | 1.0026 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5647 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9676 | 0.4867 | 0.6781 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4213 | 0.4334 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.2564 | 0.2576 | | hf_GPT2_large | 4 | 1.0 | 0.8833 | nan | nan | nan | 1.1831 | | tacotron2 | 64 | 0.9903 | 1.0926 | nan | 1.114 | nan | 1.1617 | | hf_T5 | 8 | 1.0 | 0.9415 | nan | 0.9432 | nan | 1.1436 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | nan | nan | | hf_BigBird | 2 | 0.907 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | dlrm | 1024 | 131.3396 | 262.2964 | nan | 243.5252 | 221.0894 | 166.9172 | | timm_vision_transformer_large | 8 | 196.4318 | 198.5834 | nan | nan | 187.7897 | 191.2849 | | timm_nfnet | 128 | 206.1267 | 205.652 | nan | 181.3598 | 172.0065 | 176.836 | | Background_Matting | 4 | 186.5542 | 972.5201 | nan | nan | 171.8111 | 173.0113 | | mobilenet_v2_quantized_qat | 96 | 147.0334 | 150.3501 | nan | 100.9331 | 141.1778 | 133.8722 | | hf_T5_large | 2 | 188.3722 | 216.012 | nan | nan | 119.6328 | 123.9528 | | Super_SloMo | 6 | 117.7144 | 484.8959 | nan | 474.5823 | 101.947 | 
103.2686 | | yolov3 | 16 | 102.389 | 102.8163 | 128.9671 | 88.6962 | 95.3473 | 95.2387 | | resnet50_quantized_qat | 32 | 93.4018 | 96.1118 | nan | 84.4803 | 93.6776 | 95.8215 | | vgg16 | 64 | 106.4235 | 106.7549 | 123.9614 | 106.4939 | 90.6639 | 91.0214 | | timm_regnet | 32 | 101.666 | 101.4375 | 125.5417 | 89.6262 | 86.8744 | 89.6436 | | demucs | 4 | 77.307 | 77.625 | 77.6347 | 77.3924 | 77.9556 | 77.61 | | hf_Reformer | 4 | 83.2294 | 83.107 | 84.0625 | 113.1966 | 73.4468 | 73.0528 | | resnet152 | 32 | 90.376 | 85.1661 | 113.2881 | 98.8253 | 73.3833 | 76.2889 | | attention_is_all_you_need_pytorch | 256 | 71.9685 | 74.1018 | 96.8467 | 75.3105 | 68.61 | 69.3875 | | mobilenet_v2 | 96 | 71.3884 | 71.512 | 97.8565 | 53.3759 | 49.9665 | 50.9653 | | pytorch_unet | 1 | 58.5325 | 208.1423 | nan | nan | 49.1007 | 49.6728 | | hf_Bart | 4 | 54.133 | 56.3964 | 74.4016 | 64.2946 | 45.9978 | 46.0588 | | hf_Albert | 8 | 75.8181 | 75.4692 | 99.7926 | 48.1405 | 45.6344 | 45.6963 | | fastNLP_Bert | 6 | 59.6828 | 62.0301 | 79.4697 | 53.2528 | 42.3655 | 42.944 | | timm_vovnet | 32 | 42.2972 | 42.5191 | 54.0406 | 42.9857 | 38.4442 | 37.6603 | | speech_transformer | 32 | 48.7821 | 55.3164 | 35.6084 | 68.8314 | 35.8197 | 35.6346 | | timm_efficientdet | 1 | 140.9728 | 156.6612 | 76.773 | 184.1892 | 35.382 | 90.6052 | | hf_GPT2 | 4 | 49.6119 | 51.0867 | 68.7224 | 126.065 | 33.6402 | 33.5841 | | hf_DistilBert | 8 | 38.7931 | 40.5474 | 56.6235 | 74.0338 | 33.1149 | 32.8511 | | hf_Bert | 4 | 37.9827 | 42.3241 | 53.4469 | 45.4154 | 32.3527 | 32.9495 | | timm_efficientnet | 32 | 44.9077 | 52.4064 | 61.1897 | 53.1857 | 32.1653 | 36.3443 | | resnet50 | 32 | 38.6432 | 38.8401 | 50.9034 | 39.0609 | 32.1177 | 33.0274 | | shufflenet_v2_x1_0 | 128 | 36.7888 | 35.1458 | 45.7046 | 41.0641 | 23.4348 | 26.1744 | | BERT_pytorch | 16 | 46.5169 | 52.2918 | 42.1646 | 49.4009 | 22.9686 | 23.0739 | | timm_resnest | 32 | 31.7068 | 31.5634 | 39.3001 | 27.477 | 20.8114 | 21.737 | | mnasnet1_0 | 32 | 28.3439 | 
25.837 | 33.2999 | 30.9297 | 19.4585 | 22.7293 | | pytorch_stargan | 16 | 24.2414 | 22.4605 | 25.909 | nan | 19.1033 | 19.6918 | | mobilenet_v3_large | 32 | 31.0174 | 30.4427 | 30.4049 | 36.9833 | 16.0752 | 23.5911 | | resnext50_32x4d | 8 | 26.8076 | 24.1596 | 22.0719 | 32.8811 | 13.4278 | 22.9402 | | densenet121 | 4 | 63.9358 | 63.4363 | 30.9024 | 94.8994 | 12.9162 | 53.0182 | | LearningToPaint | 96 | 15.3774 | 14.8804 | 17.9969 | 16.0785 | 12.3702 | 13.3306 | | alexnet | 128 | 12.4807 | 12.4731 | 15.447 | 12.3489 | 10.717 | 10.7418 | | nvidia_deeprecommender | 256 | 8.5365 | 8.8562 | 14.5756 | 8.7389 | 10.2007 | 9.645 | | pytorch_CycleGAN_and_pix2pix | 1 | 16.4856 | 16.2926 | 13.5818 | 19.8968 | 9.6207 | 10.8811 | | timm_vision_transformer | 8 | 23.8379 | 25.3605 | 15.7231 | 36.9064 | 9.4292 | 17.7319 | | tts_angular | 64 | 9.3797 | 9.7405 | 9.5819 | 9.6038 | 9.2674 | 9.2285 | | squeezenet1_1 | 32 | 12.6299 | 12.3261 | 12.0044 | 15.4423 | 6.9799 | 9.7858 | | resnet18 | 16 | 11.9884 | 10.781 | 10.2648 | 13.8713 | 6.5985 | 10.9989 | | pytorch_struct | 200 | 3.9016 | 4.9867 | 4.2389 | 5.7944 | 2.1033 | 3.3844 | | dcgan | 32 | 2.6081 | 2.5954 | 2.0946 | 3.3268 | 1.3437 | 2.7925 | | drq | 1 | 2.8766 | 3.4849 | 1.763 | 5.085 | 1.208 | 2.777 | | soft_actor_critic | 256 | 1.0339 | 1.2962 | 0.9633 | 1.518 | 0.7362 | 1.1201 | | lennard_jones | 1000 | 1.1208 | 1.2449 | 1.0849 | 1.5836 | 0.628 | 1.1773 | | tacotron2 | 64 | 3295.0119 | 3182.1471 | nan | 3535.8436 | nan | 3173.9951 | | hf_GPT2_large | 4 | 240.6403 | 245.2311 | nan | nan | nan | 163.692 | | hf_T5 | 8 | 182.9764 | 192.6528 | nan | 156.1455 | nan | 118.0649 | | functorch_dp_cifar10 | 64 | 11.2611 | 11.1915 | 5.4021 | nan | nan | nan | | hf_BigBird | 2 | 194.3611 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
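For anyone who wants to post-process these tables rather than read them by eye, here is a minimal sketch of a parser for the `+---+`-bordered, pipe-delimited layout used throughout this thread. The helper name `parse_ascii_table` is hypothetical and not part of the benchmark scripts. It assumes no cell is ever empty, which holds here because missing values are written as `nan` or `0.0`. The sample rows are copied from the absolute latency table above.

```python
import re


def parse_ascii_table(text, n_cols):
    """Turn a '+---+'-bordered, pipe-delimited table into a list of row dicts."""
    # Drop border runs such as +-------+----+--------+
    body = re.sub(r"\+(?:-+\+)+", " ", text)
    # Every non-empty piece between pipes is one cell; the gaps between
    # rows contain only whitespace and are discarded.
    cells = [c.strip() for c in body.split("|") if c.strip()]
    if len(cells) % n_cols != 0:
        raise ValueError("cell count is not a multiple of n_cols")
    rows = [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


# First four columns of two rows from the torchbench absolute latency table.
sample = (
    "+-------+----+--------+-----------+ "
    "| name | bs | eager | aot_eager | "
    "+-------+----+--------+-----------+ "
    "| dcgan | 32 | 2.6081 | 2.5954 | "
    "| drq | 1 | 2.8766 | 3.4849 | "
    "+-------+----+--------+-----------+"
)
rows = parse_ascii_table(sample, n_cols=4)
print(rows[0]["name"], float(rows[0]["eager"]))
```

Because the function splits on `|` rather than on newlines, it works whether the rows sit on separate lines or have been joined into one long line.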

huggingface suite with float32 precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | OPTForCausalLM | 2 | 1.0001 | 0.9319 | 0.0 | 0.8867 | 1.8033 | 1.813 | | GPT2ForSequenceClassification | 4 | 1.0 | 0.978 | 0.0 | 0.6982 | 1.772 | 1.7587 | | XLNetLMHeadModel | 8 | 0.997 | 0.9694 | 0.0 | 0.0 | 1.6849 | 1.703 | | MT5ForConditionalGeneration | 16 | 1.0233 | 0.9327 | 0.9809 | 1.0378 | 1.5508 | 1.5117 | | GoogleFnet | 16 | 0.9996 | 0.9989 | 0.0 | 1.5403 | 1.4473 | 1.5594 | | DistillGPT2 | 16 | 0.9999 | 0.9527 | 0.0 | 0.9215 | 1.4385 | 1.4856 | | ElectraForQuestionAnswering | 64 | 1.0009 | 0.9852 | 0.0 | 1.1939 | 1.4273 | 1.4079 | | T5ForConditionalGeneration | 4 | 1.0029 | 0.936 | 0.7256 | 1.1066 | 1.4211 | 1.4042 | | T5Small | 4 | 1.0013 | 0.9382 | 0.725 | 1.0938 | 1.4177 | 1.4141 | | MobileBertForMaskedLM | 64 | 1.0207 | 0.936 | 0.8423 | 0.0 | 1.4161 | 1.185 | | ElectraForCausalLM | 32 | 1.0008 | 0.9228 | 0.0 | 1.0178 | 1.4112 | 1.4507 | | MobileBertForQuestionAnswering | 128 | 1.0235 | 0.9578 | 0.0 | 0.0 | 1.3639 | 1.1182 | | LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9891 | 0.7376 | 1.117 | 1.312 | 1.2908 | | BertForQuestionAnswering | 16 | 1.0005 | 0.9898 | 0.733 | 1.1142 | 1.2874 | 1.2671 | | RobertaForQuestionAnswering | 16 | 1.0002 | 0.9887 | 0.7274 | 1.1143 | 1.2862 | 1.2733 | | RobertaForCausalLM | 16 | 1.0003 | 0.9715 | 0.0 | 1.0565 | 1.2738 | 1.2754 | | AlbertForQuestionAnswering | 4 | 1.0005 | 1.0021 | 0.0 | 1.2356 | 1.2606 | 1.2565 | | AlbertForMaskedLM | 4 | 1.0004 | 1.0 | 0.0 | 1.2295 | 1.2523 | 1.2503 | | MegatronBertForQuestionAnswering | 8 | 1.0003 | 0.9922 | 0.0 | 1.0791 | 1.2196 | 1.2046 | | MegatronBertForCausalLM | 
4 | 1.001 | 0.9847 | 0.7256 | 0.9686 | 1.2122 | 1.1977 | | LayoutLMForMaskedLM | 16 | 1.0001 | 0.9709 | 0.0 | 1.0622 | 1.1928 | 1.198 | | BertForMaskedLM | 16 | 1.0001 | 0.97 | 0.0 | 1.0623 | 1.1753 | 1.1799 | | YituTechConvBert | 16 | 0.9999 | 0.9684 | 0.0 | 1.0031 | 1.1736 | 1.1738 | | CamemBert | 16 | 0.9999 | 0.9695 | 0.0 | 1.0604 | 1.1711 | 1.1763 | | PLBartForConditionalGeneration | 4 | 1.0001 | 0.9596 | 0.0 | 0.9564 | 1.1588 | 1.1669 | | DistilBertForQuestionAnswering | 256 | 1.0001 | 1.0002 | 0.0 | 0.7889 | 1.1586 | 1.1525 | | XGLMForCausalLM | 8 | 1.0111 | 0.9426 | 0.7624 | 0.3096 | 1.1521 | 1.1603 | | PLBartForCausalLM | 8 | 1.0003 | 0.9516 | 0.0 | 0.9617 | 1.128 | 1.1826 | | M2M100ForConditionalGeneration | 16 | 1.0932 | 0.9594 | 0.0 | 1.0325 | 1.1112 | 1.1501 | | MBartForConditionalGeneration | 2 | 1.0 | 0.9879 | 0.0 | 1.0198 | 1.0973 | 1.0883 | | BartForConditionalGeneration | 2 | 1.0009 | 0.9884 | 0.0 | 0.4476 | 1.094 | 1.0851 | | MBartForCausalLM | 4 | 1.0009 | 0.9658 | 0.7543 | 0.9987 | 1.0856 | 1.0936 | | BartForCausalLM | 4 | 1.0003 | 0.9666 | 0.7554 | 0.9995 | 1.0826 | 1.0918 | | DebertaForMaskedLM | 4 | 0.9006 | 0.7927 | 0.7249 | 0.6355 | 1.0765 | 1.0393 | | DebertaForQuestionAnswering | 8 | 0.997 | 0.9827 | 0.6825 | 0.8612 | 1.048 | 1.2174 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0005 | 0.9406 | 0.0 | 0.9521 | 1.0352 | 1.042 | | DistilBertForMaskedLM | 128 | 0.9994 | 0.953 | 0.0 | 0.8054 | 1.0155 | 1.0299 | | PegasusForConditionalGeneration | 32 | 0.9993 | 0.9806 | 0.0 | 0.9798 | 1.0155 | 1.0097 | | DebertaV2ForMaskedLM | 1 | 0.8689 | 0.7422 | 0.0 | 0.0 | 1.0071 | 0.8461 | | Speech2Text2ForCausalLM | 256 | 0.9988 | 0.925 | 0.6531 | 0.9407 | 0.9905 | 1.0285 | | PegasusForCausalLM | 32 | 0.9991 | 0.9536 | 0.7325 | 0.9525 | 0.9721 | 0.9813 | | BlenderbotSmallForCausalLM | 64 | 1.0026 | 0.9103 | 0.6825 | 0.9138 | 0.9571 | 0.9876 | | DebertaV2ForQuestionAnswering | 2 | 0.8912 | 0.8307 | 0.0 | 0.62 | 0.9268 | 0.9129 | | TrOCRForCausalLM | 
32 | 0.9996 | 0.9565 | 0.0 | 0.966 | 0.0 | 1.0236 | | BlenderbotForCausalLM | 4 | 1.003 | 0.9819 | 0.0 | 0.9494 | 0.0 | 1.015 | | AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+ | BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass | | MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass | | BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass | | DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass | | GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass | | M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | T5ForConditionalGeneration | 1 | pass | 
pass | pass | pass | pass | pass | | OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | T5Small | 1 | pass | pass | pass | pass | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | CamemBert | 1 | pass | pass | pass | pass | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | PLBartForCausalLM 
| 1 | pass | pass | pass | pass | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | pass | | PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | XLNetLMHeadModel | 8 | 4.6561 | 18.6303 | nan | nan | 138.2006 | 130.5974 | | DebertaV2ForQuestionAnswering | 2 | 7.3799 | 16.4677 | nan | 97.5873 | 132.9739 | 48.9509 | | DebertaV2ForMaskedLM | 1 | 7.4673 | 16.4867 | nan | nan | 130.6191 | 50.5206 | | DebertaForMaskedLM | 4 | 4.6728 | 10.6194 | 33.7745 | 74.374 | 86.1345 | 36.1223 | | DebertaForQuestionAnswering | 8 | 4.6892 | 10.6492 | 34.2765 | 67.5962 | 82.2699 | 36.6154 | | XGLMForCausalLM | 8 | 2.5224 | 11.3989 | 22.9746 | 166.5705 | 71.3648 | 71.3285 | | MobileBertForQuestionAnswering | 128 | 8.6099 | 25.9459 | nan | nan | 56.9964 | 53.855 | | MobileBertForMaskedLM | 64 | 8.6208 | 25.8642 | 44.5196 | nan | 54.9197 | 52.8785 | | MT5ForConditionalGeneration | 16 | 3.7482 | 12.2445 | 19.1306 | 110.002 | 51.7237 | 50.9848 | | M2M100ForConditionalGeneration | 16 | 2.9885 | 14.7654 | nan | 172.8591 | 50.5416 | 
47.6943 | | BartForConditionalGeneration | 2 | 3.2356 | 13.9214 | nan | 203.1599 | 45.6285 | 44.587 | | PegasusForConditionalGeneration | 32 | 2.9632 | 13.6824 | nan | 204.4906 | 44.9014 | 41.0066 | | MBartForConditionalGeneration | 2 | 3.3204 | 14.4495 | nan | 225.9411 | 43.9415 | 42.4238 | | YituTechConvBert | 16 | 2.3187 | 9.1951 | nan | 107.3293 | 39.1815 | 36.4325 | | MegatronBertForCausalLM | 4 | 3.4112 | 12.0395 | 18.4774 | 172.7314 | 34.7484 | 33.3358 | | MegatronBertForQuestionAnswering | 8 | 3.3895 | 11.736 | nan | 165.4508 | 34.4286 | 32.6166 | | T5ForConditionalGeneration | 4 | 2.5695 | 8.2851 | 12.9675 | 72.1388 | 31.173 | 30.1392 | | T5Small | 4 | 2.5586 | 8.3682 | 12.7269 | 74.0531 | 31.0556 | 30.4602 | | BlenderbotSmallForConditionalGeneration | 64 | 2.0492 | 9.1993 | nan | 133.6497 | 30.5982 | 30.3009 | | LayoutLMForSequenceClassification | 16 | 1.9501 | 6.5953 | 9.7023 | 79.7893 | 28.5088 | 27.2448 | | PLBartForConditionalGeneration | 4 | 1.6368 | 7.3634 | nan | 108.089 | 26.9002 | 26.0309 | | GoogleFnet | 16 | 0.9318 | 3.0193 | nan | 43.495 | 26.8565 | 19.0745 | | ElectraForCausalLM | 32 | 1.6225 | 6.1531 | nan | 77.693 | 26.6006 | 24.5567 | | PegasusForCausalLM | 32 | 1.2347 | 5.336 | 8.473 | 79.4205 | 22.4131 | 21.0852 | | MBartForCausalLM | 4 | 1.2626 | 5.4075 | 8.0424 | 82.1473 | 21.7718 | 20.9931 | | LayoutLMForMaskedLM | 16 | 1.916 | 6.4188 | nan | 82.6718 | 21.5891 | 20.7609 | | BertForMaskedLM | 16 | 1.5672 | 5.9267 | nan | 79.936 | 21.3769 | 20.1333 | | ElectraForQuestionAnswering | 64 | 1.6093 | 6.1943 | nan | 78.1847 | 20.7136 | 20.2141 | | RobertaForCausalLM | 16 | 1.6118 | 6.0443 | nan | 83.3711 | 20.2211 | 19.494 | | BertForQuestionAnswering | 16 | 1.5574 | 6.1191 | 8.6049 | 76.5355 | 19.9376 | 19.7849 | | BartForCausalLM | 4 | 1.2563 | 5.3331 | 8.1716 | 76.3865 | 19.9363 | 19.4326 | | CamemBert | 16 | 1.6309 | 5.9829 | nan | 82.0165 | 19.4646 | 19.1925 | | RobertaForQuestionAnswering | 16 | 1.5573 | 5.9879 | 8.7823 | 81.1488 | 
19.3593 | 18.6263 | | OPTForCausalLM | 2 | 1.3226 | 5.4692 | nan | 71.7025 | 18.2551 | 17.1555 | | GPT2ForSequenceClassification | 4 | 1.5073 | 5.5543 | nan | 62.3742 | 16.9617 | 16.2604 | | AlbertForMaskedLM | 4 | 1.3534 | 5.3137 | nan | 107.8483 | 15.8749 | 15.8669 | | AlbertForQuestionAnswering | 4 | 1.3924 | 5.3568 | nan | 104.0931 | 15.4587 | 15.4637 | | BlenderbotSmallForCausalLM | 64 | 0.8318 | 3.6648 | 5.2944 | 52.1539 | 14.3391 | 14.0226 | | DistillGPT2 | 16 | 0.8096 | 2.9767 | nan | 34.9997 | 13.904 | 13.1668 | | Speech2Text2ForCausalLM | 256 | 0.7424 | 2.8969 | 4.8886 | 40.3979 | 13.4265 | 12.6263 | | PLBartForCausalLM | 8 | 0.7166 | 2.8202 | nan | 43.9236 | 13.2033 | 12.5093 | | DistilBertForMaskedLM | 128 | 0.6737 | 3.0081 | nan | 45.6007 | 11.575 | 11.0144 | | DistilBertForQuestionAnswering | 256 | 0.7112 | 2.9815 | nan | 40.6519 | 10.8292 | 10.2968 | | BlenderbotForCausalLM | 4 | 2.3459 | 10.5679 | nan | 161.0645 | nan | 38.6227 | | TrOCRForCausalLM | 32 | 1.2519 | 5.3882 | nan | 74.7069 | nan | 19.4223 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 1.0 | 0.9092 | nan | 1.1724 | 1.0595 | 1.1588 | | XLNetLMHeadModel | 8 | 1.0 | 0.9323 | nan | nan | 0.9946 | 0.9946 | | GoogleFnet | 16 | 0.9224 | 0.9224 | nan | 1.4614 | 0.9608 | 1.2768 | | PLBartForConditionalGeneration | 4 | 0.9999 | 0.9344 | nan | 1.274 | 0.9316 | 1.2234 | | 
OPTForCausalLM | 2 | 1.0001 | 0.9258 | nan | 1.0746 | 0.9068 | 1.1143 | | YituTechConvBert | 16 | 0.9966 | 0.9341 | nan | 0.9891 | 0.894 | 0.9822 | | DistillGPT2 | 16 | 1.0 | 0.8855 | nan | 1.055 | 0.8939 | 1.0108 | | M2M100ForConditionalGeneration | 16 | 1.0078 | 0.9084 | nan | 1.0113 | 0.8659 | 1.0404 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | 0.8646 | 1.4039 | | PegasusForConditionalGeneration | 32 | 0.9981 | 0.9529 | nan | 1.1152 | 0.8637 | 1.0262 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | 0.842 | 1.3737 | | PLBartForCausalLM | 8 | 1.0 | 0.8896 | nan | 1.0988 | 0.8367 | 1.0581 | | XGLMForCausalLM | 8 | 0.9848 | 0.9267 | 0.3971 | 0.9742 | 0.8157 | 0.9642 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8129 | 1.1049 | | T5Small | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8129 | 1.1049 | | ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.844 | 0.7929 | 0.9036 | | MBartForConditionalGeneration | 2 | 1.0 | 0.8931 | nan | 0.9681 | 0.7896 | 0.9837 | | PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | 1.0402 | 0.7774 | 0.9692 | | MT5ForConditionalGeneration | 16 | 1.0014 | 0.8793 | 0.4388 | 0.9365 | 0.7748 | 0.9324 | | BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.7734 | 0.958 | | MegatronBertForQuestionAnswering | 8 | 1.0 | 0.9223 | nan | 1.0616 | 0.7709 | 1.0379 | | MegatronBertForCausalLM | 4 | 1.0 | 0.9018 | 0.3475 | 0.9999 | 0.7673 | 1.0153 | | MBartForCausalLM | 4 | 1.0 | 0.9122 | 0.3642 | 1.0011 | 0.7326 | 0.9478 | | BertForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0273 | | RobertaForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0274 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.7189 | 1.0294 | | BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.7149 | 0.9466 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.7147 | 0.8647 | | ElectraForQuestionAnswering | 64 | 1.0 | 
0.9524 | nan | 1.1607 | 0.7054 | 1.0297 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 1.0067 | 0.6977 | 0.946 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | 0.695 | 0.9772 | | BertForMaskedLM | 16 | 1.0 | 0.9408 | nan | 0.9928 | 0.6945 | 0.9772 | | CamemBert | 16 | 1.0 | 0.9388 | nan | 0.987 | 0.6942 | 0.9746 | | RobertaForCausalLM | 16 | 1.0 | 0.9405 | nan | 0.9926 | 0.6942 | 0.9771 | | Speech2Text2ForCausalLM | 256 | 0.9545 | 0.8748 | 0.3515 | 0.9068 | 0.675 | 0.9168 | | DistilBertForQuestionAnswering | 256 | 1.0 | 0.9602 | nan | 1.1897 | 0.6589 | 0.9118 | | DistilBertForMaskedLM | 128 | 1.0 | 0.8847 | nan | 0.8827 | 0.6509 | 0.9194 | | DebertaV2ForMaskedLM | 1 | 1.0 | 0.9651 | nan | nan | 0.5682 | 0.9491 | | MobileBertForMaskedLM | 64 | 1.0 | 0.906 | 0.3175 | nan | 0.4951 | 0.6649 | | DebertaV2ForQuestionAnswering | 2 | 0.9842 | 0.9842 | nan | 0.9842 | 0.4735 | 0.984 | | MobileBertForQuestionAnswering | 128 | 1.0 | 0.9909 | nan | nan | 0.4145 | 0.535 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3554 | 0.9719 | 0.3862 | 1.0347 | | DebertaForQuestionAnswering | 8 | 0.9637 | 1.042 | 0.3072 | 1.1342 | 0.2902 | 1.1339 | | TrOCRForCausalLM | 32 | 1.0 | 0.8787 | nan | 0.9998 | nan | 0.9239 | | BlenderbotForCausalLM | 4 | 1.0001 | 0.8057 | nan | 0.8218 | nan | 0.8509 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForMaskedLM | 4 | 384.8858 | 
384.3916 | nan | 311.4306 | 309.0079 | 308.5419 | | AlbertForQuestionAnswering | 4 | 380.5019 | 380.7837 | nan | 307.8059 | 304.5128 | 304.2862 | | XLNetLMHeadModel | 8 | 375.0616 | 386.0647 | nan | nan | 220.4311 | 219.9121 | | PegasusForConditionalGeneration | 32 | 176.0379 | 179.5955 | nan | 179.2592 | 173.8705 | 174.0042 | | MegatronBertForQuestionAnswering | 8 | 171.8313 | 173.4356 | nan | 159.6495 | 141.2831 | 142.8988 | | BartForConditionalGeneration | 2 | 149.8032 | 151.6276 | nan | 335.4358 | 137.1948 | 138.2866 | | MBartForConditionalGeneration | 2 | 149.6439 | 151.7363 | nan | 146.6858 | 136.7712 | 137.54 | | YituTechConvBert | 16 | 155.0437 | 160.0814 | nan | 154.5496 | 132.5049 | 132.2456 | | DistilBertForQuestionAnswering | 256 | 144.4821 | 144.5117 | nan | 183.0799 | 124.7987 | 125.5174 | | MobileBertForQuestionAnswering | 128 | 140.7524 | 148.0984 | nan | nan | 124.7662 | 129.1776 | | DistilBertForMaskedLM | 128 | 122.1443 | 128.0427 | nan | 152.1953 | 120.4144 | 118.5154 | | MobileBertForMaskedLM | 64 | 153.5487 | 158.9376 | 193.8995 | nan | 119.7632 | 126.2742 | | CamemBert | 16 | 135.7214 | 139.8179 | nan | 127.7555 | 115.818 | 115.225 | | BlenderbotSmallForConditionalGeneration | 64 | 118.8063 | 126.6818 | nan | 125.2766 | 115.2392 | 114.4605 | | LayoutLMForMaskedLM | 16 | 136.8107 | 140.8743 | nan | 128.8095 | 114.7526 | 114.2045 | | BertForMaskedLM | 16 | 134.207 | 138.3069 | nan | 126.3021 | 114.4444 | 113.6933 | | DebertaV2ForQuestionAnswering | 2 | 118.5826 | 127.1196 | nan | 170.0926 | 114.1004 | 115.4305 | | BartForCausalLM | 4 | 123.3865 | 127.4222 | 163.4244 | 123.4654 | 114.0007 | 112.8948 | | MBartForCausalLM | 4 | 123.1771 | 127.4938 | 163.5867 | 123.4051 | 113.7047 | 112.5709 | | RobertaForCausalLM | 16 | 142.3409 | 146.6326 | nan | 134.9218 | 111.9876 | 111.7492 | | PLBartForConditionalGeneration | 4 | 120.4752 | 125.1769 | nan | 127.079 | 104.9356 | 104.3197 | | M2M100ForConditionalGeneration | 16 | 106.3645 | 122.4111 | nan | 
111.7105 | 104.0437 | 102.0933 | | PLBartForCausalLM | 8 | 117.9346 | 123.9433 | nan | 120.2813 | 102.1502 | 99.6524 | | OPTForCausalLM | 2 | 168.7798 | 181.1501 | nan | 190.087 | 93.8235 | 93.0284 | | PegasusForCausalLM | 32 | 85.5191 | 89.6103 | 116.8134 | 89.7136 | 88.3722 | 87.119 | | DebertaV2ForMaskedLM | 1 | 101.2218 | 119.345 | nan | nan | 87.6608 | 103.9377 | | ElectraForQuestionAnswering | 64 | 125.5647 | 126.6205 | nan | 104.4675 | 87.4549 | 88.5558 | | LayoutLMForSequenceClassification | 16 | 112.9698 | 114.3612 | 153.2764 | 101.1932 | 86.4036 | 87.6333 | | RobertaForQuestionAnswering | 16 | 110.9471 | 112.0465 | 152.612 | 99.5075 | 86.3814 | 87.0417 | | BertForQuestionAnswering | 16 | 110.4812 | 111.6704 | 150.8752 | 99.1167 | 86.0589 | 87.2968 | | MegatronBertForCausalLM | 4 | 101.8618 | 103.4551 | 141.0305 | 105.1971 | 84.1742 | 85.073 | | DistillGPT2 | 16 | 120.628 | 126.5856 | nan | 130.9344 | 83.9721 | 81.2399 | | DebertaForQuestionAnswering | 8 | 82.1527 | 83.0723 | 119.7327 | 94.52 | 77.9357 | 66.9424 | | ElectraForCausalLM | 32 | 105.6767 | 114.6459 | nan | 103.8448 | 75.0727 | 72.8313 | | T5ForConditionalGeneration | 4 | 103.8908 | 111.1779 | 143.583 | 94.2688 | 74.0042 | 73.8968 | | T5Small | 4 | 104.0139 | 111.06 | 144.5832 | 94.9199 | 73.947 | 73.5931 | | XGLMForCausalLM | 8 | 78.9498 | 84.9566 | 108.9176 | 257.1195 | 70.2604 | 69.9426 | | GoogleFnet | 16 | 101.6556 | 101.7364 | nan | 66.0229 | 70.258 | 65.1254 | | BlenderbotSmallForCausalLM | 64 | 64.8162 | 70.9363 | 94.694 | 71.5099 | 67.6922 | 65.4556 | | Speech2Text2ForCausalLM | 256 | 63.7984 | 68.9865 | 97.9061 | 68.0468 | 64.5403 | 62.0913 | | MT5ForConditionalGeneration | 16 | 88.1631 | 96.3581 | 97.0014 | 85.8661 | 58.047 | 59.9583 | | GPT2ForSequenceClassification | 4 | 102.1254 | 104.5018 | nan | 146.1337 | 57.621 | 58.0486 | | DebertaForMaskedLM | 4 | 72.5865 | 75.3012 | 82.9967 | 93.3286 | 55.838 | 57.7514 | | TrOCRForCausalLM | 32 | 167.1077 | 174.4865 | nan | 172.9842 | nan | 
162.6365 | | BlenderbotForCausalLM | 4 | 92.9462 | 94.6176 | nan | 98.1592 | nan | 92.0118 | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
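As a rough cross-check between the tables above, the ratio of eager latency to backend latency should land close to the reported speedup for the same model. The sketch below uses three rows copied from the huggingface tables. The helper `latency_ratio` is hypothetical, and how the dashboard actually computes speedup is not stated in this thread, so small differences between the ratio and the reported number are expected.

```python
# (eager_ms, inductor_ms, reported inductor speedup), copied from the
# huggingface "Absolute latency (ms)" and "Performance speedup" tables.
samples = {
    "OPTForCausalLM": (168.7798, 93.8235, 1.8033),
    "GPT2ForSequenceClassification": (102.1254, 57.621, 1.772),
    "XLNetLMHeadModel": (375.0616, 220.4311, 1.6849),
}


def latency_ratio(eager_ms, backend_ms):
    """Approximate speedup as eager latency divided by backend latency."""
    return eager_ms / backend_ms


ratios = {name: latency_ratio(e, b) for name, (e, b, _) in samples.items()}

for name, (_, _, reported) in samples.items():
    print(f"{name}: latency ratio={ratios[name]:.4f}, reported speedup={reported}")
```

The ratios agree with the reported speedups to within a few hundredths, which suggests the two tables are consistent with each other even if they come from separate measurements.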

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | ghostnet_100 | 128 | 0.9995 | 0.9737 | 0.8285 | 1.2824 | 1.8706 | 1.8304 | | lcnet_050 | 128 | 0.9566 | 0.9498 | 0.7693 | 1.3518 | 1.6614 | 1.6318 | | regnety_002 | 128 | 0.9786 | 1.0001 | 0.855 | 0.9647 | 1.4882 | 1.3444 | | hrnet_w18 | 128 | 0.9998 | 0.9986 | 0.0 | 1.2728 | 1.4211 | 1.3785 | | dla102 | 128 | 0.9999 | 1.0007 | 0.0 | 1.2836 | 1.3857 | 1.3655 | | volo_d1_224 | 64 | 1.0 | 0.9959 | 0.8034 | 0.0 | 1.3823 | 1.3623 | | res2net50_14w_8s | 128 | 0.9998 | 0.9996 | 0.0 | 1.2513 | 1.3556 | 1.3245 | | coat_lite_mini | 128 | 1.0 | 1.0002 | 0.845 | 1.0905 | 1.3515 | 1.3378 | | xcit_large_24_p8_224 | 5 | 1.0026 | 0.9889 | 0.7847 | 0.0 | 1.3461 | 1.3083 | | mobilenetv2_100 | 128 | 0.9657 | 0.9639 | 0.7073 | 1.287 | 1.336 | 1.355 | | mobilenetv3_large_100 | 128 | 0.9669 | 0.9602 | 0.7654 | 1.2864 | 1.331 | 1.3498 | | gluon_inception_v3 | 128 | 1.0001 | 0.9985 | 0.0 | 1.1286 | 1.3285 | 1.3067 | | inception_v3 | 128 | 1.0 | 0.9977 | 0.0 | 1.1291 | 1.3279 | 1.3103 | | adv_inception_v3 | 128 | 1.0 | 0.9991 | 0.0 | 1.1297 | 1.3279 | 1.3078 | | crossvit_9_240 | 128 | 0.9996 | 0.9982 | 0.7603 | 1.0395 | 1.3257 | 1.2942 | | res2next50 | 128 | 1.0 | 1.0009 | 0.0 | 1.1815 | 1.3114 | 1.2748 | | resnest101e | 64 | 0.9997 | 1.0038 | 0.0 | 1.1689 | 1.3109 | 1.2697 | | fbnetv3_b | 128 | 0.9645 | 0.961 | 0.7619 | 1.2402 | 1.2821 | 1.2986 | | botnet26t_256 | 128 | 0.9856 | 0.9852 | 0.79 | 0.0 | 1.2769 | 1.2727 | | eca_botnext26ts_256 | 128 | 0.9869 | 0.7728 | 0.0 | 0.0 | 1.2719 | 1.2567 | | selecsls42b | 128 | 0.9999 | 0.9987 | 0.8156 | 1.2163 | 1.2678 | 1.2536 | | sebotnet33ts_256 | 64 | 
0.9762 | 0.8065 | 0.0 | 0.0 | 1.2674 | 1.2704 | | mnasnet_100 | 128 | 0.9657 | 0.9635 | 0.7856 | 1.2526 | 1.2584 | 1.2789 | | tf_efficientnet_b0 | 128 | 0.9774 | 0.7841 | 0.0 | 1.1642 | 1.2584 | 1.2693 | | fbnetc_100 | 128 | 0.9667 | 0.9632 | 0.7913 | 1.2461 | 1.2488 | 1.2664 | | eca_halonext26ts | 128 | 0.987 | 0.7786 | 0.0 | 0.0 | 1.2476 | 1.2384 | | jx_nest_base | 32 | 0.9998 | 0.9951 | 0.7304 | 0.0 | 1.2475 | 1.2145 | | gmixer_24_224 | 128 | 0.9998 | 0.8347 | 0.0 | 1.0825 | 1.247 | 1.2818 | | ese_vovnet19b_dw | 128 | 0.9794 | 0.9781 | 0.7451 | 1.1501 | 1.2391 | 1.2469 | | spnasnet_100 | 128 | 0.9618 | 0.9554 | 0.7745 | 1.2159 | 1.2358 | 1.2562 | | cspdarknet53 | 64 | 0.958 | 0.9546 | 0.7368 | 1.1738 | 1.2348 | 1.2467 | | res2net101_26w_4s | 64 | 0.9999 | 0.996 | 0.7734 | 1.1161 | 1.2272 | 1.1885 | | convit_base | 64 | 0.9997 | 0.9989 | 0.0 | 0.0 | 1.2165 | 1.2044 | | cait_m36_384 | 4 | 0.9997 | 0.9986 | 0.0 | 0.0 | 1.2158 | 1.1883 | | gmlp_s16_224 | 128 | 0.9998 | 0.9994 | 0.0 | 1.0918 | 1.2151 | 1.2015 | | rexnet_100 | 128 | 0.9734 | 0.8169 | 0.0 | 1.1609 | 1.2142 | 1.216 | | pnasnet5large | 16 | 0.9996 | 0.9983 | 0.0 | 1.0894 | 1.2088 | 1.1912 | | dm_nfnet_f0 | 128 | 0.9996 | 1.0004 | 0.0 | 1.1376 | 1.1934 | 1.1609 | | dpn107 | 32 | 0.9583 | 0.9511 | 0.7805 | 1.0223 | 1.1898 | 1.2022 | | tf_mixnet_l | 128 | 0.9857 | 0.8899 | 0.0 | 1.0947 | 1.1889 | 1.1865 | | tinynet_a | 128 | 0.966 | 0.776 | 0.6212 | 1.1475 | 1.1868 | 1.1961 | | pit_b_224 | 64 | 1.0001 | 0.9996 | 0.0 | 1.0321 | 1.186 | 1.175 | | twins_pcpvt_base | 64 | 0.9999 | 0.9987 | 0.7503 | 0.0 | 1.1775 | 1.1497 | | mobilevit_s | 64 | 0.9801 | 0.762 | 0.0 | 0.0 | 1.1751 | 1.1728 | | poolformer_m36 | 64 | 0.9999 | 0.9995 | 0.0 | 0.0 | 1.1702 | 1.1511 | | repvgg_a2 | 128 | 0.9646 | 0.9636 | 0.8272 | 1.1369 | 1.1701 | 1.1685 | | mixnet_l | 128 | 0.9846 | 0.8861 | 0.0 | 1.098 | 1.1688 | 1.1738 | | nfnet_l0 | 128 | 1.0003 | 0.7886 | 0.0 | 1.1054 | 1.1574 | 1.1174 | | swin_base_patch4_window7_224 | 64 | 
0.9999 | 0.9792 | 0.0 | 0.0 | 1.1244 | 1.1167 | | beit_base_patch16_224 | 64 | 0.9999 | 0.9821 | 0.0 | 0.0 | 1.1123 | 1.1004 | | swsl_resnext101_32x16d | 32 | 0.9999 | 0.9988 | 0.0 | 1.1095 | 1.11 | 1.0714 | | deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9991 | 0.7679 | 0.9807 | 1.0939 | 1.0845 | | gluon_xception65 | 32 | 0.9999 | 0.9976 | 0.0 | 1.0808 | 1.0881 | 1.075 | | vit_base_patch16_224 | 64 | 0.9999 | 0.9991 | 0.7678 | 0.9507 | 1.0853 | 1.0709 | | convmixer_768_32 | 32 | 0.9999 | 0.9999 | 0.0 | 0.0 | 1.0765 | 1.0734 | | gernet_l | 128 | 0.9742 | 0.9727 | 0.8248 | 1.0982 | 1.0753 | 1.0715 | | mixer_b16_224 | 128 | 0.999 | 0.9999 | 0.0 | 0.8918 | 1.0711 | 1.0641 | | visformer_small | 128 | 0.9997 | 1.0027 | 0.7977 | 0.0 | 1.0433 | 1.0085 | | convnext_base | 64 | 0.9999 | 0.9987 | 0.0 | 0.0 | 1.0399 | 1.0297 | | resmlp_12_224 | 128 | 0.9999 | 1.0009 | 0.695 | 1.2074 | 0.9873 | 0.9524 | | tnt_s_patch16_224 | 128 | 0.9998 | 0.9993 | 0.0 | 0.0 | 0.0 | 1.5123 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b 
| 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | 
pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | 
eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 6.3685 | 28.0708 | nan | 589.8092 | 111.9136 | 107.4751 | | xcit_large_24_p8_224 | 5 | 2.9538 | 16.6361 | 27.8732 | nan | 81.1279 | 78.4685 | | twins_pcpvt_base | 64 | 2.2728 | 12.1824 | 20.1215 | nan | 79.241 | 76.8825 | | swin_base_patch4_window7_224 | 64 | 2.7376 | 12.0076 | nan | nan | 76.9816 | 75.6852 | | mobilevit_s | 64 | 1.7004 | 6.7945 | nan | nan | 76.2878 | 75.227 | | pnasnet5large | 16 | 4.6562 | 20.1125 | nan | 311.5273 | 73.8913 | 70.5931 | | cait_m36_384 | 4 | 3.051 | 16.5323 | nan | nan | 64.2441 | 61.7538 | | dm_nfnet_f0 | 128 | 2.161 | 7.1435 | nan | 145.1406 | 61.7311 | 60.6439 | | resnest101e | 64 | 3.2728 | 14.3192 | nan | 258.0909 | 58.6004 | 57.4501 | | coat_lite_mini | 128 | 1.0794 | 4.7211 | 7.0697 | 94.6367 | 57.4588 | 57.0241 | | res2net101_26w_4s | 64 | 3.1452 | 15.4078 | 25.1352 | 223.3428 | 54.7243 | 51.7715 | | jx_nest_base | 32 | 1.7729 | 8.3614 | 14.4654 | nan | 53.9989 | 51.9175 | | eca_halonext26ts | 128 | 1.5009 | 5.354 | nan | nan | 51.3829 | 49.4842 | | res2net50_14w_8s | 128 | 2.8154 | 14.0868 | nan | 240.0937 | 50.1291 | 47.7679 | | poolformer_m36 | 64 | 1.7557 | 7.0618 | nan | nan | 48.6925 | 46.3437 | | nfnet_l0 | 128 | 1.863 | 6.8 | nan | 128.8747 | 47.6199 | 46.8229 | | convnext_base | 64 | 1.3376 | 5.8223 | nan | nan | 45.6141 | 43.8009 | | dpn107 | 32 | 3.9937 | 12.6184 | 35.8603 | 163.996 | 43.0245 | 40.5799 | | sebotnet33ts_256 | 64 | 1.6361 | 5.9911 | nan | nan | 40.8077 | 40.3498 | | gmlp_s16_224 | 128 | 1.0478 | 6.0813 | nan | 136.4705 | 40.2473 | 38.7087 | | crossvit_9_240 | 128 | 1.4633 | 7.8059 | 11.1711 | 151.0799 | 37.6862 | 36.9067 | | fbnetv3_b | 128 | 3.2377 | 10.2436 | 25.9242 | 213.0883 | 36.9047 | 34.9009 | | gluon_xception65 | 32 | 1.9467 | 9.9773 | nan | 
153.5465 | 36.6896 | 34.7356 | | volo_d1_224 | 64 | 1.2946 | 6.8493 | 10.6829 | nan | 36.6057 | 36.0427 | | eca_botnext26ts_256 | 128 | 1.4353 | 4.8445 | nan | nan | 34.4214 | 33.1422 | | gluon_inception_v3 | 128 | 1.6229 | 7.7293 | nan | 142.9188 | 33.682 | 31.7525 | | tf_mixnet_l | 128 | 5.6827 | 12.3617 | nan | 157.819 | 33.5952 | 32.9156 | | inception_v3 | 128 | 1.6307 | 7.7277 | nan | 139.9946 | 33.4428 | 31.7059 | | adv_inception_v3 | 128 | 1.6337 | 7.7596 | nan | 145.2061 | 33.1967 | 31.7352 | | ghostnet_100 | 128 | 2.9189 | 8.9743 | 13.1119 | 160.2584 | 33.0239 | 31.675 | | mixnet_l | 128 | 5.3968 | 11.9903 | nan | 156.0097 | 32.5856 | 30.7832 | | dla102 | 128 | 1.8582 | 8.6105 | nan | 170.617 | 31.7082 | 30.3626 | | gmixer_24_224 | 128 | 1.2069 | 6.6419 | nan | 129.6516 | 31.2173 | 29.7493 | | swsl_resnext101_32x16d | 32 | 1.7816 | 8.3356 | nan | 121.2739 | 30.9766 | 29.219 | | botnet26t_256 | 128 | 1.2935 | 3.9758 | 8.639 | nan | 30.8566 | 30.0014 | | convit_base | 64 | 1.1101 | 5.3246 | nan | nan | 29.0997 | 27.5367 | | res2next50 | 128 | 1.5966 | 7.5431 | nan | 155.5124 | 28.4663 | 27.4995 | | tinynet_a | 128 | 2.1849 | 7.3485 | 17.7447 | 147.7585 | 27.0751 | 26.2369 | | rexnet_100 | 128 | 1.9048 | 6.8719 | nan | 144.6068 | 26.8289 | 26.5836 | | convmixer_768_32 | 32 | 1.2143 | 5.6168 | nan | nan | 23.5143 | 21.609 | | cspdarknet53 | 64 | 2.2492 | 6.8691 | 17.4143 | 122.6502 | 23.3475 | 21.8947 | | tf_efficientnet_b0 | 128 | 1.7926 | 6.2769 | nan | 130.6088 | 22.9754 | 22.1609 | | fbnetc_100 | 128 | 2.0168 | 6.1327 | 16.0739 | 110.8801 | 22.0603 | 21.2017 | | mixer_b16_224 | 128 | 0.7147 | 2.966 | nan | 70.2827 | 21.9937 | 21.4186 | | visformer_small | 128 | 1.0011 | 3.8281 | 5.7378 | nan | 21.9619 | 21.4072 | | resmlp_12_224 | 128 | 0.6747 | 2.631 | 4.228 | 36.1381 | 21.8314 | 21.2158 | | spnasnet_100 | 128 | 1.9939 | 6.2337 | 15.7758 | 113.4782 | 21.7485 | 20.797 | | pit_b_224 | 64 | 1.0987 | 4.4384 | nan | 94.3097 | 20.6739 | 19.9663 | | 
beit_base_patch16_224 | 64 | 1.1938 | 5.2485 | nan | nan | 20.4461 | 19.5999 | | mobilenetv3_large_100 | 128 | 1.554 | 5.2732 | 12.309 | 119.41 | 20.4194 | 19.4433 | | deit_base_distilled_patch16_224 | 64 | 0.8116 | 3.8888 | 5.9981 | 73.401 | 19.9774 | 19.2909 | | mnasnet_100 | 128 | 1.5761 | 4.8965 | 12.3344 | 87.3183 | 19.2111 | 17.9014 | | vit_base_patch16_224 | 64 | 0.8726 | 3.9934 | 6.1841 | 72.8874 | 19.1226 | 18.8155 | | mobilenetv2_100 | 128 | 1.6085 | 5.2549 | 12.3388 | 97.8977 | 19.0676 | 18.389 | | repvgg_a2 | 128 | 1.9853 | 5.7357 | 14.5783 | 159.1925 | 18.8623 | 18.2905 | | gernet_l | 128 | 1.914 | 5.6788 | 14.2797 | 85.7372 | 18.476 | 17.464 | | regnety_002 | 128 | 1.627 | 5.0863 | 11.9324 | 94.3648 | 17.824 | 17.2162 | | selecsls42b | 128 | 0.7505 | 3.3936 | 5.1691 | 74.7937 | 15.9062 | 15.3103 | | lcnet_050 | 128 | 1.0415 | 3.1984 | 6.9095 | 68.8354 | 13.4814 | 12.8025 | | ese_vovnet19b_dw | 128 | 1.0231 | 2.8124 | 6.1993 | 56.146 | 13.305 | 12.0508 | | tnt_s_patch16_224 | 128 | 1.7166 | 9.2125 | nan | nan | nan | 33.3436 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 1.6177 | 1.5612 | 1.6333 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2616 | 0.9898 | 1.351 | 1.5843 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2619 | 1.4738 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.2059 | 1.3819 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1917 | 1.1877 | 1.3424 | | mobilevit_s | 64 | 
0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1752 | 1.2828 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | nan | 1.1377 | 1.2737 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1376 | 1.2529 | | nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.7757 | 1.1264 | 1.3578 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1133 | 1.1802 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.0689 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 1.2304 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.9927 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 1.2337 | 0.9925 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | nan | 0.9882 | 1.0887 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 0.9853 | 1.1265 | | mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 1.4726 | 0.985 | 1.0539 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9837 | 1.0658 | | gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 1.0153 | 0.9766 | 0.9827 | | tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9765 | 1.1445 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | 0.3296 | nan | 0.9653 | 1.0595 | | dla102 | 128 | 0.9831 | 0.917 | nan | 0.953 | 0.9633 | 1.0419 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 0.952 | 1.0925 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9468 | 1.1098 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.7593 | 0.9435 | 1.0967 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.942 | 0.988 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9408 | 1.0412 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 
| 0.9382 | 0.993 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8647 | 0.9379 | 1.0122 | | jx_nest_base | 32 | 1.0002 | 0.8966 | 0.2864 | nan | 0.9348 | 1.0603 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9325 | 0.9919 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9285 | 1.0154 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.9152 | 0.9655 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0636 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0636 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9078 | 1.0156 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9068 | 1.0516 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9065 | 1.0615 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.894 | 0.9057 | 0.9838 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9049 | 0.9968 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.994 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.899 | 1.0046 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | 0.9946 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8821 | 1.0206 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8617 | 1.0396 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8525 | 1.0752 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.8485 | 1.0335 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.8189 | 0.9416 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | 1.3763 | 0.8169 | 0.8253 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.9856 | 0.8154 | 1.0235 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9926 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6571 | 0.7684 | 0.9902 | | convit_base | 64 | 
0.9977 | 0.8838 | nan | nan | 0.7449 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8657 | 0.282 | 1.1222 | 0.6742 | 0.9001 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8633 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 365.2274 | 365.2614 | nan | nan | 339.2699 | 340.185 | | hrnet_w18 | 128 | 416.2253 | 417.0103 | nan | 326.5343 | 293.3588 | 302.2165 | | convnext_base | 64 | 262.1159 | 262.3548 | nan | nan | 252.0175 | 254.3555 | | pnasnet5large | 16 | 289.3857 | 289.1286 | nan | 265.0226 | 239.2364 | 242.3503 | | tf_mixnet_l | 128 | 256.8385 | 284.4315 | nan | 231.3974 | 212.9643 | 213.2786 | | swin_base_patch4_window7_224 | 64 | 237.1538 | 241.9119 | nan | nan | 210.952 | 212.0296 | | mixnet_l | 128 | 247.5622 | 275.0362 | nan | 221.9389 | 208.5461 | 207.4511 | | swsl_resnext101_32x16d | 32 | 219.171 | 219.3709 | nan | 197.8528 | 197.7163 | 204.459 | | dla102 | 128 | 269.678 | 269.069 | nan | 209.9225 | 194.6265 | 197.2368 | | cait_m36_384 | 4 | 216.3834 | 216.2951 | nan | nan | 177.8365 | 182.0741 | | resnest101e | 64 | 230.762 | 229.7217 | nan | 197.3921 | 175.5883 | 181.3351 | | dm_nfnet_f0 | 128 | 206.4399 | 206.1735 | nan | 180.851 | 172.4815 | 177.1975 | | adv_inception_v3 | 128 | 226.4666 | 227.2055 | nan | 200.4713 | 170.6296 | 173.004 | | gluon_inception_v3 | 128 | 226.4911 | 226.8287 | nan | 201.125 | 170.6164 | 173.3706 | | inception_v3 | 128 | 226.3024 | 227.485 | nan | 200.5999 | 170.6098 | 173.0192 | | 
res2net50_14w_8s | 128 | 229.1413 | 229.405 | nan | 183.3061 | 169.0116 | 173.2046 | | gluon_xception65 | 32 | 182.8492 | 183.1218 | nan | 168.7967 | 168.0168 | 169.8382 | | convit_base | 64 | 196.2489 | 196.5178 | nan | nan | 161.5784 | 162.985 | | res2next50 | 128 | 206.9182 | 206.6474 | nan | 174.8818 | 157.6694 | 161.9801 | | dpn107 | 32 | 191.0936 | 192.357 | 234.2272 | 178.8549 | 153.6737 | 152.2419 | | nfnet_l0 | 128 | 176.5266 | 223.4915 | nan | 159.1062 | 152.0441 | 157.8254 | | gernet_l | 128 | 164.9622 | 165.378 | 193.2724 | 146.3909 | 149.6801 | 149.894 | | poolformer_m36 | 64 | 174.418 | 174.5273 | nan | nan | 149.1385 | 151.5425 | | mixer_b16_224 | 128 | 158.7331 | 158.0642 | nan | 176.9844 | 147.8137 | 148.6703 | | coat_lite_mini | 128 | 191.5913 | 191.3929 | 226.2726 | 175.5795 | 141.5394 | 142.9364 | | eca_halonext26ts | 128 | 169.3655 | 214.6665 | nan | nan | 133.8985 | 134.9387 | | pit_b_224 | 64 | 158.4011 | 158.4574 | nan | 153.3329 | 133.5524 | 134.7085 | | eca_botnext26ts_256 | 128 | 163.3409 | 208.684 | nan | nan | 126.8353 | 128.1575 | | gmlp_s16_224 | 128 | 151.9935 | 152.0673 | nan | 139.154 | 125.121 | 126.3265 | | res2net101_26w_4s | 64 | 151.7709 | 152.0687 | 195.9733 | 135.6413 | 123.6355 | 127.3466 | | visformer_small | 128 | 128.4152 | 128.1498 | 160.7732 | nan | 122.9709 | 127.2857 | | fbnetv3_b | 128 | 162.739 | 163.1552 | 205.9132 | 126.3813 | 122.4182 | 120.7129 | | botnet26t_256 | 128 | 152.1392 | 152.2934 | 190.0415 | nan | 117.4463 | 117.8741 | | gmixer_24_224 | 128 | 146.1951 | 175.0876 | nan | 135.0183 | 117.3388 | 114.0534 | | twins_pcpvt_base | 64 | 137.1023 | 137.0422 | 182.5678 | nan | 116.5847 | 119.4194 | | beit_base_patch16_224 | 64 | 128.3245 | 130.7178 | nan | nan | 115.5209 | 116.6013 | | volo_d1_224 | 64 | 153.4263 | 154.0114 | 191.2014 | nan | 111.008 | 112.4651 | | vit_base_patch16_224 | 64 | 119.0254 | 119.7739 | 155.0783 | 125.0981 | 110.5085 | 111.9288 | | deit_base_distilled_patch16_224 | 64 | 119.7891 | 
119.9004 | 156.0319 | 122.0432 | 110.4275 | 111.0473 | | repvgg_a2 | 128 | 127.1763 | 127.4408 | 146.672 | 107.9738 | 104.8742 | 104.9917 | | tf_efficientnet_b0 | 128 | 133.9741 | 167.0583 | nan | 112.4467 | 104.0972 | 103.0629 | | xcit_large_24_p8_224 | 5 | 135.3955 | 145.4859 | 172.6608 | nan | 101.7101 | 104.3017 | | cspdarknet53 | 64 | 130.2875 | 130.8169 | 169.3041 | 106.4311 | 101.0594 | 100.1929 | | mobilevit_s | 64 | 116.9966 | 150.3032 | nan | nan | 97.5447 | 97.8385 | | jx_nest_base | 32 | 121.4846 | 121.669 | 165.7401 | nan | 97.3815 | 99.8115 | | fbnetc_100 | 128 | 123.3375 | 123.764 | 150.9951 | 95.7324 | 95.6139 | 94.1598 | | rexnet_100 | 128 | 119.0533 | 141.9962 | nan | 99.8032 | 95.6027 | 95.4106 | | tinynet_a | 128 | 110.1302 | 136.9918 | 171.0372 | 92.5278 | 89.5562 | 88.8168 | | sebotnet33ts_256 | 64 | 114.418 | 138.4354 | nan | nan | 88.1194 | 87.8349 | | spnasnet_100 | 128 | 105.9633 | 106.7075 | 131.8347 | 83.8313 | 82.54 | 81.0476 | | ese_vovnet19b_dw | 128 | 99.5593 | 99.6246 | 130.9876 | 84.7703 | 78.7074 | 78.1501 | | mnasnet_100 | 128 | 98.7859 | 99.0966 | 121.5786 | 76.1763 | 75.8613 | 74.5722 | | crossvit_9_240 | 128 | 98.3399 | 98.6471 | 129.6388 | 94.7645 | 74.4209 | 76.0204 | | resmlp_12_224 | 128 | 71.1246 | 71.1229 | 102.4918 | 58.9038 | 72.2292 | 74.7269 | | selecsls42b | 128 | 89.5668 | 89.7577 | 109.9852 | 73.7351 | 70.6635 | 71.4713 | | mobilenetv2_100 | 128 | 97.6813 | 98.006 | 133.5999 | 73.2992 | 70.639 | 69.6959 | | mobilenetv3_large_100 | 128 | 85.3993 | 86.1197 | 108.2027 | 64.1692 | 62.1809 | 61.2413 | | ghostnet_100 | 128 | 114.7642 | 117.7727 | 138.5075 | 89.4628 | 61.4488 | 62.6466 | | regnety_002 | 128 | 52.9334 | 51.8273 | 60.4361 | 53.4512 | 35.1405 | 38.9156 | | lcnet_050 | 128 | 38.2983 | 38.5968 | 47.742 | 27.1025 | 22.0931 | 22.4513 | | tnt_s_patch16_224 | 128 | 471.5541 | 471.3175 | nan | nan | nan | 311.3415 | 
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/DGUl0NP.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/bwqOnRO.png)

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/hMbhCnC.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
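
As a rough illustration of the two steps described above (dropping models that fail accuracy, then normalizing against native PyTorch), the sketch below computes a geometric mean speedup in plain Python. The record layout (`accuracy`, `eager_ms`, `backend_ms`), the model names, and the choice of speedup as a latency ratio are assumptions made for this example, not the dashboard's actual code.

```python
import math

def geomean_speedup(results):
    """Geometric mean of eager_ms / backend_ms over models that pass accuracy.

    `results` is a list of dicts; the field names are hypothetical.
    """
    speedups = [
        r["eager_ms"] / r["backend_ms"]
        for r in results
        if r["accuracy"] == "pass"  # models failing accuracy are removed first
    ]
    if not speedups:
        return float("nan")
    # Geometric mean computed in log space.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Made-up rows: one 2x speedup, one 2x slowdown, one accuracy failure.
rows = [
    {"name": "model_a", "accuracy": "pass", "eager_ms": 200.0, "backend_ms": 100.0},
    {"name": "model_b", "accuracy": "pass", "eager_ms": 100.0, "backend_ms": 200.0},
    {"name": "model_c", "accuracy": "fail_accuracy", "eager_ms": 100.0, "backend_ms": 10.0},
]
print(f"{geomean_speedup(rows):.2f}x")
```

A geometric mean weights a 2x speedup and a 2x slowdown symmetrically, which is why it is a common choice for summarizing ratios.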

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 98%, 44/45  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 96%, 43/45  | 97%, 59/61  |
|     aot_cudagraphs     | 80%, 43/54 | 73%, 33/45  | 90%, 55/61  |
|    nvprims_nvfuser     | 56%, 30/54 |  7%, 3/45   | 52%, 32/61  |
|        inductor        | 81%, 44/54 | 87%, 39/45  | 89%, 54/61  |
| inductor_no_cudagraphs | 85%, 46/54 | 91%, 41/45  | 89%, 54/61  |
+------------------------+------------+-------------+-------------+
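
The `NN%, passed/total` cells in the table above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `passrate` is hypothetical, and counting only the `pass` status as a success is an assumption.

```python
def passrate(statuses):
    """Format a list of per-model statuses as 'NN%, passed/total'."""
    total = len(statuses)
    if total == 0:
        return "n/a"
    # Assumption: only the literal status "pass" counts as a success.
    passed = sum(1 for s in statuses if s == "pass")
    return f"{100 * passed / total:.0f}%, {passed}/{total}"

# 52 passing models out of 54, as in the eager/torchbench cell.
statuses = ["pass"] * 52 + ["fail_to_run", "fail_accuracy"]
print(passrate(statuses))
```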

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.24x    |    1.02x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.10x    |    1.09x    |
|        inductor        |   1.66x    |    1.62x    |    1.17x    |
| inductor_no_cudagraphs |   1.29x    |    1.54x    |    1.15x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.03    |    3.09     |    2.28     |
|       aot_eager        |    7.33    |    11.78    |    9.69     |
|     aot_cudagraphs     |   10.29    |    20.14    |    16.96    |
|    nvprims_nvfuser     |   61.99    |    99.19    |   144.26    |
|        inductor        |   76.19    |    41.09    |    77.56    |
| inductor_no_cudagraphs |   74.25    |    36.23    |    75.97    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.83x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.41x    |    0.37x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.07x    |    0.86x    |
|        inductor        |   0.78x    |    0.92x    |    0.88x    |
| inductor_no_cudagraphs |   0.93x    |    1.07x    |    1.03x    |
+------------------------+------------+-------------+-------------+

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the two most recent reports that actually run the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108

Previous report name: /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_691

Passrate diff
~~~
+------------------------+-------------+------------+------------+
| compiler               | suite       | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
| inductor               | torchbench  | 81%, 44/54 | 81%, 44/54 |
| inductor               | huggingface | 83%, 38/46 | 83%, 38/46 |
| inductor               | timm_models | 87%, 53/61 | 89%, 54/61 |
| inductor_no_cudagraphs | torchbench  | 87%, 47/54 | 87%, 47/54 |
| inductor_no_cudagraphs | huggingface | 85%, 39/46 | 85%, 39/46 |
| inductor_no_cudagraphs | timm_models | 89%, 54/61 | 89%, 54/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff
~~~
+------------------------+-------------+------------+-----------+
| compiler               | suite       | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
| inductor               | torchbench  | 1.65x      | 1.64x     |
| inductor               | huggingface | 1.64x      | 1.64x     |
| inductor               | timm_models | 1.18x      | 1.17x     |
| inductor_no_cudagraphs | torchbench  | 1.28x      | 1.28x     |
| inductor_no_cudagraphs | huggingface | 1.57x      | 1.56x     |
| inductor_no_cudagraphs | timm_models | 1.15x      | 1.15x     |
+------------------------+-------------+------------+-----------+
~~~
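
One way to read the selection rule for this diff (take the latest two reports that contain the compiler, skipping reports that did not run it) is sketched below. The report structure, the report names, and the function names are hypothetical; only the inductor speedup numbers are taken from the diff table.

```python
def latest_two_with(reports, compiler):
    """Return the (previous, current) reports that include `compiler`.

    `reports` is a list of (name, {compiler: {suite: value}}) ordered
    oldest to newest. This layout is an assumption for the sketch.
    """
    hits = [r for r in reports if compiler in r[1]]
    if len(hits) < 2:
        return None
    return hits[-2], hits[-1]

def summary_diff(reports, compiler):
    """Build (compiler, suite, prev_value, cur_value) rows."""
    pair = latest_two_with(reports, compiler)
    if pair is None:
        return []
    (_, prev), (_, cur) = pair
    return [
        (compiler, suite, prev[compiler].get(suite), value)
        for suite, value in cur[compiler].items()
    ]

# Hypothetical report names; the middle report did not run inductor.
reports = [
    ("report_prev", {"inductor": {"torchbench": 1.65, "huggingface": 1.64, "timm_models": 1.18}}),
    ("report_skipped", {"aot_eager": {"torchbench": 1.00}}),
    ("report_cur", {"inductor": {"torchbench": 1.64, "huggingface": 1.64, "timm_models": 1.17}}),
]
for row in summary_diff(reports, "inductor"):
    print(row)
```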

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+--------------------------------+---------------+------------------------+
| suite       | name                           | inductor      | inductor_no_cudagraphs |
+-------------+--------------------------------+---------------+------------------------+
| torchbench  | hf_BigBird                     | fail_to_run   | fail_to_run            |
| torchbench  | moco                           | fail_to_run   | fail_to_run            |
| torchbench  | hf_Longformer                  | fail_to_run   | fail_to_run            |
| torchbench  | dlrm                           | fail_to_run   | pass                   |
| torchbench  | timm_efficientdet              | fail_to_run   | fail_to_run            |
| torchbench  | functorch_dp_cifar10           | fail_to_run   | fail_to_run            |
| torchbench  | tacotron2                      | fail_to_run   | pass                   |
| torchbench  | mobilenet_v3_large             | fail_accuracy | fail_accuracy          |
| torchbench  | tts_angular                    | fail_accuracy | fail_accuracy          |
| torchbench  | vision_maskrcnn                | 0.0000        | 0.0000                 |
| huggingface | DebertaV2ForQuestionAnswering  | fail_to_run   | pass                   |
| huggingface | PLBartForConditionalGeneration | fail_to_run   | fail_to_run            |
| huggingface | YituTechConvBert               | fail_to_run   | fail_to_run            |
| huggingface | MBartForConditionalGeneration  | fail_to_run   | fail_to_run            |
| huggingface | AllenaiLongformerBase          | fail_to_run   | fail_to_run            |
| timm_models | convit_base                    | fail_to_run   | fail_to_run            |
| timm_models | eca_halonext26ts               | fail_to_run   | fail_accuracy          |
| timm_models | ghostnet_100                   | fail_accuracy | fail_accuracy          |
| timm_models | gluon_xception65               | fail_accuracy | fail_accuracy          |
| timm_models | resnest101e                    | fail_accuracy | fail_accuracy          |
| timm_models | hrnet_w18                      | fail_accuracy | fail_accuracy          |
| timm_models | spnasnet_100                   | fail_accuracy | fail_accuracy          |
+-------------+--------------------------------+---------------+------------------------+
~~~

Performance speedup warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | nvidia_deeprecommender        | 0.9267   | 0.9625                 |
| torchbench  | timm_regnet                   | 0.9052   | 0.8717                 |
| torchbench  | mobilenet_v2                  | 0.8754   | 0.8065                 |
| torchbench  | yolov3                        | 0.8647   | 0.8273                 |
| torchbench  | resnet50                      | 0.8492   | 0.8171                 |
| torchbench  | tacotron2                     | 0.0      | 0.8829                 |
| torchbench  | hf_GPT2_large                 | 0.0      | 1.8599                 |
| torchbench  | dlrm                          | 0.0      | 0.0                    |
| torchbench  | functorch_dp_cifar10          | 0.0      | 0.0                    |
| torchbench  | hf_BigBird                    | 0.0      | 0.0                    |
| torchbench  | hf_Longformer                 | 0.0      | 0.0                    |
| torchbench  | moco                          | 0.0      | 0.0                    |
| huggingface | DebertaV2ForMaskedLM          | 1.2553   | 0.9099                 |
| huggingface | DebertaV2ForQuestionAnswering | 1.1403   | 0.9292                 |
| huggingface | BlenderbotForCausalLM         | 0.0      | 1.1708                 |
| huggingface | YituTechConvBert              | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase         | 0.0      | 0.0                    |
| timm_models | fbnetc_100                    | 1.0818   | 0.9102                 |
| timm_models | selecsls42b                   | 0.898    | 0.9042                 |
| timm_models | tf_efficientnet_b0            | 0.8699   | 0.8251                 |
| timm_models | mobilenetv2_100               | 0.8388   | 0.653                  |
| timm_models | res2net101_26w_4s             | 0.8346   | 0.7918                 |
| timm_models | cspdarknet53                  | 0.8245   | 0.8131                 |
| timm_models | gernet_l                      | 0.8221   | 0.7974                 |
| timm_models | dla102                        | 0.8091   | 0.8273                 |
| timm_models | tinynet_a                     | 0.8021   | 0.7973                 |
| timm_models | resnest101e                   | 0.7811   | 0.8121                 |
| timm_models | ese_vovnet19b_dw              | 0.7736   | 0.656                  |
| timm_models | dpn107                        | 0.7689   | 0.744                  |
| timm_models | repvgg_a2                     | 0.7386   | 0.7585                 |
| timm_models | convmixer_768_32              | 0.7354   | 0.7056                 |
| timm_models | mobilevit_s                   | 0.7303   | 0.795                  |
| timm_models | gluon_xception65              | 0.7168   | 0.7179                 |
| timm_models | visformer_small               | 0.7162   | 0.6934                 |
| timm_models | eca_botnext26ts_256           | 0.715    | 0.6404                 |
| timm_models | botnet26t_256                 | 0.7091   | 0.6439                 |
| timm_models | rexnet_100                    | 0.703    | 0.7879                 |
| timm_models | swsl_resnext101_32x16d        | 0.7008   | 0.7167                 |
| timm_models | res2net50_14w_8s              | 0.6981   | 0.6538                 |
| timm_models | sebotnet33ts_256              | 0.6973   | 0.6548                 |
| timm_models | res2next50                    | 0.5878   | 0.5965                 |
| timm_models | eca_halonext26ts              | 0.0      | 0.0                    |
+-------------+-------------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-------------------------------+-----------+------------------------+
| suite       | name                          | inductor  | inductor_no_cudagraphs |
+-------------+-------------------------------+-----------+------------------------+
| torchbench  | yolov3                        | 1094.1028 | 1086.7458              |
| torchbench  | densenet121                   | 405.9136  | 405.616                |
| torchbench  | timm_efficientdet             | 302.3608  | 297.6612               |
| torchbench  | mobilenet_v3_large            | 168.3954  | 166.9492               |
| torchbench  | timm_efficientnet             | 158.4309  | 156.2107               |
| torchbench  | hf_T5_large                   | 123.5582  | 119.368                |
| torchbench  | mnasnet1_0                    | 120.5305  | 120.2723               |
| huggingface | DebertaV2ForQuestionAnswering | 151.5806  | 58.482                 |
| huggingface | DebertaV2ForMaskedLM          | 150.919   | 58.5016                |
| huggingface | XLNetLMHeadModel              | 134.4824  | 146.8149               |
| timm_models | hrnet_w18                     | 210.8754  | 205.0929               |
| timm_models | pnasnet5large                 | 205.9631  | 196.4466               |
| timm_models | res2net50_14w_8s              | 202.0239  | 201.0507               |
| timm_models | ghostnet_100                  | 198.5365  | 196.9742               |
| timm_models | res2net101_26w_4s             | 139.7464  | 136.6911               |
| timm_models | dpn107                        | 134.65    | 133.7754               |
| timm_models | twins_pcpvt_base              | 127.0139  | 125.1731               |
| timm_models | rexnet_100                    | 123.9916  | 122.0088               |
| timm_models | mobilevit_s                   | 121.1379  | 118.8476               |
+-------------+-------------------------------+-----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+---------------------------------+----------+------------------------+
| suite       | name                            | inductor | inductor_no_cudagraphs |
+-------------+---------------------------------+----------+------------------------+
| torchbench  | mobilenet_v2                    | 0.885    | 1.0922                 |
| torchbench  | timm_vision_transformer_large   | 0.879    | 0.9541                 |
| torchbench  | hf_Bert                         | 0.8735   | 0.942                  |
| torchbench  | hf_T5_large                     | 0.8541   | 0.8541                 |
| torchbench  | fastNLP_Bert                    | 0.8521   | 1.0681                 |
| torchbench  | Background_Matting              | 0.845    | 1.0424                 |
| torchbench  | hf_DistilBert                   | 0.8384   | 0.9049                 |
| torchbench  | timm_regnet                     | 0.836    | 1.0095                 |
| torchbench  | yolov3                          | 0.8316   | 0.9828                 |
| torchbench  | hf_Bart                         | 0.8224   | 1.0097                 |
| torchbench  | shufflenet_v2_x1_0              | 0.8124   | 0.9457                 |
| torchbench  | resnet152                       | 0.8056   | 0.9398                 |
| torchbench  | alexnet                         | 0.7973   | 1.0079                 |
| torchbench  | pytorch_unet                    | 0.7877   | 0.7907                 |
| torchbench  | pytorch_stargan                 | 0.7783   | 0.8847                 |
| torchbench  | vgg16                           | 0.7633   | 1.0588                 |
| torchbench  | drq                             | 0.752    | 0.9256                 |
| torchbench  | soft_actor_critic               | 0.7295   | 1.0368                 |
| torchbench  | timm_resnest                    | 0.7214   | 0.9567                 |
| torchbench  | timm_vision_transformer         | 0.7151   | 0.7249                 |
| torchbench  | timm_vovnet                     | 0.6882   | 0.8809                 |
| torchbench  | resnet50                        | 0.6745   | 0.8696                 |
| torchbench  | mnasnet1_0                      | 0.659    | 0.7667                 |
| torchbench  | mobilenet_v3_large              | 0.6583   | 0.8111                 |
| torchbench  | resnext50_32x4d                 | 0.651    | 0.7706                 |
| torchbench  | squeezenet1_1                   | 0.6336   | 0.7041                 |
| torchbench  | hf_Reformer                     | 0.5851   | 1.0017                 |
| torchbench  | lennard_jones                   | 0.5641   | 0.9993                 |
| torchbench  | nvidia_deeprecommender          | 0.5596   | 0.5596                 |
| torchbench  | resnet18                        | 0.5498   | 0.618                  |
| torchbench  | densenet121                     | 0.5389   | 0.6133                 |
| torchbench  | LearningToPaint                 | 0.4882   | 0.6195                 |
| torchbench  | pytorch_struct                  | 0.4235   | 0.4353                 |
| torchbench  | dcgan                           | 0.2123   | 0.2137                 |
| torchbench  | tacotron2                       | nan      | 0.4112                 |
| huggingface | DistilBertForMaskedLM           | 0.8716   | 0.9439                 |
| huggingface | M2M100ForConditionalGeneration  | 0.8688   | 1.0524                 |
| huggingface | Speech2Text2ForCausalLM         | 0.8672   | 0.9793                 |
| huggingface | ElectraForCausalLM              | 0.856    | 0.9327                 |
| huggingface | BlenderbotSmallForCausalLM      | 0.846    | 0.9426                 |
| huggingface | XGLMForCausalLM                 | 0.8055   | 0.9902                 |
| huggingface | MobileBertForMaskedLM           | 0.6698   | 0.9649                 |
| huggingface | DebertaV2ForMaskedLM            | 0.6117   | 0.9912                 |
| huggingface | MobileBertForQuestionAnswering  | 0.5988   | 0.8126                 |
| huggingface | DebertaV2ForQuestionAnswering   | 0.5266   | 0.9885                 |
| huggingface | DebertaForMaskedLM              | 0.409    | 1.0674                 |
| huggingface | DebertaForQuestionAnswering     | 0.3071   | 1.1614                 |
| timm_models | mobilenetv2_100                 | 0.8962   | 1.1046                 |
| timm_models | vit_base_patch16_224            | 0.8916   | 0.8968                 |
| timm_models | deit_base_distilled_patch16_224 | 0.8911   | 0.8962                 |
| timm_models | mixnet_l                        | 0.8815   | 0.98                   |
| timm_models | eca_botnext26ts_256             | 0.8765   | 1.1944                 |
| timm_models | dla102                          | 0.8723   | 1.0161                 |
| timm_models | fbnetv3_b                       | 0.8648   | 1.0056                 |
| timm_models | adv_inception_v3                | 0.8599   | 0.9862                 |
| timm_models | gluon_inception_v3              | 0.8599   | 0.9862                 |
| timm_models | inception_v3                    | 0.8599   | 0.9862                 |
| timm_models | swsl_resnext101_32x16d          | 0.852    | 0.9728                 |
| timm_models | dpn107                          | 0.8455   | 0.944                  |
| timm_models | gluon_xception65                | 0.8442   | 0.965                  |
| timm_models | cspdarknet53                    | 0.8369   | 0.9121                 |
| timm_models | crossvit_9_240                  | 0.8174   | 1.0976                 |
| timm_models | res2net101_26w_4s               | 0.8146   | 0.9442                 |
| timm_models | resmlp_12_224                   | 0.8092   | 0.8239                 |
| timm_models | ese_vovnet19b_dw                | 0.8041   | 1.0135                 |
| timm_models | convnext_base                   | 0.8022   | 1.0059                 |
| timm_models | selecsls42b                     | 0.7927   | 0.9534                 |
| timm_models | spnasnet_100                    | 0.787    | 0.9293                 |
| timm_models | coat_lite_mini                  | 0.7834   | 1.0066                 |
| timm_models | mnasnet_100                     | 0.7727   | 0.9234                 |
| timm_models | res2net50_14w_8s                | 0.7713   | 0.9528                 |
| timm_models | ghostnet_100                    | 0.7706   | 1.0052                 |
| timm_models | res2next50                      | 0.7697   | 0.9414                 |
| timm_models | hrnet_w18                       | 0.7605   | 0.9421                 |
| timm_models | swin_base_patch4_window7_224    | 0.7566   | 0.9257                 |
| timm_models | mobilenetv3_large_100           | 0.75     | 0.9634                 |
| timm_models | sebotnet33ts_256                | 0.7318   | 0.8133                 |
| timm_models | gernet_l                        | 0.7239   | 0.9336                 |
| timm_models | fbnetc_100                      | 0.7101   | 0.9306                 |
| timm_models | lcnet_050                       | 0.6955   | 0.8352                 |
| timm_models | jx_nest_base                    | 0.6668   | 0.8553                 |
| timm_models | botnet26t_256                   | 0.6615   | 0.9434                 |
| timm_models | regnety_002                     | 0.5858   | 0.8993                 |
| timm_models | repvgg_a2                       | 0.5572   | 0.8383                 |
+-------------+---------------------------------+----------+------------------------+
~~~

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108

Accuracy regressions
~~~
+----------+------+-------------+-------------+
| compiler | name | prev_status | cur_status  |
+----------+------+-------------+-------------+
| inductor | dlrm | pass        | fail_to_run |
+----------+------+-------------+-------------+
~~~

Performance speedup regressions
~~~
+------------------------+------------------------+-------------+------------+
| compiler               | name                   | prev_status | cur_status |
+------------------------+------------------------+-------------+------------+
| inductor               | nvidia_deeprecommender | 0.989       | 0.9267     |
| inductor_no_cudagraphs | dlrm                   | 1.0793      | 0.0        |
+------------------------+------------------------+-------------+------------+
~~~

Compilation latency (sec) regressions
~~~
+------------------------+-------------------+-------------+------------+
| compiler               | name              | prev_status | cur_status |
+------------------------+-------------------+-------------+------------+
| inductor               | hf_T5_large       | 118.7004    | 123.5582   |
| inductor               | mnasnet1_0        | 118.7061    | 120.5305   |
| inductor_no_cudagraphs | timm_efficientnet | 119.9608    | 156.2107   |
| inductor_no_cudagraphs | mnasnet1_0        | 116.5561    | 120.2723   |
+------------------------+-------------------+-------------+------------+
~~~

No regressions found.

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108

Performance speedup regressions
~~~
+----------+--------------------+-------------+------------+
| compiler | name               | prev_status | cur_status |
+----------+--------------------+-------------+------------+
| inductor | tf_efficientnet_b0 | 0.9674      | 0.8699     |
+----------+--------------------+-------------+------------+
~~~

No regressions found.
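The comparison described at the top of this section (a model counts as a regression only if it was not flagged in the previous report and is flagged in the current one) can be sketched roughly as below. This is an illustrative sketch, not the dashboard's actual code: the function name `newly_flagged` and both threshold constants are assumptions, with the thresholds picked only so that they agree with the torchbench rows listed above rather than read from the 'Warnings' section.

```python
from typing import Callable, Dict, List, Tuple

# Assumed thresholds: chosen to be consistent with the rows reported above,
# NOT taken from the dashboard's 'Warnings' section.
SPEEDUP_THRESHOLD = 0.95            # flag a model whose speedup falls below this
COMPILE_LATENCY_THRESHOLD = 120.0   # flag a model that compiles slower than this (sec)


def newly_flagged(
    prev: Dict[str, float],
    cur: Dict[str, float],
    is_flagged: Callable[[float], bool],
) -> List[Tuple[str, float, float]]:
    """Return (name, prev_value, cur_value) for models that were not
    flagged in the previous report but are flagged in the current one."""
    regressions = []
    for name, cur_value in cur.items():
        if name not in prev:
            continue  # no earlier data point to compare against
        prev_value = prev[name]
        if is_flagged(cur_value) and not is_flagged(prev_value):
            regressions.append((name, prev_value, cur_value))
    return regressions


# inductor / torchbench values copied from the regression tables above.
prev_latency = {"hf_T5_large": 118.7004, "mnasnet1_0": 118.7061}
cur_latency = {"hf_T5_large": 123.5582, "mnasnet1_0": 120.5305}
prev_speedup = {"nvidia_deeprecommender": 0.989}
cur_speedup = {"nvidia_deeprecommender": 0.9267}

latency_regressions = newly_flagged(
    prev_latency, cur_latency, lambda sec: sec > COMPILE_LATENCY_THRESHOLD
)
speedup_regressions = newly_flagged(
    prev_speedup, cur_speedup, lambda ratio: ratio < SPEEDUP_THRESHOLD
)

for name, before, after in latency_regressions + speedup_regressions:
    print(f"{name}: {before} -> {after}")
```

Because only the transition from unflagged to flagged is reported, a model that was already past the threshold in the previous report does not show up again, which keeps the daily list short.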

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0005 | 0.9246 | 2.5758 | 0.7349 | 5.1789 | 1.145 | | timm_efficientdet | 1 | 0.9862 | 0.8185 | 2.1584 | 0.0 | 4.0731 | 1.4635 | | BERT_pytorch | 16 | 1.0115 | 0.839 | 1.6003 | 0.809 | 3.4847 | 2.3985 | | timm_vision_transformer | 8 | 1.0023 | 0.8622 | 1.798 | 0.5919 | 3.1577 | 1.5977 | | dcgan | 32 | 0.9858 | 0.9252 | 1.6701 | 0.7033 | 2.8727 | 1.0362 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9973 | 0.9799 | 1.5222 | 0.0 | 2.8586 | 1.6861 | | hf_T5_large | 2 | 1.0184 | 0.8701 | 0.0 | 0.0 | 2.6539 | 2.216 | | resnet18 | 16 | 1.0006 | 1.0033 | 1.6117 | 0.8 | 2.644 | 1.2263 | | hf_Albert | 8 | 1.0017 | 0.9551 | 0.7749 | 0.0 | 2.38 | 2.321 | | mobilenet_v3_large | 32 | 1.0026 | 1.009 | 1.5147 | 0.7585 | 2.3794 | 1.1749 | | resnext50_32x4d | 8 | 1.0017 | 0.9668 | 1.9132 | 0.7407 | 2.2556 | 1.0397 | | drq | 1 | 0.9992 | 0.817 | 1.9347 | 0.6156 | 2.19 | 1.2015 | | pytorch_struct | 200 | 0.9855 | 0.7443 | 1.016 | 0.6002 | 2.1141 | 1.2829 | | lennard_jones | 1000 | 0.9585 | 0.7493 | 1.2609 | 0.4699 | 2.0884 | 1.0466 | | squeezenet1_1 | 32 | 0.9949 | 0.9685 | 1.4987 | 0.7119 | 2.0679 | 1.183 | | hf_Bert | 4 | 1.0357 | 0.8519 | 0.9487 | 0.0 | 2.0556 | 1.8739 | | hf_GPT2 | 4 | 1.0229 | 0.9836 | 0.8204 | 0.2891 | 1.9319 | 1.8951 | | hf_T5 | 8 | 0.9988 | 0.9244 | 0.0 | 1.3518 | 1.879 | 1.8673 | | mnasnet1_0 | 32 | 0.9996 | 1.0187 | 1.2769 | 0.7662 | 1.8593 | 1.1078 | | hf_Bart | 4 | 1.0093 | 0.8446 | 0.8956 | 0.0 | 1.8076 | 1.7144 | | LearningToPaint | 96 | 0.9995 | 1.0154 | 1.1466 | 0.8336 | 1.767 | 1.2172 | | soft_actor_critic | 256 | 0.962 | 0.7556 | 1.2687 | 
0.5344 | 1.7568 | 1.0655 | | timm_efficientnet | 32 | 0.9599 | 0.8055 | 1.1104 | 0.679 | 1.703 | 1.0051 | | shufflenet_v2_x1_0 | 128 | 1.0018 | 1.0215 | 0.9845 | 0.859 | 1.6518 | 1.2564 | | speech_transformer | 32 | 0.995 | 0.8256 | 1.7789 | 0.6364 | 1.5768 | 1.6056 | | attention_is_all_you_need_pytorch | 256 | 1.006 | 0.9131 | 0.8404 | 0.0 | 1.5761 | 1.4957 | | fastNLP_Bert | 6 | 0.9973 | 0.895 | 0.767 | 0.0 | 1.5325 | 1.4727 | | hf_DistilBert | 8 | 1.0 | 0.9707 | 0.7417 | 0.3635 | 1.5138 | 1.4785 | | pytorch_stargan | 16 | 0.9954 | 1.1039 | 1.049 | 0.0 | 1.4558 | 1.4049 | | pytorch_unet | 1 | 0.9988 | 0.211 | 0.0 | 0.0 | 1.3418 | 1.315 | | timm_nfnet | 128 | 0.9991 | 0.9998 | 0.8768 | 0.9188 | 1.3095 | 1.263 | | vgg16 | 64 | 0.9992 | 0.9972 | 0.8565 | 0.9711 | 1.2686 | 1.2553 | | Super_SloMo | 6 | 0.9995 | 0.1753 | 0.0 | 0.0 | 1.246 | 1.2131 | | alexnet | 128 | 0.9991 | 0.9972 | 0.8151 | 0.9254 | 1.2147 | 1.2147 | | resnet152 | 32 | 1.0008 | 1.0072 | 1.2647 | 0.0 | 1.1778 | 1.0342 | | hf_Reformer | 4 | 0.9975 | 1.0001 | 0.9928 | 0.6441 | 1.1778 | 1.1806 | | Background_Matting | 4 | 0.9999 | 0.1446 | 0.0 | 0.0 | 1.1342 | 1.1213 | | timm_vision_transformer_large | 8 | 0.9998 | 0.9904 | 0.0 | 0.0 | 1.113 | 1.0921 | | timm_resnest | 32 | 1.0029 | 1.0196 | 0.847 | 0.9667 | 1.109 | 1.0236 | | timm_vovnet | 32 | 0.9234 | 0.89 | 0.8765 | 0.7996 | 1.0776 | 0.9567 | | tts_angular | 64 | 0.9743 | 0.9506 | 0.9733 | 0.9537 | 1.0018 | 1.02 | | demucs | 4 | 1.002 | 0.9995 | 1.0001 | 0.9987 | 0.9975 | 1.0023 | | nvidia_deeprecommender | 256 | 0.9988 | 0.9966 | 0.6969 | 1.0067 | 0.9267 | 0.9625 | | timm_regnet | 32 | 0.9761 | 0.9443 | 0.9071 | 0.7809 | 0.9052 | 0.8717 | | mobilenet_v2 | 96 | 0.9996 | 0.9892 | 0.7594 | 1.0304 | 0.8754 | 0.8065 | | yolov3 | 16 | 0.9998 | 0.9903 | 0.8052 | 0.0 | 0.8647 | 0.8273 | | resnet50 | 32 | 1.0015 | 1.0277 | 1.0639 | 0.7996 | 0.8492 | 0.8171 | | tacotron2 | 64 | 0.9671 | 0.7592 | 0.9828 | 0.6041 | 0.0 | 0.8829 | | hf_GPT2_large | 4 | 0.9999 | 
0.9909 | 0.0 | 0.0 | 0.0 | 1.8599 | | dlrm | 1024 | 0.6446 | 0.7425 | 0.0 | 1.2037 | 0.0 | 0.0 | | functorch_dp_cifar10 | 64 | 0.9993 | 0.9513 | 2.4564 | 0.0 | 0.0 | 0.0 | | hf_BigBird | 2 | 0.9627 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | 
pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | 0.0000 | fail_to_run | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | 
pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | dlrm | 2 | pass | pass | 0.0000 | pass | fail_to_run | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+ | yolov3 | 16 | 3.0997 | 9.22 | 12.7649 | nan | 1094.1028 | 1086.7458 | | densenet121 | 4 | 2.4615 | 13.6089 | 20.789 | 218.5569 | 405.9136 | 405.616 | | timm_efficientdet | 1 | 20.4676 | 40.9281 | 78.1907 | nan | 302.3608 | 297.6612 | | mobilenet_v3_large | 32 | 1.0567 | 5.4289 | 7.8711 | 118.8959 | 168.3954 | 166.9492 | | timm_efficientnet | 32 | 1.9242 | 7.547 | 
16.4245 | 150.7609 | 158.4309 | 156.2107 | | hf_T5_large | 2 | 14.7254 | 43.4313 | nan | nan | 123.5582 | 119.368 | | mnasnet1_0 | 32 | 0.9675 | 4.934 | 6.99 | 87.8835 | 120.5305 | 120.2723 | | resnet152 | 32 | 2.7741 | 14.94 | 23.3568 | nan | 108.2571 | 107.5156 | | resnext50_32x4d | 8 | 1.0296 | 5.156 | 7.5048 | 84.4913 | 104.9607 | 105.2205 | | timm_vovnet | 32 | 1.6183 | 4.9282 | 10.5109 | 71.5853 | 90.8831 | 93.685 | | timm_regnet | 32 | 2.4325 | 9.0069 | 20.6127 | 139.4826 | 83.6405 | 82.8989 | | mobilenet_v2 | 96 | 0.9698 | 5.1222 | 7.5881 | 116.1868 | 79.1555 | 79.4461 | | shufflenet_v2_x1_0 | 128 | 1.1124 | 5.6662 | 8.2624 | 104.7127 | 78.0038 | 76.4153 | | timm_resnest | 32 | 0.7062 | 2.7687 | 4.1473 | 68.4845 | 77.7828 | 77.6348 | | resnet50 | 32 | 1.0103 | 5.3105 | 7.446 | 100.79 | 74.3784 | 73.9984 | | timm_vision_transformer_large | 8 | 3.1581 | 17.3041 | nan | nan | 73.3486 | 71.8931 | | timm_nfnet | 128 | 2.255 | 7.8326 | 11.3202 | 151.9351 | 67.2007 | 66.0244 | | squeezenet1_1 | 32 | 0.2749 | 1.1025 | 1.5315 | 7.5612 | 54.2228 | 53.796 | | resnet18 | 16 | 0.4755 | 2.0145 | 2.822 | 41.3785 | 42.0823 | 41.8559 | | timm_vision_transformer | 8 | 1.0374 | 5.1648 | 7.1682 | 80.616 | 35.3651 | 34.7035 | | LearningToPaint | 96 | 0.5041 | 2.1413 | 3.0846 | 47.9542 | 35.2728 | 34.6572 | | hf_Bart | 4 | 2.0821 | 10.1278 | 14.5608 | nan | 34.636 | 33.4718 | | BERT_pytorch | 16 | 1.8561 | 8.6257 | 12.5716 | 97.4908 | 34.0988 | 33.348 | | attention_is_all_you_need_pytorch | 256 | 1.4106 | 8.1356 | 12.3109 | nan | 32.5905 | 31.519 | | Background_Matting | 4 | 1.0535 | 9.5856 | nan | nan | 31.3125 | 30.5173 | | hf_T5 | 8 | 2.6913 | 9.9697 | nan | 83.4041 | 31.06 | 29.9341 | | fastNLP_Bert | 6 | 1.8858 | 8.0805 | 12.2199 | nan | 29.786 | 27.7464 | | speech_transformer | 32 | 2.03 | 9.9708 | 36.8223 | 157.4934 | 28.6058 | 27.1362 | | pytorch_stargan | 16 | 0.4632 | 2.2756 | 3.1689 | nan | 28.5967 | 27.4781 | | Super_SloMo | 6 | 1.2032 | 7.8907 | nan | nan | 22.537 | 
22.1144 | | hf_Bert | 4 | 1.9073 | 8.1798 | 11.1534 | nan | 21.828 | 21.263 | | hf_GPT2 | 4 | 1.728 | 7.1196 | 10.2279 | 83.1603 | 21.1378 | 20.1927 | | hf_Albert | 8 | 1.6327 | 7.3027 | 10.7298 | nan | 20.8476 | 20.2549 | | pytorch_struct | 200 | 0.2871 | 0.9397 | 1.5808 | 8.6685 | 20.0199 | 19.0174 | | hf_Reformer | 4 | 1.7175 | 3.1681 | 5.6383 | 18.036 | 18.3872 | 15.9454 | | hf_DistilBert | 8 | 0.8406 | 3.9397 | 6.3749 | 54.9044 | 14.3843 | 13.7991 | | pytorch_unet | 1 | 0.5184 | 3.4855 | nan | nan | 12.7073 | 12.3894 | | dcgan | 32 | 0.1823 | 0.4664 | 0.7029 | 5.3426 | 10.707 | 10.3811 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4738 | 2.2282 | 3.0305 | nan | 9.6573 | 9.3877 | | drq | 1 | 0.3295 | 0.7137 | 1.1101 | 7.1731 | 4.4814 | 3.7395 | | vgg16 | 64 | 0.21 | 0.7419 | 1.094 | 5.5652 | 4.2951 | 4.1839 | | alexnet | 128 | 0.1859 | 0.4967 | 0.7605 | 5.2579 | 3.8462 | 3.7239 | | nvidia_deeprecommender | 256 | 0.2041 | 0.4998 | 0.7795 | 6.2077 | 3.6257 | 3.4618 | | soft_actor_critic | 256 | 0.2246 | 0.4031 | 0.6328 | 3.075 | 3.6003 | 2.881 | | lennard_jones | 1000 | 0.1598 | 0.4029 | 0.5581 | 3.2779 | 2.1933 | 1.9683 | | tts_angular | 64 | 0.1961 | 0.2488 | 0.3809 | 1.6194 | 1.897 | 1.7215 | | demucs | 4 | 0.3575 | 0.3587 | 0.3576 | 0.3563 | 0.2672 | 0.2737 | | hf_GPT2_large | 4 | 5.929 | 22.0352 | nan | nan | nan | 55.3977 | | tacotron2 | 64 | 5.1609 | 20.438 | 34.9267 | 95.5292 | nan | 46.1971 | | dlrm | 1024 | 0.4746 | 0.921 | nan | 5.3749 | nan | nan | | functorch_dp_cifar10 | 64 | 0.35 | 1.5789 | 2.2903 | nan | nan | nan | | hf_BigBird | 2 | 3.9946 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | hf_Albert | 8 | 1.0001 | 0.936 | 0.3267 | nan | 1.1576 | 1.4693 | | speech_transformer | 32 | 0.9991 | 0.9812 | 0.334 | 1.1938 | 1.0924 | 1.0966 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3513 | nan | 1.024 | 1.176 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3082 | nan | 0.9837 | 1.1225 | | timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2717 | 0.4638 | 0.9694 | 1.2228 | | hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.38 | 1.1204 | 0.9649 | 1.1241 | | BERT_pytorch | 16 | 1.0 | 0.8825 | 0.3999 | 1.1118 | 0.9564 | 1.1347 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8637 | 0.4228 | nan | 0.9506 | 1.05 | | Super_SloMo | 6 | 1.0024 | 0.8284 | nan | nan | 0.9361 | 1.2946 | | timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3556 | 0.4816 | 0.9296 | 1.0969 | | hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.0304 | 0.928 | 1.247 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 0.885 | 1.0922 | | timm_vision_transformer_large | 8 | 0.9974 | 0.8357 | nan | nan | 0.879 | 0.9541 | | hf_Bert | 4 | 1.0 | 0.8759 | 0.3902 | nan | 0.8735 | 0.942 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8521 | 1.0681 | | Background_Matting | 4 | 1.0138 | 0.6522 | nan | nan | 0.845 | 1.0424 | | hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3415 | 1.0708 | 0.8384 | 0.9049 | | timm_regnet | 32 | 0.9953 | 0.8446 | 0.3491 | 0.8027 | 0.836 | 1.0095 | | yolov3 | 16 | 0.9908 | 0.8381 | 0.3536 | nan | 0.8316 | 0.9828 | 
| hf_Bart | 4 | 1.0002 | 0.8307 | 0.3634 | nan | 0.8224 | 1.0097 | | shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8124 | 0.9457 | | resnet152 | 32 | 0.9937 | 0.8956 | 0.3631 | nan | 0.8056 | 0.9398 | | alexnet | 128 | 0.951 | 0.7753 | 0.4794 | 0.775 | 0.7973 | 1.0079 | | pytorch_unet | 1 | 0.9968 | 0.7229 | nan | nan | 0.7877 | 0.7907 | | pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4252 | nan | 0.7783 | 0.8847 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.734 | 0.7633 | 1.0588 | | drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 | | timm_resnest | 32 | 0.9868 | 0.8711 | 0.3482 | 0.8451 | 0.7214 | 0.9567 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3918 | 1.0881 | 0.7151 | 0.7249 | | timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3407 | 0.7755 | 0.6882 | 0.8809 | | resnet50 | 32 | 0.9907 | 0.8629 | 0.356 | 0.7806 | 0.6745 | 0.8696 | | mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3407 | 0.8226 | 0.659 | 0.7667 | | mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3448 | 0.7921 | 0.6583 | 0.8111 | | resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3889 | 0.81 | 0.651 | 0.7706 | | squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.6336 | 0.7041 | | hf_Reformer | 4 | 0.9996 | 0.9996 | 0.6037 | 1.0 | 0.5851 | 1.0017 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.5641 | 0.9993 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 | | resnet18 | 16 | 0.9779 | 0.7727 | 0.3943 | 0.7314 | 0.5498 | 0.618 | | densenet121 | 4 | 0.9857 | 0.8678 | 0.3673 | 0.8452 | 0.5389 | 0.6133 | | LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3826 | 0.6701 | 0.4882 | 0.6195 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 | | dcgan | 32 | 0.9698 | 0.7838 | 0.4994 | 0.7838 | 0.2123 | 0.2137 | | hf_GPT2_large | 4 | 0.9956 | 0.8732 | nan | nan | nan | 1.1499 | | tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3993 | nan | 
0.4112 | | dlrm | 1024 | 0.8152 | 0.8152 | nan | 0.8149 | nan | nan | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4447 | nan | nan | nan | | hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | timm_vision_transformer_large | 8 | 184.3823 | 186.2037 | nan | nan | 165.7917 | 168.7383 | | Background_Matting | 4 | 133.8072 | 923.9395 | nan | nan | 117.7667 | 119.1051 | | timm_nfnet | 128 | 132.0118 | 131.4519 | 148.9431 | 143.3562 | 99.9836 | 104.1179 | | hf_T5 | 8 | 174.6872 | 188.4622 | nan | 129.0446 | 93.1037 | 93.2515 | | hf_T5_large | 2 | 220.5506 | 261.4146 | nan | nan | 88.8068 | 103.7275 | | timm_regnet | 32 | 73.2128 | 77.2652 | 80.9921 | 93.354 | 80.6868 | 84.7396 | | resnet152 | 32 | 90.7181 | 91.1336 | 73.2868 | nan | 80.6139 | 93.3194 | | yolov3 | 16 | 68.5482 | 69.2295 | 85.178 | nan | 79.5913 | 82.984 | | hf_Reformer | 4 | 82.3502 | 82.337 | 83.0096 | 127.8082 | 69.8668 | 69.6797 | | Super_SloMo | 6 | 79.4209 | 451.6428 | nan | nan | 63.8552 | 65.4854 | | demucs | 4 | 56.9276 | 56.9758 | 56.7597 | 57.089 | 57.5427 | 56.8816 | | mobilenet_v2 | 96 | 48.8217 | 49.3642 | 64.4586 | 47.4741 | 55.9231 | 60.5773 | | vgg16 | 64 | 66.1256 | 66.3783 | 77.0659 | 68.2455 | 52.1381 | 52.5829 | | timm_efficientdet | 1 | 165.3463 | 211.1981 | 76.7167 | nan | 41.0713 | 115.9288 | | resnet50 | 32 | 33.2385 | 
35.2839 | 32.2581 | 43.1358 | 40.4289 | 44.8044 | | speech_transformer | 32 | 61.7742 | 74.142 | 37.2008 | 95.5096 | 40.2708 | 39.6491 | | fastNLP_Bert | 6 | 55.9343 | 62.4353 | 72.8562 | nan | 36.5993 | 37.8611 | | attention_is_all_you_need_pytorch | 256 | 52.6727 | 58.3478 | 63.4174 | nan | 34.2092 | 35.3641 | | hf_Bart | 4 | 56.6586 | 67.9562 | 65.3671 | nan | 33.2975 | 34.6255 | | timm_vovnet | 32 | 35.3201 | 36.3271 | 37.1433 | 39.942 | 31.8685 | 33.666 | | pytorch_unet | 1 | 40.0003 | 189.7247 | nan | nan | 29.7373 | 30.414 | | hf_Albert | 8 | 68.2408 | 71.576 | 88.3044 | nan | 28.7158 | 29.4399 | | timm_efficientnet | 32 | 48.2739 | 57.4223 | 43.1565 | 69.0366 | 28.5985 | 48.0438 | | hf_GPT2 | 4 | 48.107 | 50.4504 | 60.1697 | 170.7383 | 25.4117 | 25.9578 | | shufflenet_v2_x1_0 | 128 | 40.4642 | 39.9604 | 41.6275 | 48.2496 | 25.256 | 32.7242 | | timm_resnest | 32 | 25.9622 | 23.985 | 29.4661 | 25.5635 | 22.1356 | 24.2638 | | hf_Bert | 4 | 40.0417 | 58.0247 | 44.1728 | nan | 21.0211 | 23.6026 | | hf_DistilBert | 8 | 31.1297 | 32.1121 | 42.0039 | 85.5861 | 20.575 | 20.9488 | | BERT_pytorch | 16 | 54.4796 | 65.1937 | 35.1523 | 81.3936 | 16.4286 | 23.7752 | | mnasnet1_0 | 32 | 29.2019 | 29.244 | 23.098 | 38.5959 | 16.0425 | 26.9683 | | mobilenet_v3_large | 32 | 35.4056 | 35.6105 | 23.882 | 48.3372 | 15.5623 | 31.0929 | | densenet121 | 4 | 73.3541 | 81.3631 | 29.2704 | 102.4318 | 15.0107 | 68.9028 | | resnext50_32x4d | 8 | 28.9636 | 30.2137 | 15.6831 | 39.9454 | 13.4466 | 29.1024 | | nvidia_deeprecommender | 256 | 10.3732 | 10.4216 | 14.8928 | 10.3006 | 11.1734 | 10.7785 | | pytorch_stargan | 16 | 16.0252 | 14.62 | 15.4472 | nan | 10.9382 | 11.4079 | | timm_vision_transformer | 8 | 29.793 | 34.952 | 16.692 | 50.7477 | 9.9242 | 19.5021 | | LearningToPaint | 96 | 14.7637 | 14.7998 | 12.9946 | 18.0347 | 8.6435 | 12.4784 | | alexnet | 128 | 9.7977 | 9.8324 | 12.0204 | 10.5963 | 8.068 | 8.0796 | | squeezenet1_1 | 32 | 15.1859 | 16.0194 | 10.151 | 20.9376 | 7.3303 | 
12.7457 | | pytorch_CycleGAN_and_pix2pix | 1 | 17.8756 | 18.0556 | 11.7184 | nan | 6.4434 | 10.8191 | | tts_angular | 64 | 6.3689 | 7.0702 | 6.5367 | 6.9431 | 6.3781 | 6.7248 | | resnet18 | 16 | 12.8767 | 12.9596 | 8.0712 | 16.5369 | 4.9761 | 10.6094 | | pytorch_struct | 200 | 4.5991 | 6.0905 | 4.5567 | 7.6431 | 2.2601 | 3.707 | | drq | 1 | 4.0354 | 4.8367 | 2.0255 | 7.6093 | 2.1748 | 3.3772 | | dcgan | 32 | 3.1898 | 3.3856 | 1.8889 | 4.4948 | 1.1001 | 3.0488 | | soft_actor_critic | 256 | 1.462 | 1.9535 | 1.1437 | 2.8265 | 0.8617 | 1.4227 | | lennard_jones | 1000 | 1.4943 | 2.2048 | 1.1918 | 3.2211 | 0.7344 | 1.4535 | | tacotron2 | 64 | 3157.1252 | 4035.4329 | 3078.105 | 4956.3166 | nan | 3559.7976 | | hf_GPT2_large | 4 | 209.9124 | 211.6972 | nan | nan | nan | 112.6711 | | dlrm | 1024 | 249.8139 | 232.01 | nan | 162.6481 | nan | nan | | functorch_dp_cifar10 | 64 | 13.9137 | 14.999 | 5.8615 | nan | nan | nan | | hf_BigBird | 2 | 199.5449 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | MobileBertForMaskedLM | 64 | 1.0195 | 0.8405 | 1.2984 | 0.0 | 2.8812 | 1.806 | | MT5ForConditionalGeneration | 16 | 1.0198 | 0.8804 | 1.2763 | 0.873 | 2.4071 | 2.0405 | | OPTForCausalLM | 2 | 0.9999 | 0.9343 | 0.0 | 0.7842 | 2.3958 | 2.3829 | | GPT2ForSequenceClassification | 4 | 1.0003 | 0.9768 | 0.0 | 0.5006 | 2.3125 | 2.2877 | | MobileBertForQuestionAnswering | 128 | 1.023 | 0.8228 | 0.9722 | 0.0 | 2.2185 | 1.7928 | | ElectraForQuestionAnswering | 64 | 1.0002 | 0.9785 | 0.7686 | 0.0 | 2.112 | 2.054 | | XLNetLMHeadModel | 8 | 0.9992 | 0.9722 | 0.0 | 0.0 | 1.9181 | 1.9215 | | LayoutLMForSequenceClassification | 16 | 1.0004 | 0.98 | 0.776 | 0.0 | 1.849 | 1.8057 | | RobertaForQuestionAnswering | 16 | 0.9998 | 0.979 | 0.7737 | 0.0 | 1.8297 | 1.7674 | | ElectraForCausalLM | 32 | 1.0011 | 0.9394 | 0.716 | 0.0 | 1.8235 | 1.8249 | | BertForQuestionAnswering | 16 | 1.0002 | 0.9799 | 0.763 | 0.0 | 1.8182 | 1.7748 | | M2M100ForConditionalGeneration | 16 | 1.0102 | 0.9011 | 1.0321 | 0.6913 | 1.7337 | 1.6287 | | XGLMForCausalLM | 8 | 1.0091 | 0.8351 | 1.0778 | 0.0 | 1.7331 | 1.5994 | | RobertaForCausalLM | 16 | 1.0001 | 0.9711 | 0.7619 | 0.0 | 1.6999 | 1.6842 | | DistillGPT2 | 16 | 0.9995 | 0.9696 | 0.7634 | 0.746 | 1.689 | 1.7166 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.8857 | 0.0 | 0.0 | 1.6665 | 1.6553 | | PLBartForConditionalGeneration | 4 | 1.0002 | 0.9541 | 0.7382 | 0.0 | 1.66 | 1.6548 | | AlbertForMaskedLM | 4 | 1.0005 | 0.8852 | 0.0 | 0.0 | 1.6542 | 1.642 | | LayoutLMForMaskedLM | 16 | 1.0006 | 0.971 | 0.7566 | 0.0 | 1.6238 | 1.6103 | | 
MegatronBertForQuestionAnswering | 8 | 0.9992 | 0.9843 | 0.7719 | 0.0 | 1.6227 | 1.59 |
| MegatronBertForCausalLM | 4 | 1.0164 | 0.9073 | 0.7609 | 0.0 | 1.6141 | 1.5716 |
| T5Small | 4 | 0.9991 | 0.9202 | 0.7575 | 1.1451 | 1.6123 | 1.5799 |
| T5ForConditionalGeneration | 4 | 0.9997 | 0.9175 | 0.7566 | 1.1555 | 1.601 | 1.5833 |
| BertForMaskedLM | 16 | 1.0003 | 0.97 | 0.7523 | 0.0 | 1.594 | 1.5844 |
| PLBartForCausalLM | 8 | 0.9992 | 0.9664 | 0.7633 | 0.9721 | 1.59 | 1.5703 |
| CamemBert | 16 | 1.0001 | 0.9726 | 0.7664 | 0.0 | 1.5356 | 1.5211 |
| DistilBertForQuestionAnswering | 256 | 0.9996 | 0.995 | 0.758 | 0.666 | 1.513 | 1.4882 |
| MBartForConditionalGeneration | 2 | 1.0037 | 0.9703 | 0.0 | 0.7951 | 1.4637 | 1.4343 |
| BartForConditionalGeneration | 2 | 1.0045 | 0.97 | 0.0 | 0.0 | 1.4602 | 1.4194 |
| MBartForCausalLM | 4 | 1.0001 | 0.9655 | 0.7582 | 0.9931 | 1.4345 | 1.4343 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0049 | 0.9179 | 0.7371 | 0.0 | 1.4117 | 1.4121 |
| BartForCausalLM | 4 | 1.0002 | 0.9692 | 0.7571 | 0.0 | 1.4044 | 1.4222 |
| Speech2Text2ForCausalLM | 256 | 0.9972 | 0.9494 | 0.6884 | 0.9056 | 1.3359 | 1.3607 |
| DebertaForMaskedLM | 4 | 0.8929 | 0.7272 | 0.7896 | 0.0 | 1.2933 | 1.1616 |
| PegasusForConditionalGeneration | 32 | 1.0026 | 0.9497 | 0.0 | 0.8043 | 1.2735 | 1.2617 |
| DebertaV2ForMaskedLM | 1 | 0.8793 | 0.6925 | 0.8407 | 0.0 | 1.2553 | 0.9099 |
| TrOCRForCausalLM | 32 | 1.0 | 0.9636 | 0.0 | 0.0 | 1.2505 | 1.259 |
| DistilBertForMaskedLM | 128 | 0.9998 | 0.9595 | 0.7148 | 0.6501 | 1.2224 | 1.2362 |
| BlenderbotSmallForCausalLM | 64 | 0.9993 | 0.9297 | 0.7195 | 0.0 | 1.2036 | 1.2133 |
| PegasusForCausalLM | 32 | 1.0017 | 0.9547 | 0.7587 | 0.8511 | 1.1632 | 1.1517 |
| DebertaForQuestionAnswering | 8 | 0.9744 | 0.8757 | 0.7232 | 0.0 | 1.1433 | 1.2276 |
| DebertaV2ForQuestionAnswering | 2 | 0.8832 | 0.7005 | 0.0 | 0.0 | 1.1403 | 0.9292 |
| BlenderbotForCausalLM | 4 | 1.0118 | 0.0 | 0.0 | 0.0 | 0.0 | 1.1708 |
| YituTechConvBert | 16 | 1.0002 | 0.9656 | 0.7907 | 0.0 | 0.0 | 0.0 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
| BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | pass |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| YituTechConvBert | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| DebertaV2ForQuestionAnswering | 2 | 8.8464 | 20.0029 | nan | nan | 151.5806 | 58.482 |
| DebertaV2ForMaskedLM | 1 | 8.9913 | 20.0424 | 80.3869 | nan | 150.919 | 58.5016 |
| XLNetLMHeadModel | 8 | 4.9781 | 22.0757 | nan | nan | 134.4824 | 146.8149 |
| DebertaForMaskedLM | 4 | 4.9308 | 11.5922 | 37.1335 | nan | 92.5237 | 40.1985 |
| DebertaForQuestionAnswering | 8 | 5.0849 | 11.6552 | 36.6555 | nan | 89.6065 | 40.8693 |
| XGLMForCausalLM | 8 | 3.193 | 15.1204 | 29.395 | nan | 79.7103 | 78.1222 |
| MobileBertForQuestionAnswering | 128 | 10.1409 | 35.6088 | 61.1285 | nan | 75.5178 | 73.0748 |
| MobileBertForMaskedLM | 64 | 9.8247 | 35.5404 | 61.4623 | nan | 73.9042 | 71.6121 |
| M2M100ForConditionalGeneration | 16 | 4.134 | 17.9491 | 29.1019 | 226.7622 | 61.3456 | 57.651 |
| MT5ForConditionalGeneration | 16 | 3.9349 | 14.5408 | 23.2585 | 127.9457 | 56.9939 | 55.8909 |
| PegasusForConditionalGeneration | 32 | 3.7021 | 18.7778 | nan | 244.8486 | 55.4728 | 51.7189 |
| BartForConditionalGeneration | 2 | 3.8892 | 19.1066 | nan | nan | 54.8155 | 52.55 |
| MBartForConditionalGeneration | 2 | 3.9723 | 19.5039 | nan | 268.7531 | 53.3136 | 52.2451 |
| MegatronBertForCausalLM | 4 | 4.0918 | 16.1925 | 23.8141 | nan | 43.3787 | 41.989 |
| MegatronBertForQuestionAnswering | 8 | 3.9748 | 16.0222 | 24.0943 | nan | 42.9941 | 41.7296 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.4983 | 12.7172 | 19.9494 | nan | 36.8351 | 35.6094 |
| T5ForConditionalGeneration | 4 | 2.8111 | 9.7759 | 14.6657 | 84.8891 | 32.1475 | 31.2307 |
| T5Small | 4 | 2.8243 | 9.7968 | 14.7263 | 84.7299 | 31.8039 | 31.1257 |
| PLBartForConditionalGeneration | 4 | 2.0907 | 10.0904 | 14.9111 | nan | 31.7864 | 31.1227 |
| LayoutLMForSequenceClassification | 16 | 2.2805 | 8.4864 | 12.1845 | nan | 29.489 | 28.9507 |
| ElectraForCausalLM | 32 | 1.9824 | 8.0446 | 12.0073 | nan | 28.8023 | 26.8734 |
| PegasusForCausalLM | 32 | 1.5392 | 7.4332 | 11.2559 | 99.1693 | 26.9561 | 25.3846 |
| MBartForCausalLM | 4 | 1.5971 | 7.7614 | 10.9305 | 102.2403 | 25.3545 | 24.7645 |
| BartForCausalLM | 4 | 1.5561 | 7.4157 | 10.5346 | nan | 24.7711 | 23.772 |
| LayoutLMForMaskedLM | 16 | 2.3268 | 9.2597 | 12.1738 | nan | 24.1466 | 23.3752 |
| TrOCRForCausalLM | 32 | 1.4802 | 7.2399 | nan | nan | 23.8667 | 22.9307 |
| RobertaForCausalLM | 16 | 1.9594 | 8.1874 | 11.7076 | nan | 23.2745 | 22.3619 |
| ElectraForQuestionAnswering | 64 | 1.974 | 8.0642 | 11.6738 | nan | 23.1543 | 22.2594 |
| BertForMaskedLM | 16 | 1.899 | 8.073 | 11.8064 | nan | 23.078 | 22.6862 |
| OPTForCausalLM | 2 | 1.5804 | 7.7929 | nan | 95.1854 | 22.8269 | 22.3784 |
| BertForQuestionAnswering | 16 | 1.945 | 7.9986 | 11.4815 | nan | 22.4412 | 21.7203 |
| CamemBert | 16 | 1.9578 | 8.0072 | 11.7 | nan | 22.2229 | 21.8035 |
| RobertaForQuestionAnswering | 16 | 2.004 | 8.0594 | 11.5998 | nan | 21.4128 | 20.4034 |
| GPT2ForSequenceClassification | 4 | 1.737 | 7.663 | nan | 84.4803 | 20.5616 | 20.3485 |
| AlbertForMaskedLM | 4 | 1.7524 | 7.4306 | nan | nan | 19.7874 | 19.3802 |
| AlbertForQuestionAnswering | 4 | 1.721 | 7.5682 | nan | nan | 19.4674 | 18.9224 |
| BlenderbotSmallForCausalLM | 64 | 0.9951 | 4.811 | 7.0232 | nan | 17.7639 | 17.3552 |
| Speech2Text2ForCausalLM | 256 | 0.8767 | 3.8465 | 5.8103 | 50.8379 | 15.4401 | 14.1723 |
| PLBartForCausalLM | 8 | 0.871 | 3.8848 | 5.6416 | 66.312 | 14.9835 | 14.9343 |
| DistillGPT2 | 16 | 0.9367 | 3.6925 | 5.5332 | 45.9192 | 14.3946 | 14.0881 |
| DistilBertForMaskedLM | 128 | 0.8163 | 3.8973 | 6.3275 | 57.258 | 13.1703 | 13.0914 |
| DistilBertForQuestionAnswering | 256 | 0.8277 | 3.931 | 6.3009 | 54.661 | 12.5363 | 12.3625 |
| BlenderbotForCausalLM | 4 | 2.8445 | nan | nan | nan | nan | 44.1084 |
| YituTechConvBert | 16 | 2.771 | 11.9692 | 18.09 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9997 | 0.9183 | nan | 1.2641 | 1.2906 | 1.345 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0992 | 1.5169 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | nan | 1.2229 | 1.0775 | 1.1712 |
| MBartForCausalLM | 4 | 1.0 | 0.8998 | 0.3747 | 1.3748 | 1.0747 | 1.1342 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0568 | 1.1144 |
| XLNetLMHeadModel | 8 | 0.9999 | 0.9214 | nan | nan | 1.0303 | 1.0303 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 |
| MBartForConditionalGeneration | 2 | 1.0 | 0.9035 | nan | 1.3227 | 1.0148 | 1.2186 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 |
| PegasusForConditionalGeneration | 32 | 0.9979 | 0.9511 | nan | 1.2087 | 1.0039 | 1.1394 |
| RobertaForQuestionAnswering | 16 | 1.004 | 0.9315 | 0.3619 | nan | 1.0036 | 1.0618 |
| BertForQuestionAnswering | 16 | 1.004 | 0.9312 | 0.3618 | nan | 1.0029 | 1.0617 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9976 | 1.1976 |
| DistilBertForQuestionAnswering | 256 | 1.0112 | 0.9568 | 0.3185 | 1.1483 | 0.9806 | 1.0864 |
| DistillGPT2 | 16 | 1.0 | 0.8673 | 0.3597 | 1.1412 | 0.9755 | 1.0618 |
| PegasusForCausalLM | 32 | 0.9749 | 0.8906 | 0.4175 | 1.1321 | 0.9708 | 1.0363 |
| PLBartForConditionalGeneration | 4 | 0.9997 | 0.9325 | 0.3746 | nan | 0.9651 | 1.0848 |
| T5Small | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9635 | 1.1856 |
| T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9635 | 1.1856 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9593 | 1.1105 |
| MegatronBertForQuestionAnswering | 8 | 1.0006 | 0.9101 | 0.3721 | nan | 0.9562 | 1.0239 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9481 | 0.9848 |
| BertForMaskedLM | 16 | 1.0001 | 0.9237 | 0.3656 | nan | 0.9481 | 0.985 |
| RobertaForCausalLM | 16 | 1.0 | 0.9237 | 0.3654 | nan | 0.9475 | 0.9847 |
| CamemBert | 16 | 1.0 | 0.9212 | 0.3657 | nan | 0.9446 | 0.983 |
| TrOCRForCausalLM | 32 | 0.9998 | 0.8789 | nan | nan | 0.9345 | 1.0129 |
| MT5ForConditionalGeneration | 16 | 1.0015 | 0.864 | 0.415 | 1.0159 | 0.9203 | 1.0032 |
| PLBartForCausalLM | 8 | 0.9999 | 0.8707 | 0.3624 | 1.0907 | 0.9166 | 0.989 |
| MegatronBertForCausalLM | 4 | 1.0 | 0.8798 | 0.3875 | nan | 0.9121 | 1.0221 |
| DistilBertForMaskedLM | 128 | 1.0 | 0.8497 | 0.3516 | 1.0867 | 0.8716 | 0.9439 |
| M2M100ForConditionalGeneration | 16 | 0.9838 | 0.9478 | 0.414 | 1.0813 | 0.8688 | 1.0524 |
| Speech2Text2ForCausalLM | 256 | 0.9668 | 0.8156 | 0.3505 | 1.0447 | 0.8672 | 0.9793 |
| ElectraForCausalLM | 32 | 0.9977 | 0.8464 | 0.3928 | nan | 0.856 | 0.9327 |
| BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.846 | 0.9426 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9164 | 0.4336 | nan | 0.8055 | 0.9902 |
| MobileBertForMaskedLM | 64 | 0.9999 | 0.8791 | 0.3355 | nan | 0.6698 | 0.9649 |
| DebertaV2ForMaskedLM | 1 | 0.9982 | 0.941 | 0.4917 | nan | 0.6117 | 0.9912 |
| MobileBertForQuestionAnswering | 128 | 1.0159 | 1.0063 | 0.306 | nan | 0.5988 | 0.8126 |
| DebertaV2ForQuestionAnswering | 2 | 0.9796 | 0.9796 | nan | nan | 0.5266 | 0.9885 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9825 | 0.3624 | nan | 0.409 | 1.0674 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3251 | nan | 0.3071 | 1.1614 |
| BlenderbotForCausalLM | 4 | 1.0002 | nan | nan | nan | nan | 0.9343 |
| YituTechConvBert | 16 | 0.9954 | 0.9173 | 0.3774 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForMaskedLM | 4 | 266.5342 | 301.2787 | nan | nan | 161.6648 | 162.7841 |
| AlbertForQuestionAnswering | 4 | 264.5356 | 298.9473 | nan | nan | 159.2134 | 160.1467 |
| XLNetLMHeadModel | 8 | 275.9314 | 283.9247 | nan | nan | 143.6583 | 143.0492 |
| PegasusForConditionalGeneration | 32 | 139.7073 | 148.3993 | nan | 175.3888 | 111.3106 | 112.5868 |
| TrOCRForCausalLM | 32 | 137.317 | 142.3816 | nan | nan | 109.845 | 108.9129 |
| DebertaV2ForQuestionAnswering | 2 | 117.4926 | 146.9378 | nan | nan | 94.8276 | 115.3437 |
| BartForConditionalGeneration | 2 | 135.5821 | 140.857 | nan | nan | 94.2365 | 96.0518 |
| MBartForConditionalGeneration | 2 | 135.799 | 141.1339 | nan | 171.5246 | 93.0621 | 94.9823 |
| MegatronBertForQuestionAnswering | 8 | 141.0728 | 143.4811 | 182.5566 | nan | 86.9118 | 88.6431 |
| DebertaV2ForMaskedLM | 1 | 113.6275 | 143.4524 | 137.3796 | nan | 84.0926 | 115.5339 |
| MobileBertForQuestionAnswering | 128 | 175.0953 | 216.3905 | 181.6682 | nan | 83.2161 | 107.4741 |
| BartForCausalLM | 4 | 112.4783 | 114.4048 | 148.226 | nan | 78.8063 | 78.8703 |
| BlenderbotSmallForConditionalGeneration | 64 | 108.8882 | 120.7117 | 150.1344 | nan | 78.4867 | 78.3301 |
| MBartForCausalLM | 4 | 111.0796 | 117.0357 | 148.2934 | 113.0027 | 78.3315 | 78.3146 |
| CamemBert | 16 | 118.0249 | 121.3935 | 154.186 | nan | 76.919 | 77.6523 |
| PLBartForCausalLM | 8 | 110.5622 | 114.2095 | 145.4973 | 115.4672 | 71.1487 | 70.4872 |
| PLBartForConditionalGeneration | 4 | 116.1015 | 121.7005 | 157.448 | nan | 70.0319 | 70.4969 |
| OPTForCausalLM | 2 | 165.8489 | 177.7152 | nan | 208.6996 | 69.1706 | 69.6686 |
| LayoutLMForMaskedLM | 16 | 112.0444 | 115.5636 | 148.1843 | nan | 69.0135 | 69.6334 |
| DistilBertForMaskedLM | 128 | 84.1893 | 87.5963 | 117.7901 | 129.3457 | 68.8721 | 68.0848 |
| BertForMaskedLM | 16 | 109.4709 | 112.8039 | 145.4095 | nan | 68.7125 | 69.1551 |
| DistilBertForQuestionAnswering | 256 | 102.9628 | 103.4867 | 135.8039 | 154.4087 | 68.2058 | 69.4155 |
| RobertaForCausalLM | 16 | 114.5732 | 117.9768 | 150.3753 | nan | 67.3779 | 67.9643 |
| M2M100ForConditionalGeneration | 16 | 108.3086 | 122.4269 | 123.0082 | 157.4185 | 66.3163 | 70.8518 |
| DebertaForQuestionAnswering | 8 | 77.1806 | 85.7354 | 103.8679 | nan | 65.6935 | 61.186 |
| MobileBertForMaskedLM | 64 | 174.9815 | 223.2517 | 145.4105 | nan | 64.3476 | 108.157 |
| T5ForConditionalGeneration | 4 | 101.3651 | 110.5114 | 134.4821 | 87.3665 | 63.4068 | 63.9054 |
| T5Small | 4 | 101.4598 | 110.0745 | 135.3626 | 88.5956 | 63.1278 | 63.8942 |
| DistillGPT2 | 16 | 105.779 | 108.985 | 138.6098 | 141.5872 | 62.593 | 61.6065 |
| PegasusForCausalLM | 32 | 69.1041 | 71.9375 | 91.6237 | 81.6646 | 59.7899 | 59.7054 |
| ElectraForQuestionAnswering | 64 | 115.8181 | 117.0846 | 150.3824 | nan | 54.2835 | 55.8529 |
| MegatronBertForCausalLM | 4 | 85.2551 | 97.2273 | 114.7396 | nan | 54.1583 | 56.0253 |
| XGLMForCausalLM | 8 | 87.4282 | 106.499 | 94.2105 | nan | 52.5831 | 58.39 |
| LayoutLMForSequenceClassification | 16 | 97.2172 | 99.2644 | 125.2649 | nan | 52.5265 | 53.8325 |
| BertForQuestionAnswering | 16 | 94.7597 | 96.5679 | 124.3364 | nan | 52.087 | 53.3456 |
| RobertaForQuestionAnswering | 16 | 95.2625 | 97.2004 | 122.9579 | nan | 52.0251 | 53.7951 |
| DebertaForMaskedLM | 4 | 68.126 | 83.952 | 78.4493 | nan | 49.8782 | 56.161 |
| BlenderbotSmallForCausalLM | 64 | 58.6072 | 63.0819 | 81.6593 | nan | 48.376 | 48.1142 |
| ElectraForCausalLM | 32 | 87.2429 | 92.7679 | 122.0001 | nan | 47.87 | 47.8247 |
| MT5ForConditionalGeneration | 16 | 93.2274 | 108.9031 | 85.9873 | 108.3409 | 40.32 | 46.885 |
| Speech2Text2ForCausalLM | 256 | 53.0678 | 56.2322 | 77.1129 | 58.5857 | 39.6997 | 38.9137 |
| GPT2ForSequenceClassification | 4 | 90.6017 | 93.3373 | nan | 181.2177 | 39.1998 | 40.1823 |
| BlenderbotForCausalLM | 4 | 90.9838 | nan | nan | nan | nan | 79.183 |
| YituTechConvBert | 16 | 133.0492 | 137.5583 | 168.554 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
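
For readers who want to relate the absolute latencies to the speedup columns: the reported speedups appear to be close to eager latency divided by backend latency, though the two tables do not match exactly. That relationship is inferred from the numbers here and is not stated by the dashboard. The sketch below recomputes the ratio for a few rows copied from the absolute latency table above and summarizes them with a geometric mean, which is one common way to average ratios and is likewise an assumption. The names `speedup` and `latencies_ms` are hypothetical helpers, not part of any benchmark script.

```python
import math
from statistics import geometric_mean


def speedup(eager_ms, backend_ms):
    """Return eager_ms / backend_ms, or nan when a measurement is missing."""
    if math.isnan(eager_ms) or math.isnan(backend_ms) or backend_ms <= 0:
        return float("nan")
    return eager_ms / backend_ms


# (eager, inductor) absolute latencies in ms, copied from the table above.
latencies_ms = {
    "MegatronBertForQuestionAnswering": (141.0728, 86.9118),
    "T5Small": (101.4598, 63.1278),
    "BlenderbotForCausalLM": (90.9838, float("nan")),  # no inductor measurement
}

speedups = {name: speedup(e, b) for name, (e, b) in latencies_ms.items()}

# Rows without a measurement are excluded from the summary (an assumption).
valid = [s for s in speedups.values() if not math.isnan(s)]
geomean = geometric_mean(valid)

for name, s in speedups.items():
    print(f"{name}: {s:.4f}")
print(f"geomean over {len(valid)} measured models: {geomean:.4f}")
```

For MegatronBertForQuestionAnswering this ratio comes out near the 1.6227 listed in the speedup table; small differences are expected if the two tables come from separate measurements.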

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| xcit_large_24_p8_224 | 5 | 1.0007 | 0.0 | 0.0 | 0.0 | 2.2706 | 1.8059 |
| tnt_s_patch16_224 | 128 | 0.9999 | 0.9974 | 0.0 | 0.0 | 1.9909 | 1.962 |
| twins_pcpvt_base | 64 | 1.0077 | 0.9228 | 0.9168 | 0.0 | 1.8912 | 1.7121 |
| coat_lite_mini | 128 | 0.9996 | 0.9956 | 0.8475 | 1.1531 | 1.7953 | 1.7597 |
| ghostnet_100 | 128 | 1.0042 | 0.982 | 0.8857 | 1.013 | 1.6093 | 1.4498 |
| volo_d1_224 | 64 | 0.9998 | 0.9936 | 0.8465 | 0.0 | 1.5973 | 1.5638 |
| gmixer_24_224 | 128 | 0.9999 | 0.8804 | 0.722 | 0.9254 | 1.5531 | 1.5081 |
| gmlp_s16_224 | 128 | 0.9999 | 0.995 | 0.7869 | 1.0149 | 1.5382 | 1.5087 |
| lcnet_050 | 128 | 0.9675 | 0.9489 | 0.8536 | 1.0373 | 1.4997 | 1.3401 |
| cait_m36_384 | 4 | 1.0 | 0.8741 | 0.0 | 0.0 | 1.4779 | 1.4197 |
| swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9607 | 0.0 | 0.0 | 1.4692 | 1.4582 |
| jx_nest_base | 32 | 0.9996 | 0.9905 | 0.8026 | 0.0 | 1.3955 | 1.3571 |
| convit_base | 64 | 0.9999 | 0.9964 | 0.8336 | 1.2343 | 1.3601 | 1.3325 |
| regnety_002 | 128 | 0.9782 | 0.9488 | 1.1681 | 0.8578 | 1.3272 | 1.2207 |
| crossvit_9_240 | 128 | 1.0 | 0.9947 | 0.838 | 0.9179 | 1.3245 | 1.2996 |
| dm_nfnet_f0 | 128 | 0.9986 | 0.9982 | 0.8773 | 0.9211 | 1.2907 | 1.2661 |
| deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9918 | 0.7973 | 0.9725 | 1.2828 | 1.2632 |
| pit_b_224 | 64 | 0.9998 | 0.9955 | 0.8216 | 0.9724 | 1.2775 | 1.2747 |
| mixer_b16_224 | 128 | 1.0001 | 0.9976 | 0.8032 | 0.9016 | 1.2764 | 1.2679 |
| nfnet_l0 | 128 | 0.9995 | 0.8101 | 0.7134 | 0.8502 | 1.2756 | 1.2228 |
| beit_base_patch16_224 | 64 | 0.9998 | 0.9786 | 0.0 | 0.0 | 1.2101 | 1.2011 |
| convnext_base | 64 | 0.9998 | 0.9956 | 0.8009 | 0.0 | 1.2012 | 1.1595 |
| hrnet_w18 | 128 | 1.0032 | 1.0264 | 0.8666 | 0.0 | 1.1711 | 1.2181 |
| gluon_inception_v3 | 128 | 0.9999 | 0.9959 | 0.8541 | 1.1433 | 1.1354 | 1.0774 |
| pnasnet5large | 16 | 1.006 | 1.032 | 0.8485 | 0.0 | 1.1291 | 1.0541 |
| vit_base_patch16_224 | 64 | 0.9999 | 0.9934 | 0.8349 | 0.8975 | 1.1276 | 1.1126 |
| resmlp_12_224 | 128 | 1.0001 | 0.9985 | 0.7826 | 1.4887 | 1.1163 | 1.0993 |
| inception_v3 | 128 | 0.9999 | 0.9961 | 0.854 | 1.1443 | 1.0953 | 1.0767 |
| mnasnet_100 | 128 | 0.9534 | 0.945 | 0.7876 | 1.196 | 1.0911 | 1.0656 |
| fbnetc_100 | 128 | 0.9532 | 0.9444 | 0.7943 | 1.1635 | 1.0818 | 0.9102 |
| adv_inception_v3 | 128 | 1.0 | 0.996 | 0.8546 | 1.1442 | 1.0809 | 1.101 |
| mobilenetv3_large_100 | 128 | 0.9559 | 0.9442 | 0.7834 | 0.9786 | 1.0697 | 1.0811 |
| fbnetv3_b | 128 | 0.9531 | 0.9411 | 0.7883 | 0.0 | 1.025 | 1.0531 |
| spnasnet_100 | 128 | 0.9462 | 0.9375 | 0.7784 | 1.0878 | 1.0114 | 1.0011 |
| mixnet_l | 128 | 0.9794 | 0.9054 | 0.794 | 0.0 | 0.9937 | 0.9623 |
| tf_mixnet_l | 128 | 0.9806 | 0.9068 | 0.7952 | 0.0 | 0.9905 | 1.0124 |
| poolformer_m36 | 64 | 0.9997 | 0.9983 | 0.807 | 0.0 | 0.9679 | 0.9641 |
| selecsls42b | 128 | 1.0 | 0.995 | 0.8421 | 1.2833 | 0.898 | 0.9042 |
| tf_efficientnet_b0 | 128 | 0.9632 | 0.8068 | 0.666 | 0.9483 | 0.8699 | 0.8251 |
| mobilenetv2_100 | 128 | 0.9526 | 0.9424 | 0.7232 | 1.0773 | 0.8388 | 0.653 |
| res2net101_26w_4s | 64 | 1.0027 | 1.0061 | 0.9706 | 0.0 | 0.8346 | 0.7918 |
| cspdarknet53 | 64 | 0.9436 | 0.9357 | 0.757 | 1.1418 | 0.8245 | 0.8131 |
| gernet_l | 128 | 0.9479 | 0.9383 | 0.7681 | 1.0621 | 0.8221 | 0.7974 |
| dla102 | 128 | 1.0002 | 0.9963 | 0.8389 | 1.3149 | 0.8091 | 0.8273 |
| tinynet_a | 128 | 0.9641 | 0.8063 | 0.6852 | 0.7833 | 0.8021 | 0.7973 |
| resnest101e | 64 | 0.9995 | 0.9913 | 0.8121 | 0.0 | 0.7811 | 0.8121 |
| ese_vovnet19b_dw | 128 | 0.9709 | 0.9648 | 0.768 | 1.1199 | 0.7736 | 0.656 |
| dpn107 | 32 | 0.9399 | 0.931 | 0.7656 | 0.0 | 0.7689 | 0.744 |
| repvgg_a2 | 128 | 0.9439 | 0.9351 | 0.7995 | 1.0689 | 0.7386 | 0.7585 |
| convmixer_768_32 | 32 | 0.9998 | 0.998 | 0.923 | 0.0 | 0.7354 | 0.7056 |
| mobilevit_s | 64 | 0.9738 | 0.8148 | 0.6565 | 0.0 | 0.7303 | 0.795 |
| gluon_xception65 | 32 | 0.9997 | 0.9888 | 0.7551 | 0.0 | 0.7168 | 0.7179 |
| visformer_small | 128 | 0.9995 | 1.0011 | 0.8431 | 0.0 | 0.7162 | 0.6934 |
| eca_botnext26ts_256 | 128 | 0.9811 | 0.8104 | 0.6697 | 1.0689 | 0.715 | 0.6404 |
| botnet26t_256 | 128 | 0.9797 | 0.9737 | 0.8127 | 1.2799 | 0.7091 | 0.6439 |
| rexnet_100 | 128 | 0.9647 | 0.8503 | 0.6897 | 0.0 | 0.703 | 0.7879 |
| swsl_resnext101_32x16d | 32 | 0.9991 | 0.9796 | 0.8085 | 0.0 | 0.7008 | 0.7167 |
| res2net50_14w_8s | 128 | 0.9998 | 0.9929 | 0.8104 | 0.996 | 0.6981 | 0.6538 |
| sebotnet33ts_256 | 64 | 0.9668 | 0.837 | 0.6804 | 0.97 | 0.6973 | 0.6548 |
| res2next50 | 128 | 0.9995 | 0.9958 | 0.8315 | 1.1456 | 0.5878 | 0.5965 |
| eca_halonext26ts | 128 | 0.9815 | 0.8176 | 0.6795 | 0.0 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| cait_m36_384 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_accuracy |
| ghostnet_100 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| resnest101e | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------------+-------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 7.0019 | 34.0859 | 61.1018 | nan | 210.8754 | 205.0929 |
| pnasnet5large | 16 | 5.4335 | 24.9212 | 43.8375 | nan | 205.9631 | 196.4466 |
| res2net50_14w_8s | 128 | 3.1886 | 16.6381 | 26.6844 | 326.7123 | 202.0239 | 201.0507 |
| ghostnet_100 | 128 | 3.2831 | 10.7883 | 15.5518 | 191.4486 | 198.5365 | 196.9742 |
| res2net101_26w_4s | 64 | 3.541 | 18.8519 | 30.4733 | nan | 139.7464 | 136.6911 |
| dpn107 | 32 | 4.1156 | 14.9333 | 40.4321 | nan | 134.65 | 133.7754 |
| twins_pcpvt_base | 64 | 3.0042 | 16.858 | 27.426 | nan | 127.0139 | 125.1731 |
| rexnet_100 | 128 | 2.0876 | 8.1731 | 17.7138 | nan | 123.9916 | 122.0088 |
| mobilevit_s | 64 | 2.098 | 8.6403 | 15.9353 | nan | 121.1379 | 118.8476 |
| fbnetv3_b | 128 | 3.4714 | 12.7296 | 29.777 | nan | 116.8415 | 115.6619 |
| resnest101e | 64 | 3.6324 | 17.832 | 29.4294 | nan | 115.7902 | 114.1318 |
| mixnet_l | 128 | 5.655 | 13.6811 | 27.3875 | nan | 106.1013 | 103.2616 |
| tf_mixnet_l | 128 | 6.069 | 14.2529 | 27.6586 | nan | 106.0171 | 103.4148 |
| tinynet_a | 128 | 2.2908 | 9.1652 | 21.1879 | 195.095 | 103.095 | 102.9649 |
| adv_inception_v3 | 128 | 1.867 | 9.392 | 14.315 | 181.9132 | 102.5979 | 99.8231 |
| inception_v3 | 128 | 1.8126 | 9.4124 | 14.5251 | 180.8219 | 101.8079 | 99.5829 |
| gluon_inception_v3 | 128 | 1.8757 | 9.6297 | 14.3674 | 180.0502 | 100.9578 | 100.0784 |
| fbnetc_100 | 128 | 2.1573 | 7.3345 | 17.67 | 133.7509 | 94.7925 | 94.6452 |
| dla102 | 128 | 2.0782 | 10.3608 | 16.4176 | 240.4555 | 93.2812 | 91.3157 |
| xcit_large_24_p8_224 | 5 | 3.4624 | nan | nan | nan | 88.3026 | 86.1817 |
| poolformer_m36 | 64 | 1.8056 | 8.3375 | 13.1055 | nan | 87.4972 | 84.1626 |
| cspdarknet53 | 64 | 2.4392 | 8.1963 | 19.5916 | 143.2218 | 87.1988 | 85.7529 |
| spnasnet_100 | 128 | 2.1441 | 7.1485 | 17.3709 | 135.3 | 87.1704 | 86.2019 |
| mobilenetv3_large_100 | 128 | 1.6982 | 6.2765 | 13.7831 | 140.5036 | 86.6771 | 84.9684 |
| res2next50 | 128 | 1.7728 | 9.2607 | 13.8623 | 200.4742 | 86.4017 | 83.2473 |
| tf_efficientnet_b0 | 128 | 2.0222 | 8.2422 | 16.7705 | 180.0918 | 82.9539 | 81.6664 |
| swin_base_patch4_window7_224 | 64 | 3.196 | 14.2808 | nan | nan | 82.5585 | 80.7753 |
| sebotnet33ts_256 | 64 | 1.8817 | 6.6452 | 13.7656 | 153.9544 | 79.6451 | 78.1502 |
| gluon_xception65 | 32 | 2.2672 | 12.3524 | 19.4815 | nan | 77.121 | 74.0297 |
| mobilenetv2_100 | 128 | 1.7578 | 6.1116 | 13.824 | 117.3896 | 75.9436 | 74.4472 |
| cait_m36_384 | 4 | 3.8238 | 22.2932 | nan | nan | 74.9145 | 72.0258 |
| swsl_resnext101_32x16d | 32 | 2.0367 | 10.5176 | 15.8604 | nan | 72.9356 | 70.9354 |
| mnasnet_100 | 128 | 1.6895 | 5.7946 | 13.9225 | 108.2745 | 72.5878 | 72.4533 |
| coat_lite_mini | 128 | 1.3023 | 6.0531 | 8.9556 | 112.0665 | 71.7501 | 71.458 |
| regnety_002 | 128 | 1.7468 | 6.2764 | 14.109 | 118.4069 | 70.1006 | 68.0921 |
| convnext_base | 64 | 1.6235 | 7.772 | 12.2267 | nan | 70.045 | 69.2018 |
| jx_nest_base | 32 | 2.0289 | 10.4307 | 16.2418 | nan | 68.8585 | 66.6687 |
| dm_nfnet_f0 | 128 | 2.3139 | 7.9201 | 11.9115 | 152.5649 | 68.5938 | 67.6936 |
| visformer_small | 128 | 1.067 | 4.5508 | 6.6784 | nan | 61.5739 | 60.5114 |
| eca_botnext26ts_256 | 128 | 1.5283 | 5.6986 | 11.3663 | 121.3027 | 61.228 | 59.363 |
| botnet26t_256 | 128 | 1.3525 | 4.8033 | 9.4619 | 94.0761 | 57.7166 | 57.0207 |
| ese_vovnet19b_dw | 128 | 1.1052 | 3.5269 | 7.1184 | 68.9258 | 54.2245 | 54.3665 |
| selecsls42b | 128 | 0.8373 | 4.5431 | 6.1907 | 90.6675 | 53.8484 | 53.595 |
| lcnet_050 | 128 | 1.1452 | 3.7134 | 7.8529 | 81.3022 | 52.6519 | 51.5435 |
| gernet_l | 128 | 2.1608 | 6.8083 | 15.7821 | 110.5716 | 51.9248 | 50.9937 |
| nfnet_l0 | 128 | 1.9712 | 7.9214 | 11.6792 | 140.9194 | 51.3413 | 49.7969 |
| gmlp_s16_224 | 128 | 1.3789 | 8.3075 | 13.1125 | 171.9172 | 47.101 | 46.005 |
| crossvit_9_240 | 128 | 1.8285 | 9.5669 | 14.4626 | 173.879 | 44.1286 | 41.3243 |
| volo_d1_224 | 64 | 1.4978 | 8.4791 | 13.1631 | nan | 43.3802 | 42.6274 |
| tnt_s_patch16_224 | 128 | 2.0545 | 12.3743 | nan | nan | 42.3184 | 40.9906 |
| gmixer_24_224 | 128 | 1.6745 | 9.3841 | 14.6899 | 168.224 | 38.1232 | 37.6938 |
| repvgg_a2 | 128 | 2.0932 | 6.6254 | 16.0178 | 188.038 | 37.8889 | 36.6284 |
| convmixer_768_32 | 32 | 1.3987 | 7.1501 | 10.8134 | nan | 34.2829 | 33.3949 |
| convit_base | 64 | 1.3253 | 6.7303 | 10.2747 | 136.7917 | 31.6566 | 30.7577 |
| mixer_b16_224 | 128 | 0.9119 | 4.25 | 6.4593 | 82.3616 | 26.3143 | 25.3461 |
| resmlp_12_224 | 128 | 0.7699 | 3.5796 | 5.2174 | 54.0362 | 25.3189 | 23.7247 |
| deit_base_distilled_patch16_224 | 64 | 0.966 | 5.16 | 7.8577 | 88.4417 | 25.2715 | 23.7537 |
| pit_b_224 | 64 | 1.2358 | 5.9546 | 9.1289 | 104.0795 | 25.2623 | 24.8694 |
| vit_base_patch16_224 | 64 | 1.278 | 5.221 | 7.8413 | 90.3996 | 24.0102 | 23.2662 |
| beit_base_patch16_224 | 64 | 1.5542 | 6.4013 | nan | nan | 23.5489 | 23.0905 |
| eca_halonext26ts | 128 | 1.5681 | 5.9375 | 11.123 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 |
| gmlp_s16_224 | 128 | 0.9938 | 0.9715 | 0.3561 | 1.3557 | 1.284 | 1.2997 |
| tinynet_a | 128 | 0.9889 | 0.7884 | 0.2766 | 0.4726 | 1.1633 | 1.4912 |
| tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0842 | 1.1492 |
| pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3633 | nan | 1.0577 | 1.2943 |
| convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0528 | 1.1534 |
| mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.045 | 1.3028 |
| volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.038 | 1.1389 |
| rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.0011 | 1.2582 |
| beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 |
| pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9907 | 1.2271 |
| poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 0.9796 | 0.9842 |
| twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9745 | 1.0806 |
| resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 0.9567 | 1.1357 |
| tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2877 | nan | 0.9484 | 1.057 |
| convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3454 | nan | 0.9464 | 0.9678 |
| dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 0.9295 | 1.0969 |
| cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.9289 | 0.9803 |
| xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9288 | 0.9904 |
| tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.548 | 0.9185 | 1.2283 |
| nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.268 | 0.3766 | 0.9137 | 1.123 |
| mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.9088 | 0.9818 |
| visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9066 | 0.9846 |
| mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 0.8962 | 1.1046 |
| vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 1.222 | 0.8916 | 0.8968 |
| deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8911 | 0.8962 |
| mixnet_l | 128 | 0.9902 | 0.8441 | 0.2716 | nan | 0.8815 | 0.98 |
| eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2672 | 0.4761 | 0.8765 | 1.1944 |
| dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9309 | 0.8723 | 1.0161 |
| fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.8648 | 1.0056 |
| adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8599 | 0.9862 |
| gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8599 | 0.9862 |
| inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8599 | 0.9862 |
| swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | nan | 0.852 | 0.9728 |
| dpn107 | 32 | 0.997 | 0.9097 | 0.3531 | nan | 0.8455 | 0.944 |
| gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8442 | 0.965 |
| cspdarknet53 | 64 | 0.9913 | 0.8405 | 0.3241 | 0.8382 | 0.8369 | 0.9121 |
| crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.8174 | 1.0976 |
| res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8146 | 0.9442 |
| resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.8092 | 0.8239 |
| ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.8041 | 1.0135 |
| convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.8022 | 1.0059 |
| selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8765 | 0.7927 | 0.9534 |
| spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.787 | 0.9293 |
| coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.7834 | 1.0066 |
| mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.7727 | 0.9234 |
| res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.7713 | 0.9528 |
| ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.7706 | 1.0052 |
| res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.7697 | 0.9414 |
| hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.7605 | 0.9421 |
| swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 |
| mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.75 | 0.9634 |
| sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3213 | 0.5513 | 0.7318 | 0.8133 |
| gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8161 | 0.7239 | 0.9336 |
| fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.7101 | 0.9306 |
| lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | 0.8188 | 0.6955 | 0.8352 |
| jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6668 | 0.8553 |
| botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.6615 | 0.9434 |
| regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.5858 | 0.8993 |
| repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5572 | 0.8383 |
| eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.2673 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| convmixer_768_32 | 32 | 296.7697 | 297.1683 | 321.5114 | nan | 403.5664 | 420.5966 |
| hrnet_w18 | 128 | 294.9829 | 291.1056 | 345.6694 | nan | 255.0965 | 247.3179 |
| res2next50 | 128 | 138.7817 | 139.0436 | 166.2847 | 120.6382 | 235.4673 | 232.5186 |
| dla102 | 128 | 178.5977 | 179.2824 | 213.221 | 135.9005 | 221.2803 | 215.9348 |
| resnest101e | 64 | 163.5828 | 165.6997 | 199.9947 | nan | 208.7786 | 203.2018 |
| res2net50_14w_8s | 128 | 146.0897 | 147.1109 | 179.9084 | 146.3919 | 208.7475 | 234.16 |
| pnasnet5large | 16 | 216.8238 | 211.9809 | 259.2211 | nan | 202.1441 | 212.1246 |
| tf_mixnet_l | 128 | 194.9132 | 210.7685 | 240.387 | nan | 193.2213 | 189.2408 |
| mixnet_l | 128 | 186.8672 | 202.3692 | 230.2744 | nan | 184.0206 | 190.2624 |
| tnt_s_patch16_224 | 128 | 364.0326 | 364.7855 | nan | nan | 182.7318 | 185.3667 |
| swsl_resnext101_32x16d | 32 | 117.9094 | 119.9779 | 145.8708 | nan | 168.8316 | 163.9821 |
| poolformer_m36 | 64 |
149.1487 | 149.3387 | 184.445 | nan | 154.1478 | 154.4627 | | eca_botnext26ts_256 | 128 | 112.1774 | 135.6328 | 164.2932 | 102.8068 | 153.7677 | 171.697 | | res2net101_26w_4s | 64 | 120.1867 | 128.7236 | 126.4067 | nan | 149.7372 | 165.5082 | | adv_inception_v3 | 128 | 161.0881 | 162.0257 | 188.8679 | 140.8068 | 149.4055 | 146.6764 | | inception_v3 | 128 | 160.953 | 161.9174 | 188.5273 | 140.3555 | 147.0034 | 149.4485 | | botnet26t_256 | 128 | 106.0589 | 106.705 | 127.8482 | 81.2181 | 146.5526 | 161.4843 | | gluon_inception_v3 | 128 | 161.5356 | 162.0542 | 188.9401 | 141.1717 | 142.0415 | 149.5255 | | dpn107 | 32 | 113.9887 | 115.6489 | 143.7642 | nan | 140.3244 | 144.61 | | visformer_small | 128 | 98.3022 | 97.9753 | 116.707 | nan | 136.9645 | 141.5701 | | gluon_xception65 | 32 | 98.0844 | 99.067 | 129.5096 | nan | 136.7994 | 136.1166 | | convit_base | 64 | 181.4558 | 181.9079 | 217.5441 | 146.8938 | 133.4562 | 136.0336 | | rexnet_100 | 128 | 91.1575 | 103.1981 | 127.247 | nan | 125.1931 | 111.4959 | | pit_b_224 | 64 | 154.8762 | 155.6825 | 188.5666 | 159.256 | 121.1729 | 121.4731 | | mobilevit_s | 64 | 90.001 | 107.5898 | 133.6958 | nan | 120.4938 | 110.164 | | sebotnet33ts_256 | 64 | 83.5157 | 96.1638 | 118.4585 | 83.0469 | 115.4884 | 122.9392 | | fbnetv3_b | 128 | 121.3262 | 122.6867 | 150.1218 | nan | 115.1596 | 111.4516 | | cait_m36_384 | 4 | 166.6868 | 202.2293 | nan | nan | 112.7174 | 117.1519 | | beit_base_patch16_224 | 64 | 135.1268 | 137.868 | nan | nan | 111.7814 | 112.3391 | | cspdarknet53 | 64 | 96.0032 | 96.7962 | 119.7237 | 79.2983 | 109.8749 | 111.3955 | | vit_base_patch16_224 | 64 | 120.6907 | 121.2998 | 144.4813 | 134.3541 | 106.8588 | 108.2305 | | repvgg_a2 | 128 | 79.6864 | 80.4197 | 94.223 | 70.288 | 101.8245 | 99.1143 | | convnext_base | 64 | 121.5803 | 121.793 | 151.6839 | nan | 101.2612 | 104.6234 | | dm_nfnet_f0 | 128 | 131.172 | 131.2116 | 148.8832 | 142.5257 | 101.1815 | 103.7436 | | tf_efficientnet_b0 | 128 | 90.9364 | 108.3889 | 
131.5152 | 92.3604 | 100.5724 | 106.0497 | | swin_base_patch4_window7_224 | 64 | 147.4127 | 153.4199 | nan | nan | 100.2435 | 101.1735 | | mixer_b16_224 | 128 | 118.6987 | 118.9762 | 147.833 | 131.64 | 92.9944 | 94.2506 | | gernet_l | 128 | 79.6669 | 80.5258 | 98.6784 | 71.1897 | 91.9712 | 94.7705 | | tinynet_a | 128 | 75.4203 | 94.1188 | 111.5523 | 94.6595 | 91.1478 | 93.6034 | | gmlp_s16_224 | 128 | 136.2945 | 136.9821 | 173.527 | 134.1638 | 88.6917 | 90.2133 | | jx_nest_base | 32 | 119.3048 | 120.1374 | 148.1263 | nan | 85.3344 | 87.476 | | ese_vovnet19b_dw | 128 | 67.8736 | 68.3552 | 85.9866 | 58.9183 | 85.2963 | 100.6764 | | volo_d1_224 | 64 | 134.6462 | 135.2992 | 159.2025 | nan | 84.2736 | 85.8346 | | nfnet_l0 | 128 | 106.1356 | 130.5619 | 148.665 | 124.9344 | 82.762 | 86.5599 | | crossvit_9_240 | 128 | 109.1796 | 109.6847 | 130.5031 | 119.0816 | 82.5637 | 84.1938 | | fbnetc_100 | 128 | 88.0002 | 88.9811 | 105.5185 | 71.9883 | 77.553 | 92.0757 | | gmixer_24_224 | 128 | 119.8884 | 136.1449 | 166.5415 | 129.6803 | 77.4502 | 79.6067 | | mobilenetv2_100 | 128 | 67.6156 | 68.3582 | 89.0555 | 59.7697 | 76.9868 | 98.4838 | | deit_base_distilled_patch16_224 | 64 | 94.2163 | 94.9393 | 118.0714 | 96.8372 | 73.496 | 74.6797 | | spnasnet_100 | 128 | 76.7778 | 77.4042 | 93.2783 | 66.6985 | 71.7529 | 72.3845 | | selecsls42b | 128 | 62.9099 | 63.1587 | 74.677 | 48.9028 | 70.1262 | 69.549 | | twins_pcpvt_base | 64 | 124.7243 | 137.2149 | 138.1791 | nan | 70.054 | 76.7558 | | coat_lite_mini | 128 | 116.0761 | 116.5335 | 137.1014 | 100.5763 | 64.5716 | 65.9748 | | mnasnet_100 | 128 | 70.1539 | 70.8226 | 84.9519 | 55.9177 | 61.3709 | 62.7822 | | xcit_large_24_p8_224 | 5 | 125.369 | nan | nan | nan | 61.3661 | 75.3078 | | resmlp_12_224 | 128 | 68.2517 | 68.4352 | 87.1849 | 45.7688 | 61.154 | 62.14 | | ghostnet_100 | 128 | 95.269 | 97.7691 | 107.9392 | 95.2456 | 60.0886 | 67.2124 | | mobilenetv3_large_100 | 128 | 65.9704 | 66.7281 | 80.5805 | 64.5082 | 58.9767 | 58.2766 | | 
regnety_002 | 128 | 53.9643 | 58.5418 | 47.6851 | 62.7317 | 40.3054 | 43.7991 | | lcnet_050 | 128 | 34.3501 | 34.5586 | 38.765 | 31.8657 | 22.3011 | 24.7471 | | eca_halonext26ts | 128 | 115.9141 | 139.5016 | 167.8366 | nan | nan | nan | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/huggingface_amp.png : ![](https://i.imgur.com/ZJcTbzw.png)

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/DSRfIpw.png)

bench_logs/timm_models_amp.png : ![](https://i.imgur.com/x1fr10q.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats
1) Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 55/56 | 100%, 46/46 | 100%, 61/61 |
|       aot_eager        | 95%, 53/56 | 100%, 46/46 | 100%, 61/61 |
|     aot_cudagraphs     | 75%, 42/56 | 35%, 16/46  | 46%, 28/61  |
|    nvprims_nvfuser     | 77%, 43/56 | 61%, 28/46  | 67%, 41/61  |
|        inductor        | 84%, 47/56 | 85%, 39/46  | 95%, 58/61  |
| inductor_no_cudagraphs | 89%, 50/56 | 93%, 43/46  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
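
Each pass-rate cell above pairs a rounded percentage with the raw counts. A minimal sketch of how such a cell could be rendered, assuming the percentage is rounded to the nearest integer; the helper name `format_passrate` is hypothetical and not taken from the dashboard code:

```python
def format_passrate(passed: int, total: int) -> str:
    """Render a pass rate as 'NN%, passed/total' (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: the percentage is shown with no decimal places.
    percent = 100 * passed / total
    return f"{percent:.0f}%, {passed}/{total}"


# Counts taken from the torchbench column of the table above.
print(format_passrate(55, 56))
print(format_passrate(42, 56))
```

Keeping the raw counts next to the percentage makes it visible when the denominator changes between reports.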

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.01x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.12x    |    1.00x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.04x    |    1.14x    |
|        inductor        |   1.46x    |    1.23x    |    1.23x    |
| inductor_no_cudagraphs |   1.23x    |    1.22x    |    1.23x    |
+------------------------+------------+-------------+-------------+
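
The values above are geometric means of the per-model speedups in each suite. A minimal sketch of that aggregation, assuming that entries with a 0.0 speedup (failed runs) are dropped before averaging; `geomean_speedup` is an illustrative name, not the dashboard's own function:

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups.

    Assumption: non-positive or NaN entries mark failed runs and are skipped.
    """
    valid = [s for s in speedups if s > 0 and not math.isnan(s)]
    if not valid:
        return float("nan")
    # Average in log space, then map back.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Illustrative input: one 2x speedup, one 0.5x slowdown, one failed run.
print(geomean_speedup([2.0, 0.5, 0.0]))
```

The geometric mean suits ratios because a 2x speedup and a 0.5x slowdown cancel to 1.0x, whereas an arithmetic mean would report 1.25x.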

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.71    |    2.81     |    2.02     |
|       aot_eager        |    5.86    |    8.96     |    7.76     |
|     aot_cudagraphs     |    8.49    |    16.03    |    13.40    |
|    nvprims_nvfuser     |   60.36    |    87.50    |   139.45    |
|        inductor        |   32.95    |    36.69    |    37.10    |
| inductor_no_cudagraphs |   32.72    |    31.66    |    36.09    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.92x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.90x    |    1.01x    |    0.95x    |
|        inductor        |   0.83x    |    0.74x    |    0.97x    |
| inductor_no_cudagraphs |   0.99x    |    1.00x    |    1.09x    |
+------------------------+------------+-------------+-------------+

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the two most recent reports that actually ran the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Previous report name: /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_float32_205

Passrate diff
~~~
+------------------------+-------------+------------+------------+
| compiler               | suite       | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
| inductor               | torchbench  | 82%, 46/56 | 84%, 47/56 |
| inductor               | huggingface | 83%, 39/47 | 87%, 40/46 |
| inductor               | timm_models | 93%, 57/61 | 93%, 57/61 |
| inductor_no_cudagraphs | torchbench  | 89%, 50/56 | 89%, 50/56 |
| inductor_no_cudagraphs | huggingface | 87%, 41/47 | 93%, 43/46 |
| inductor_no_cudagraphs | timm_models | 93%, 57/61 | 93%, 57/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff
~~~
+------------------------+-------------+------------+-----------+
| compiler               | suite       | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
| inductor               | torchbench  | 1.47x      | 1.47x     |
| inductor               | huggingface | 1.23x      | 1.23x     |
| inductor               | timm_models | 1.23x      | 1.23x     |
| inductor_no_cudagraphs | torchbench  | 1.23x      | 1.23x     |
| inductor_no_cudagraphs | huggingface | 1.22x      | 1.21x     |
| inductor_no_cudagraphs | timm_models | 1.23x      | 1.23x     |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+---------------------------------+---------------+------------------------+
| suite       | name                            | inductor      | inductor_no_cudagraphs |
+-------------+---------------------------------+---------------+------------------------+
| torchbench  | tacotron2                       | fail_to_run   | pass                   |
| torchbench  | functorch_dp_cifar10            | fail_to_run   | fail_to_run            |
| torchbench  | hf_Longformer                   | fail_to_run   | fail_to_run            |
| torchbench  | hf_BigBird                      | fail_to_run   | fail_to_run            |
| torchbench  | moco                            | fail_to_run   | fail_to_run            |
| torchbench  | resnet50_quantized_qat          | fail_accuracy | fail_accuracy          |
| torchbench  | mobilenet_v2_quantized_qat      | fail_accuracy | fail_accuracy          |
| torchbench  | vision_maskrcnn                 | 0.0000        | 0.0000                 |
| huggingface | DebertaV2ForQuestionAnswering   | fail_to_run   | pass                   |
| huggingface | PLBartForConditionalGeneration  | fail_to_run   | fail_to_run            |
| huggingface | MBartForConditionalGeneration   | fail_to_run   | fail_to_run            |
| huggingface | AllenaiLongformerBase           | fail_to_run   | fail_to_run            |
| timm_models | deit_base_distilled_patch16_224 | pass          | fail_accuracy          |
| timm_models | fbnetv3_b                       | fail_accuracy | fail_accuracy          |
| timm_models | resnest101e                     | fail_accuracy | fail_accuracy          |
+-------------+---------------------------------+---------------+------------------------+
~~~

Performance speedup warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | lennard_jones                 | 1.7673   | 0.9178                 |
| torchbench  | soft_actor_critic             | 1.4018   | 0.9094                 |
| torchbench  | dlrm                          | 0.947    | 1.2426                 |
| torchbench  | nvidia_deeprecommender        | 0.8356   | 0.8839                 |
| torchbench  | hf_GPT2_large                 | 0.0      | 1.4709                 |
| torchbench  | hf_T5                         | 0.0      | 1.5271                 |
| torchbench  | tacotron2                     | 0.0      | 0.9022                 |
| torchbench  | functorch_dp_cifar10          | 0.0      | 0.0                    |
| torchbench  | hf_Longformer                 | 0.0      | 0.0                    |
| torchbench  | hf_BigBird                    | 0.0      | 0.0                    |
| torchbench  | moco                          | 0.0      | 0.0                    |
| huggingface | DebertaV2ForMaskedLM          | 0.9912   | 0.8262                 |
| huggingface | DebertaV2ForQuestionAnswering | 0.9218   | 0.8877                 |
| huggingface | TrOCRForCausalLM              | 0.0      | 1.0215                 |
| huggingface | AlbertForMaskedLM             | 0.0      | 1.2543                 |
| huggingface | BlenderbotForCausalLM         | 0.0      | 1.0151                 |
| huggingface | AllenaiLongformerBase         | 0.0      | 0.0                    |
| timm_models | tnt_s_patch16_224             | 0.0      | 1.5101                 |
+-------------+-------------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | yolov3                        | 362.526  | 361.0539               |
| torchbench  | timm_efficientdet             | 121.3491 | 119.775                |
| huggingface | XLNetLMHeadModel              | 140.3607 | 165.5727               |
| huggingface | DebertaV2ForQuestionAnswering | 133.3495 | 51.0009                |
| huggingface | DebertaV2ForMaskedLM          | 133.2589 | 50.3365                |
+-------------+-------------------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0014                 |
| torchbench  | mobilenet_v3_large                      | 0.8675   | 0.896                  |
| torchbench  | hf_T5_large                             | 0.8643   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8621   | 1.031                  |
| torchbench  | resnet50                                | 0.8566   | 0.9343                 |
| torchbench  | densenet121                             | 0.8562   | 1.0006                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | pytorch_unet                            | 0.8484   | 1.0138                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.1229                 |
| torchbench  | hf_Bart                                 | 0.8325   | 1.1284                 |
| torchbench  | resnext50_32x4d                         | 0.8303   | 0.8352                 |
| torchbench  | BERT_pytorch                            | 0.8263   | 1.0815                 |
| torchbench  | dlrm                                    | 0.7932   | 0.8152                 |
| torchbench  | hf_Albert                               | 0.7685   | 1.2076                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7609   | 0.9526                 |
| torchbench  | timm_vision_transformer                 | 0.7507   | 0.8214                 |
| torchbench  | soft_actor_critic                       | 0.7501   | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8335                 |
| torchbench  | LearningToPaint                         | 0.7133   | 0.9399                 |
| torchbench  | hf_Bert                                 | 0.7061   | 1.0275                 |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | vgg16                                   | 0.6637   | 0.9553                 |
| torchbench  | hf_DistilBert                           | 0.6595   | 0.9466                 |
| torchbench  | hf_Reformer                             | 0.577    | 1.0026                 |
| torchbench  | lennard_jones                           | 0.5647   | 0.9991                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4867   | 0.6781                 |
| torchbench  | pytorch_struct                          | 0.4213   | 0.4334                 |
| torchbench  | dcgan                                   | 0.2564   | 0.2576                 |
| huggingface | YituTechConvBert                        | 0.894    | 0.9822                 |
| huggingface | DistillGPT2                             | 0.8939   | 1.0108                 |
| huggingface | M2M100ForConditionalGeneration          | 0.874    | 1.0131                 |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4307                 |
| huggingface | PegasusForConditionalGeneration         | 0.8637   | 1.0262                 |
| huggingface | PLBartForCausalLM                       | 0.8367   | 1.0581                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | T5ForConditionalGeneration              | 0.8129   | 1.1049                 |
| huggingface | T5Small                                 | 0.8129   | 1.1049                 |
| huggingface | ElectraForCausalLM                      | 0.7929   | 0.9036                 |
| huggingface | MBartForConditionalGeneration           | 0.7896   | 0.9837                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.9692                 |
| huggingface | MT5ForConditionalGeneration             | 0.7748   | 0.9324                 |
| huggingface | BartForConditionalGeneration            | 0.7734   | 0.958                  |
| huggingface | MegatronBertForQuestionAnswering        | 0.7709   | 1.0379                 |
| huggingface | MegatronBertForCausalLM                 | 0.7673   | 1.0153                 |
| huggingface | MBartForCausalLM                        | 0.7326   | 0.9478                 |
| huggingface | RobertaForQuestionAnswering             | 0.7273   | 1.0274                 |
| huggingface | BertForQuestionAnswering                | 0.7273   | 1.0273                 |
| huggingface | LayoutLMForSequenceClassification       | 0.7189   | 1.0294                 |
| huggingface | BartForCausalLM                         | 0.7149   | 0.9466                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.7147   | 0.8647                 |
| huggingface | ElectraForQuestionAnswering             | 0.7054   | 1.0297                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977   | 0.946                  |
| huggingface | LayoutLMForMaskedLM                     | 0.695    | 0.9772                 |
| huggingface | BertForMaskedLM                         | 0.6945   | 0.9772                 |
| huggingface | CamemBert                               | 0.6942   | 0.9746                 |
| huggingface | RobertaForCausalLM                      | 0.6942   | 0.9771                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.675    | 0.9168                 |
| huggingface | DistilBertForQuestionAnswering          | 0.6589   | 0.9118                 |
| huggingface | DistilBertForMaskedLM                   | 0.6509   | 0.9194                 |
| huggingface | DebertaV2ForMaskedLM                    | 0.5682   | 0.9491                 |
| huggingface | MobileBertForMaskedLM                   | 0.4951   | 0.6649                 |
| huggingface | DebertaV2ForQuestionAnswering           | 0.4735   | 0.984                  |
| huggingface | MobileBertForQuestionAnswering          | 0.4145   | 0.535                  |
| huggingface | DebertaForMaskedLM                      | 0.3862   | 1.0347                 |
| huggingface | DebertaForQuestionAnswering             | 0.2902   | 1.1339                 |
| huggingface | BlenderbotForCausalLM                   | nan      | 0.8509                 |
| timm_models | selecsls42b                             | 0.899    | 1.0046                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8931   | 0.9946                 |
| timm_models | res2net50_14w_8s                        | 0.8821   | 1.0206                 |
| timm_models | regnety_002                             | 0.8617   | 1.0396                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.9622                 |
| timm_models | pit_b_224                               | 0.8525   | 1.0752                 |
| timm_models | convnext_base                           | 0.8485   | 1.0335                 |
| timm_models | sebotnet33ts_256                        | 0.8189   | 0.9416                 |
| timm_models | resmlp_12_224                           | 0.8169   | 0.8253                 |
| timm_models | coat_lite_mini                          | 0.8154   | 1.0235                 |
| timm_models | gernet_l                                | 0.7928   | 0.9926                 |
| timm_models | repvgg_a2                               | 0.7684   | 0.9902                 |
| timm_models | convit_base                             | 0.7449   | 0.9008                 |
| timm_models | crossvit_9_240                          | 0.6745   | 0.9137                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.8633                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
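
The four flagging rules in this Warnings section can be expressed as a small predicate. The following is a sketch only: the `ModelResult` record and `warning_flags` function are hypothetical names, the set of passing accuracy statuses is an assumption, and how the real dashboard treats `nan` entries is not stated here (this sketch leaves them unflagged).

```python
from dataclasses import dataclass

# Thresholds as stated in the Warnings section.
MIN_SPEEDUP = 0.95
MAX_COMPILE_SEC = 120.0
MIN_COMPRESSION = 0.9


@dataclass
class ModelResult:
    """Hypothetical per-model record for one compiler."""
    name: str
    accuracy: str        # e.g. "pass", "fail_to_run", "fail_accuracy"
    speedup: float       # 0.0 typically signifies a failed performance test
    compile_sec: float
    compression: float   # peak memory compression ratio


def warning_flags(result):
    """Return the list of warning categories that apply to one result."""
    flags = []
    # Assumption: only these two statuses count as passing.
    if result.accuracy not in ("pass", "pass_due_to_skip"):
        flags.append("accuracy")
    if result.speedup < MIN_SPEEDUP:
        flags.append("speedup")
    if result.compile_sec > MAX_COMPILE_SEC:
        flags.append("compilation_latency")
    # Comparisons with NaN are False, so a missing ratio is not flagged here.
    if result.compression < MIN_COMPRESSION:
        flags.append("compression_ratio")
    return flags


# Illustrative record; the compile time is made up for the example.
example = ModelResult("example_model", "pass", 0.8356, 30.0, 0.5598)
print(warning_flags(example))
```

A model appears in a warnings table as soon as either compiler column trips the rule, which is why some rows show one value inside the acceptable range.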

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565

Performance speedup regressions
~~~
+------------------------+---------------+-------------+------------+
| compiler               | name          | prev_status | cur_status |
+------------------------+---------------+-------------+------------+
| inductor               | dlrm          | 0.9839      | 0.947      |
| inductor_no_cudagraphs | lennard_jones | 0.9683      | 0.9178     |
+------------------------+---------------+-------------+------------+
~~~
No regressions found.
### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565

Performance speedup regressions
~~~
+----------+-------------------+-------------+------------+
| compiler | name              | prev_status | cur_status |
+----------+-------------------+-------------+------------+
| inductor | AlbertForMaskedLM | 1.2523      | 0.0        |
+----------+-------------------+-------------+------------+
~~~
No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565

No regressions found.
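
The regression check described in this section (a model that was not flagged in the previous report but is flagged in the current one) can be sketched for the speedup metric as follows. This assumes each report is reduced to a plain `name -> speedup` mapping and reuses the 0.95x threshold from the Warnings section; `speedup_regressions` is a hypothetical name, not the dashboard's own function.

```python
SPEEDUP_THRESHOLD = 0.95  # from the Warnings section


def speedup_regressions(prev, cur, threshold=SPEEDUP_THRESHOLD):
    """Return (name, prev_value, cur_value) for newly flagged models.

    A model counts as a regression only if it was at or above the threshold
    in the previous report and is below it in the current one.
    """
    rows = []
    for name, cur_value in cur.items():
        prev_value = prev.get(name)
        if prev_value is None:
            continue  # assumption: models missing from the previous report are skipped
        if prev_value >= threshold and cur_value < threshold:
            rows.append((name, prev_value, cur_value))
    return rows


# dlrm and AlbertForMaskedLM values come from the tables in this section;
# "already_flagged_model" is a made-up entry showing a non-regression.
prev_report = {"dlrm": 0.9839, "AlbertForMaskedLM": 1.2523, "already_flagged_model": 0.90}
cur_report = {"dlrm": 0.947, "AlbertForMaskedLM": 0.0, "already_flagged_model": 0.88}
for row in speedup_regressions(prev_report, cur_report):
    print(row)
```

Models that were already below the threshold stay in the Warnings tables but are not repeated as regressions.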

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0025 | 1.0275 | 2.3361 | 0.7808 | 5.209 | 1.2747 | | timm_efficientdet | 1 | 0.9816 | 0.8955 | 1.8669 | 0.7662 | 4.3559 | 1.5726 | | timm_vision_transformer | 8 | 1.0046 | 0.9465 | 1.5352 | 0.6799 | 2.5942 | 1.3962 | | drq | 1 | 0.9855 | 0.8657 | 1.6341 | 0.7277 | 2.4727 | 1.0992 | | BERT_pytorch | 16 | 1.0159 | 0.8944 | 1.1385 | 0.9554 | 2.1253 | 2.1032 | | resnext50_32x4d | 8 | 1.0035 | 1.1042 | 1.3286 | 0.8015 | 1.9937 | 1.2068 | | mobilenet_v3_large | 32 | 1.0011 | 1.1114 | 1.0275 | 0.8708 | 1.9811 | 1.3494 | | squeezenet1_1 | 32 | 0.9959 | 1.0134 | 1.0535 | 0.8761 | 1.9556 | 1.2956 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9948 | 1.0256 | 1.3399 | 0.8852 | 1.9129 | 1.5654 | | dcgan | 32 | 0.9655 | 1.0235 | 1.256 | 0.801 | 1.8775 | 1.0102 | | resnet18 | 16 | 0.9998 | 1.1397 | 1.1762 | 0.8792 | 1.849 | 1.2444 | | pytorch_struct | 200 | 0.9885 | 0.7566 | 0.8857 | 0.7708 | 1.7802 | 1.1359 | | lennard_jones | 1000 | 0.9428 | 0.8459 | 1.0108 | 0.6696 | 1.7673 | 0.9178 | | hf_Albert | 8 | 0.9998 | 0.9977 | 0.7518 | 1.5582 | 1.6629 | 1.6469 | | hf_T5_large | 2 | 1.0245 | 0.9117 | 0.0 | 0.0 | 1.6442 | 1.6399 | | shufflenet_v2_x1_0 | 128 | 1.0003 | 1.0605 | 0.811 | 0.8958 | 1.6062 | 1.4271 | | timm_resnest | 32 | 0.999 | 1.0021 | 0.8038 | 1.1644 | 1.5195 | 1.4501 | | hf_GPT2 | 4 | 1.0087 | 0.9777 | 0.7412 | 0.3973 | 1.4922 | 1.5011 | | mnasnet1_0 | 32 | 0.9998 | 1.0931 | 0.8573 | 0.9208 | 1.4684 | 1.3165 | | mobilenet_v2 | 96 | 0.9995 | 0.9992 | 0.7308 | 1.3355 | 1.428 | 1.403 | | fastNLP_Bert | 6 | 0.997 | 0.9776 | 0.7515 | 1.1492 | 1.4137 | 1.3825 | | 
speech_transformer | 32 | 0.992 | 0.8849 | 1.3531 | 0.743 | 1.4062 | 1.4291 | | soft_actor_critic | 256 | 0.9521 | 0.7747 | 1.0435 | 0.6535 | 1.4018 | 0.9094 | | timm_efficientnet | 32 | 0.9567 | 0.8185 | 0.7037 | 0.8231 | 1.3433 | 1.1964 | | pytorch_stargan | 16 | 0.9987 | 1.0778 | 0.9339 | 0.0 | 1.2684 | 1.231 | | LearningToPaint | 96 | 1.0001 | 1.0501 | 0.8517 | 0.9767 | 1.2661 | 1.1918 | | resnet152 | 32 | 1.0014 | 1.061 | 0.8049 | 0.8964 | 1.232 | 1.2066 | | hf_Bert | 4 | 1.0266 | 1.0007 | 0.7366 | 0.861 | 1.2147 | 1.1885 | | resnet50 | 32 | 0.9988 | 0.9922 | 0.7581 | 0.9507 | 1.2038 | 1.1676 | | timm_nfnet | 128 | 0.9993 | 1.0 | 0.0 | 1.1323 | 1.1947 | 1.1628 | | hf_Bart | 4 | 1.013 | 0.9739 | 0.739 | 0.8432 | 1.194 | 1.2051 | | pytorch_unet | 1 | 0.9997 | 0.2809 | 0.0 | 0.0 | 1.192 | 1.1779 | | hf_DistilBert | 8 | 1.0001 | 0.9561 | 0.6865 | 0.5211 | 1.176 | 1.1826 | | vgg16 | 64 | 0.9997 | 0.9987 | 0.8582 | 0.9976 | 1.1742 | 1.1693 | | alexnet | 128 | 0.9989 | 0.9972 | 0.8024 | 1.0055 | 1.1574 | 1.1607 | | Super_SloMo | 6 | 0.9999 | 0.2409 | 0.0 | 0.2467 | 1.1535 | 1.1388 | | hf_Reformer | 4 | 0.997 | 1.0013 | 0.9882 | 0.7357 | 1.1316 | 1.1433 | | timm_regnet | 32 | 0.9647 | 0.9619 | 0.7772 | 1.0941 | 1.13 | 1.093 | | yolov3 | 16 | 0.9997 | 0.9952 | 0.7903 | 1.1527 | 1.0872 | 1.0737 | | Background_Matting | 4 | 1.0002 | 0.1909 | 0.0 | 0.0 | 1.0871 | 1.0785 | | mobilenet_v2_quantized_qat | 96 | 1.0017 | 0.9777 | 0.0 | 1.4568 | 1.0732 | 1.0616 | | attention_is_all_you_need_pytorch | 256 | 1.0011 | 0.9712 | 0.7559 | 0.9532 | 1.0544 | 1.0398 | | timm_vision_transformer_large | 8 | 0.9994 | 0.9937 | 0.0 | 0.0 | 1.0461 | 1.0324 | | timm_vovnet | 32 | 0.911 | 0.9048 | 0.7131 | 0.8998 | 1.0039 | 1.02 | | tts_angular | 64 | 0.9784 | 0.9507 | 0.9723 | 0.9638 | 1.0033 | 1.0046 | | resnet50_quantized_qat | 32 | 1.001 | 0.9669 | 0.0 | 1.1532 | 1.0024 | 0.9587 | | demucs | 4 | 1.0 | 0.9997 | 1.0003 | 0.9994 | 0.9996 | 0.9999 | | dlrm | 1024 | 1.3777 | 0.7229 | 0.0 | 
0.9474 | 0.947 | 1.2426 | | nvidia_deeprecommender | 256 | 0.9989 | 0.963 | 0.5853 | 0.9757 | 0.8356 | 0.8839 | | hf_GPT2_large | 4 | 1.0002 | 0.9806 | 0.0 | 0.0 | 0.0 | 1.4709 | | hf_T5 | 8 | 1.0013 | 0.9529 | 0.0 | 1.1682 | 0.0 | 1.5271 | | tacotron2 | 64 | 0.9543 | 0.8534 | 0.0 | 0.7536 | 0.0 | 0.9022 | | functorch_dp_cifar10 | 64 | 0.9991 | 1.0259 | 2.1533 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 2 | 0.924 | 0.8695 | 0.7931 | 0.0 | 0.0 | 0.0 | | hf_BigBird | 2 | 0.9516 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | 
pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | pass | pass | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | fail_to_run | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | resnet152 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | hf_Albert | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | 
pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | resnet50_quantized_qat | 2 | pass | pass | 0.0000 | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | 0.0000 | fail_accuracy | fail_accuracy | fail_accuracy | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 1.6244 | 6.4132 | 9.4638 | 110.2689 | 362.526 | 361.0539 | | timm_efficientdet | 1 | 
10.2479 | 25.8479 | 54.0186 | 440.4384 | 121.3491 | 119.775 | | hf_T5_large | 2 | 13.5563 | 36.6451 | nan | nan | 111.3235 | 108.7207 | | mobilenet_v2_quantized_qat | 96 | 1.3724 | 8.581 | nan | 179.6501 | 85.1732 | 85.2512 | | resnet50_quantized_qat | 32 | 1.2045 | 8.3413 | nan | 167.7879 | 70.2078 | 70.6077 | | timm_nfnet | 128 | 2.1615 | 6.7839 | nan | 142.3706 | 60.432 | 59.806 | | timm_efficientnet | 32 | 1.8864 | 6.1602 | 13.8124 | 106.3218 | 60.4315 | 61.0788 | | timm_resnest | 32 | 0.6085 | 2.2276 | 3.3015 | 60.3906 | 58.6027 | 57.8885 | | timm_vision_transformer_large | 8 | 2.5451 | 12.3905 | nan | nan | 53.6338 | 53.022 | | mobilenet_v3_large | 32 | 0.8981 | 4.1806 | 6.1497 | 97.5414 | 50.4259 | 48.6422 | | timm_regnet | 32 | 2.335 | 7.2521 | 16.6693 | 110.5907 | 45.2966 | 43.2784 | | densenet121 | 4 | 2.1764 | 10.9128 | 16.7846 | 148.1931 | 44.9095 | 42.6823 | | attention_is_all_you_need_pytorch | 256 | 1.2702 | 6.276 | 9.4972 | 96.781 | 43.5029 | 42.683 | | resnet152 | 32 | 2.4294 | 11.9022 | 18.9057 | 157.9437 | 41.9148 | 41.3479 | | timm_vision_transformer | 8 | 0.8531 | 3.7494 | 5.2716 | 75.675 | 32.73 | 31.7699 | | hf_Bart | 4 | 1.7938 | 7.3882 | 10.7619 | 104.1061 | 30.2263 | 29.9771 | | BERT_pytorch | 16 | 1.4646 | 6.4991 | 9.551 | 76.4026 | 27.2064 | 26.75 | | fastNLP_Bert | 6 | 1.5502 | 6.0174 | 9.0 | 72.0367 | 27.1253 | 25.7772 | | pytorch_stargan | 16 | 0.4239 | 1.904 | 2.625 | nan | 26.744 | 25.6396 | | speech_transformer | 32 | 1.6926 | 7.2451 | 24.4905 | 101.2624 | 24.7412 | 23.9783 | | pytorch_struct | 200 | 0.2583 | 0.6848 | 1.2279 | 4.8331 | 22.3922 | 22.1335 | | Super_SloMo | 6 | 1.1417 | 6.9757 | nan | 55.2006 | 19.736 | 19.3573 | | hf_Bert | 4 | 1.6252 | 5.7907 | 8.2819 | 77.4122 | 19.5637 | 19.1868 | | mnasnet1_0 | 32 | 0.829 | 3.818 | 5.6469 | 70.0017 | 18.8237 | 18.4965 | | hf_Albert | 8 | 1.2987 | 5.203 | 7.949 | 103.5644 | 18.533 | 17.7894 | | hf_Reformer | 4 | 1.7077 | 2.8892 | 5.0662 | 15.0922 | 18.3774 | 15.8063 | | 
timm_vovnet | 32 | 1.4387 | 4.0385 | 8.8071 | 57.6161 | 18.3264 | 17.6522 | | hf_GPT2 | 4 | 1.6324 | 5.554 | 8.1491 | 64.2714 | 18.2718 | 17.372 | | resnet50 | 32 | 0.8809 | 4.1529 | 6.219 | 76.8176 | 18.2428 | 17.5044 | | shufflenet_v2_x1_0 | 128 | 0.9675 | 4.5755 | 6.8255 | 84.8871 | 18.1676 | 17.6963 | | resnext50_32x4d | 8 | 0.9099 | 4.1031 | 6.0875 | 66.9129 | 17.5956 | 17.3751 | | Background_Matting | 4 | 0.7609 | 8.2745 | nan | nan | 17.4002 | 17.6479 | | mobilenet_v2 | 96 | 0.8347 | 4.1157 | 6.2122 | 93.241 | 16.8814 | 16.7187 | | hf_DistilBert | 8 | 0.6752 | 2.7496 | 4.9001 | 46.5885 | 12.096 | 11.9121 | | resnet18 | 16 | 0.4203 | 1.6258 | 2.3462 | 30.4946 | 10.7908 | 10.5666 | | pytorch_unet | 1 | 0.4652 | 2.9453 | nan | nan | 9.0229 | 9.1561 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.3765 | 1.766 | 2.4188 | 32.5731 | 8.3053 | 8.244 | | LearningToPaint | 96 | 0.4422 | 1.698 | 2.547 | 38.6462 | 7.1252 | 6.8195 | | dcgan | 32 | 0.173 | 0.3938 | 0.5934 | 4.5574 | 5.978 | 5.8136 | | squeezenet1_1 | 32 | 0.2269 | 0.7142 | 1.0474 | 4.6575 | 4.9613 | 4.6961 | | vgg16 | 64 | 0.1747 | 0.5152 | 0.8139 | 3.2087 | 3.7701 | 3.6199 | | drq | 1 | 0.3123 | 0.5342 | 0.8762 | 4.6611 | 3.6989 | 3.2766 | | soft_actor_critic | 256 | 0.2122 | 0.3229 | 0.5266 | 1.7796 | 3.4944 | 2.7416 | | alexnet | 128 | 0.1552 | 0.346 | 0.5797 | 3.1147 | 3.3913 | 3.4157 | | nvidia_deeprecommender | 256 | 0.1901 | 0.3688 | 0.6108 | 6.3969 | 3.2669 | 3.012 | | dlrm | 1024 | 0.2592 | 0.5741 | nan | 3.2611 | 2.9421 | 2.6802 | | lennard_jones | 1000 | 0.141 | 0.2593 | 0.405 | 1.5349 | 2.2544 | 1.766 | | tts_angular | 64 | 0.1815 | 0.2289 | 0.3642 | 1.176 | 1.8855 | 1.6496 | | demucs | 4 | 0.3007 | 0.3075 | 0.3074 | 0.3085 | 0.2105 | 0.2147 | | hf_GPT2_large | 4 | 5.6183 | 17.6574 | nan | nan | nan | 45.0548 | | tacotron2 | 64 | 4.38 | 16.0713 | nan | 47.9167 | nan | 44.8092 | | hf_T5 | 8 | 2.5565 | 8.2467 | nan | 68.4548 | nan | 28.1853 | | hf_Longformer | 2 | 6.4666 | 13.8639 | 55.7102 | nan | nan 
| nan | | functorch_dp_cifar10 | 64 | 0.3038 | 1.1973 | 1.774 | nan | nan | nan | | hf_BigBird | 2 | 3.6011 | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.5274 | 1.5274 | | resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.226 | 1.4604 | 1.4599 | | timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2634 | 0.988 | 1.3048 | 1.3922 | | mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1744 | 1.2832 | | timm_efficientdet | 1 | 1.0111 | 0.823 | 0.2889 | 1.1347 | 1.1153 | 1.1438 | | Super_SloMo | 6 | 1.0024 | 0.9018 | nan | 0.9454 | 1.1138 | 1.3409 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 1.0823 | 1.1864 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 1.0433 | 1.1066 | | speech_transformer | 32 | 0.9974 | 0.9772 | 0.2739 | 1.1206 | 1.0396 | 1.0444 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | hf_GPT2 | 4 | 1.0 | 0.906 | 0.3702 | 1.1242 | 0.9703 | 1.1698 | | timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.7594 | 0.9436 | 1.0969 | | timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9405 | 1.0771 | | yolov3 | 16 | 0.9957 | 0.844 | 0.3341 | 0.8549 | 0.923 | 1.1042 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9162 | 0.392 | 0.8945 | 0.9183 | 0.9986 | | Background_Matting | 4 
| 0.9998 | 0.8154 | nan | nan | 0.9107 | 1.0395 | | resnet152 | 32 | 0.9975 | 0.9153 | 0.3424 | 0.8736 | 0.9066 | 0.9672 | | pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 | | timm_resnest | 32 | 0.9927 | 0.88 | 0.3235 | 0.7926 | 0.8982 | 1.0014 | | mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8098 | 0.8675 | 0.896 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8643 | 0.922 | | timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 | | resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8566 | 0.9343 | | densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.8562 | 1.0006 | | mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8531 | 0.8659 | | pytorch_unet | 1 | 0.9985 | 0.8222 | nan | nan | 0.8484 | 1.0138 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 1.2124 | 0.8354 | 1.1229 | | hf_Bart | 4 | 1.0 | 0.8779 | 0.3387 | 1.0865 | 0.8325 | 1.1284 | | resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8196 | 0.8303 | 0.8352 | | BERT_pytorch | 16 | 1.0 | 0.8995 | 0.3502 | 1.1281 | 0.8263 | 1.0815 | | dlrm | 1024 | 0.8149 | 0.8149 | nan | 0.8147 | 0.7932 | 0.8152 | | hf_Albert | 8 | 1.0 | 0.949 | 0.2846 | 1.062 | 0.7685 | 1.2076 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 | | timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7609 | 0.9526 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3307 | 1.0652 | 0.7507 | 0.8214 | | soft_actor_critic | 256 | 0.9998 | 0.9638 | 0.4356 | 0.9637 | 0.7501 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8335 | | LearningToPaint | 96 | 0.9442 | 0.6917 | 0.3384 | 0.6268 | 0.7133 | 0.9399 | | hf_Bert | 4 | 1.0 | 0.9011 | 0.3524 | 1.0004 | 0.7061 | 1.0275 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6902 | 0.7049 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9553 | | hf_DistilBert | 8 | 1.0 | 0.9042 | 0.3211 | 1.0228 | 0.6595 | 0.9466 | | hf_Reformer | 4 | 0.9999 | 0.9996 | 
0.5934 | 0.9996 | 0.577 | 1.0026 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5647 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9676 | 0.4867 | 0.6781 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4213 | 0.4334 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.2564 | 0.2576 | | hf_GPT2_large | 4 | 1.0 | 0.8833 | nan | nan | nan | 1.1831 | | tacotron2 | 64 | 0.9903 | 1.0926 | nan | 1.114 | nan | 1.1617 | | hf_T5 | 8 | 1.0 | 0.9415 | nan | 0.9432 | nan | 1.1436 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | nan | nan | | hf_Longformer | 2 | 0.9999 | 0.9962 | 0.2947 | nan | nan | nan | | hf_BigBird | 2 | 0.907 | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | dlrm | 1024 | 177.713 | 213.1619 | nan | 232.2416 | 221.9955 | 174.8383 | | timm_vision_transformer_large | 8 | 196.3836 | 197.6102 | nan | nan | 188.463 | 190.8425 | | timm_nfnet | 128 | 206.2835 | 206.371 | nan | 181.7598 | 172.627 | 177.4642 | | Background_Matting | 4 | 186.7922 | 977.4423 | nan | nan | 171.8054 | 173.2929 | | mobilenet_v2_quantized_qat | 96 | 147.2276 | 151.7386 | nan | 101.5678 | 143.4336 | 139.8411 | | hf_T5_large | 2 | 189.6419 | 216.0389 | nan | nan | 119.9329 | 123.9996 | | Super_SloMo | 6 | 117.6228 | 487.6006 | nan | 476.2248 | 
102.0176 | 103.342 | | yolov3 | 16 | 102.6676 | 103.1097 | 129.6884 | 88.9924 | 94.4158 | 95.3483 | | resnet50_quantized_qat | 32 | 92.9111 | 97.5514 | nan | 81.511 | 94.2324 | 98.5377 | | vgg16 | 64 | 106.3564 | 106.5444 | 124.0789 | 106.6404 | 90.7155 | 91.0891 | | timm_regnet | 32 | 101.2083 | 101.5814 | 126.0897 | 89.7296 | 86.8796 | 89.5609 | | demucs | 4 | 77.6184 | 77.8643 | 77.7979 | 77.8339 | 77.6693 | 78.1739 | | resnet152 | 32 | 91.0617 | 85.914 | 114.015 | 102.4892 | 73.8212 | 76.5225 | | hf_Reformer | 4 | 83.3059 | 83.2248 | 84.1218 | 113.0912 | 73.6338 | 72.967 | | attention_is_all_you_need_pytorch | 256 | 72.9166 | 74.1294 | 95.5977 | 75.4859 | 68.5822 | 69.2267 | | mobilenet_v2 | 96 | 71.6032 | 71.7233 | 98.0478 | 53.513 | 50.073 | 50.9788 | | pytorch_unet | 1 | 58.6269 | 208.3619 | nan | nan | 49.1117 | 49.6952 | | hf_Bart | 4 | 54.7979 | 56.6918 | 74.8654 | 65.3475 | 46.0292 | 46.3014 | | hf_Albert | 8 | 75.116 | 75.3329 | 100.3858 | 48.0768 | 45.6167 | 45.5637 | | fastNLP_Bert | 6 | 59.8321 | 61.2315 | 79.561 | 52.0559 | 42.4263 | 43.153 | | timm_vovnet | 32 | 42.3504 | 42.6867 | 54.0402 | 42.8962 | 38.4272 | 37.7759 | | speech_transformer | 32 | 49.0414 | 55.919 | 35.9103 | 66.1899 | 36.0091 | 40.6785 | | hf_GPT2 | 4 | 49.8972 | 51.2094 | 68.3223 | 127.2035 | 33.7538 | 33.6315 | | hf_DistilBert | 8 | 38.8402 | 40.651 | 56.6314 | 74.5639 | 33.1576 | 32.9008 | | hf_Bert | 4 | 37.9299 | 39.1466 | 53.456 | 44.8876 | 32.547 | 33.0158 | | timm_efficientdet | 1 | 138.8569 | 153.4451 | 83.2613 | 184.3483 | 32.3767 | 91.6751 | | timm_efficientnet | 32 | 44.5184 | 52.3726 | 61.3854 | 52.303 | 32.3001 | 36.4309 | | resnet50 | 32 | 38.7734 | 39.055 | 51.3754 | 41.3645 | 32.2269 | 33.2152 | | shufflenet_v2_x1_0 | 128 | 37.2987 | 35.2521 | 46.2882 | 41.9697 | 23.5384 | 26.3221 | | BERT_pytorch | 16 | 45.1943 | 52.6473 | 42.4569 | 49.0746 | 22.9688 | 23.194 | | timm_resnest | 32 | 31.7546 | 31.5467 | 39.4371 | 27.2093 | 20.8694 | 21.8423 | | mnasnet1_0 | 32 | 
28.5136 | 26.0567 | 33.5078 | 31.2319 | 19.5702 | 22.9705 | | pytorch_stargan | 16 | 24.3312 | 22.5789 | 25.9614 | nan | 19.1299 | 19.7469 | | mobilenet_v3_large | 32 | 31.1028 | 28.4115 | 30.6043 | 36.9731 | 16.1682 | 24.0752 | | resnext50_32x4d | 8 | 26.6804 | 24.5882 | 22.4568 | 33.7872 | 13.5368 | 22.5627 | | densenet121 | 4 | 65.0011 | 65.4162 | 28.6961 | 86.4953 | 13.0263 | 54.0442 | | LearningToPaint | 96 | 15.5975 | 14.9134 | 18.4943 | 16.2603 | 12.4556 | 13.9434 | | alexnet | 128 | 12.4541 | 12.4668 | 15.4921 | 12.4137 | 10.7453 | 10.7377 | | nvidia_deeprecommender | 256 | 8.5592 | 8.8748 | 14.6109 | 8.7699 | 10.216 | 9.663 | | timm_vision_transformer | 8 | 23.3214 | 25.2789 | 15.7904 | 38.813 | 9.548 | 17.7359 | | tts_angular | 64 | 9.4285 | 9.5929 | 9.7465 | 9.648 | 9.3976 | 9.2549 | | pytorch_CycleGAN_and_pix2pix | 1 | 16.5134 | 16.6917 | 12.7275 | 21.2076 | 8.8043 | 10.9733 | | squeezenet1_1 | 32 | 12.6109 | 12.7589 | 12.1228 | 14.4252 | 6.9495 | 10.1817 | | resnet18 | 16 | 11.945 | 11.6149 | 10.4186 | 13.9126 | 6.6263 | 9.963 | | pytorch_struct | 200 | 3.8875 | 5.048 | 4.3462 | 4.9097 | 2.1213 | 3.4489 | | dcgan | 32 | 2.7116 | 2.6884 | 2.1406 | 3.4302 | 1.3674 | 2.6193 | | drq | 1 | 2.9934 | 3.3381 | 1.8194 | 5.2476 | 1.235 | 2.8792 | | soft_actor_critic | 256 | 1.0284 | 1.2946 | 0.9933 | 1.5621 | 0.7743 | 1.1652 | | lennard_jones | 1000 | 1.3228 | 1.2875 | 1.0941 | 1.6353 | 0.7394 | 1.2443 | | tacotron2 | 64 | 2898.0873 | 3163.6857 | nan | 3594.1961 | nan | 3213.9216 | | hf_GPT2_large | 4 | 240.9453 | 245.9949 | nan | nan | nan | 163.9049 | | hf_T5 | 8 | 182.8 | 191.8845 | nan | 156.6735 | nan | 119.964 | | hf_Longformer | 2 | 149.6575 | 159.0756 | 175.8956 | nan | nan | nan | | functorch_dp_cifar10 | 64 | 11.9133 | 11.5166 | 5.443 | nan | nan | nan | | hf_BigBird | 2 | 195.747 | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
~~~
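The tables in this dashboard are plain pipe-delimited text, so they can be loaded for ad hoc comparison with a few lines of Python. The following is a minimal sketch, not the dashboard's own tooling: the helper names `parse_ascii_table`, `to_float`, and `fastest_backend` are hypothetical, and the sample keeps only three of the backend columns from the torchbench "Absolute latency (ms)" table above. It assumes one table row per line and treats `nan` cells as missing measurements.

```python
import math


def parse_ascii_table(text):
    """Parse a '+---+' bordered, pipe-delimited table into a list of dicts.

    Assumes one table row per line; border lines are skipped.
    """
    header = None
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        # Border lines start with '+'; only lines starting with '|' are rows.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first row is the column header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def to_float(cell):
    """Return the cell as a float, or None for 'nan' and non-numeric cells."""
    try:
        value = float(cell)
    except ValueError:
        return None
    return None if math.isnan(value) else value


def fastest_backend(row, backends):
    """Return the backend with the lowest latency, ignoring missing values."""
    timed = {name: to_float(row[name]) for name in backends}
    timed = {name: ms for name, ms in timed.items() if ms is not None}
    return min(timed, key=timed.get) if timed else None


# Subset of columns copied from the "Absolute latency (ms)" table above.
SAMPLE = """
+------------+----+---------+-----------+----------+
| name | bs | eager | aot_eager | inductor |
+------------+----+---------+-----------+----------+
| resnet18 | 16 | 11.945 | 11.6149 | 6.6263 |
| dcgan | 32 | 2.7116 | 2.6884 | 1.3674 |
| hf_BigBird | 2 | 195.747 | nan | nan |
+------------+----+---------+-----------+----------+
"""

BACKENDS = ["eager", "aot_eager", "inductor"]
rows = parse_ascii_table(SAMPLE)
for row in rows:
    print(row["name"], fastest_backend(row, BACKENDS))
```

Returning `None` for `nan` keeps failed runs out of the comparison instead of letting them win or lose by accident. The speedup tables report failures as `0.0` rather than `nan`, so those would need a separate filter.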

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9986 | 0.9315 | 0.0 | 0.8887 | 1.7959 | 1.8147 |
| GPT2ForSequenceClassification | 4 | 0.9999 | 0.9781 | 0.0 | 0.6943 | 1.7846 | 1.7639 |
| XLNetLMHeadModel | 8 | 0.9988 | 0.9682 | 0.0 | 0.0 | 1.6832 | 1.6787 |
| MT5ForConditionalGeneration | 16 | 1.0201 | 0.9356 | 0.9189 | 1.0437 | 1.6578 | 1.5649 |
| GoogleFnet | 16 | 0.9994 | 0.9987 | 0.0 | 1.5411 | 1.4461 | 1.5578 |
| DistillGPT2 | 16 | 1.0 | 0.9526 | 0.0 | 0.9159 | 1.4391 | 1.4834 |
| ElectraForQuestionAnswering | 64 | 0.9999 | 0.9827 | 0.0 | 1.1928 | 1.4259 | 1.4088 |
| T5ForConditionalGeneration | 4 | 0.9991 | 0.9383 | 0.7256 | 1.1046 | 1.4187 | 1.411 |
| T5Small | 4 | 1.0002 | 0.9357 | 0.7307 | 1.0939 | 1.4149 | 1.4125 |
| ElectraForCausalLM | 32 | 1.0007 | 0.9313 | 0.0 | 1.0176 | 1.4118 | 1.4508 |
| LayoutLMForSequenceClassification | 16 | 0.9998 | 0.9893 | 0.7377 | 1.116 | 1.3111 | 1.2909 |
| BertForQuestionAnswering | 16 | 1.0 | 0.9889 | 0.7335 | 1.1144 | 1.2871 | 1.2649 |
| RobertaForQuestionAnswering | 16 | 1.0001 | 0.9886 | 0.7343 | 1.1157 | 1.2864 | 1.2724 |
| RobertaForCausalLM | 16 | 1.0003 | 0.971 | 0.0 | 1.0565 | 1.2717 | 1.2753 |
| AlbertForQuestionAnswering | 4 | 1.0013 | 1.0025 | 0.0 | 1.2356 | 1.2648 | 1.2604 |
| MobileBertForMaskedLM | 64 | 1.0206 | 0.9332 | 0.8032 | 0.0 | 1.2396 | 1.3308 |
| MegatronBertForQuestionAnswering | 8 | 1.0001 | 0.9919 | 0.0 | 1.0947 | 1.2177 | 1.2039 |
| MegatronBertForCausalLM | 4 | 0.9996 | 0.9833 | 0.726 | 1.0622 | 1.2111 | 1.198 |
| XGLMForCausalLM | 8 | 1.009 | 0.9438 | 0.7404 | 0.3116 | 1.2014 | 1.254 |
| LayoutLMForMaskedLM | 16 | 1.0003 | 0.9698 | 0.0 | 1.0508 | 1.1911 | 1.196 |
| BertForMaskedLM | 16 | 1.0002 | 0.9697 | 0.0 | 1.0613 | 1.1765 | 1.1812 |
| MobileBertForQuestionAnswering | 128 | 1.0249 | 0.9522 | 0.0 | 0.0 | 1.176 | 1.1281 |
| YituTechConvBert | 16 | 1.0 | 0.9669 | 0.0 | 1.0039 | 1.1732 | 1.1729 |
| CamemBert | 16 | 1.0 | 0.969 | 0.0 | 1.0483 | 1.1713 | 1.1757 |
| PLBartForConditionalGeneration | 4 | 0.9999 | 0.9613 | 0.0 | 0.9583 | 1.1664 | 1.1575 |
| DistilBertForQuestionAnswering | 256 | 0.9999 | 1.0004 | 0.0 | 0.787 | 1.1593 | 1.1557 |
| PLBartForCausalLM | 8 | 0.9997 | 0.9483 | 0.0 | 0.9625 | 1.1299 | 1.1872 |
| MBartForConditionalGeneration | 2 | 1.0008 | 0.9876 | 0.0 | 1.024 | 1.0948 | 1.0873 |
| BartForConditionalGeneration | 2 | 1.0006 | 0.9882 | 0.0 | 0.4458 | 1.092 | 1.0851 |
| M2M100ForConditionalGeneration | 16 | 1.0505 | 0.9561 | 0.0 | 0.9667 | 1.0887 | 1.0459 |
| MBartForCausalLM | 4 | 1.0003 | 0.9654 | 0.7543 | 0.9994 | 1.0851 | 1.0937 |
| BartForCausalLM | 4 | 1.0004 | 0.9665 | 0.7547 | 1.002 | 1.0818 | 1.0911 |
| DebertaForMaskedLM | 4 | 0.8791 | 0.7811 | 0.7166 | 0.625 | 1.0646 | 1.0184 |
| DebertaForQuestionAnswering | 8 | 0.9637 | 0.9755 | 0.6835 | 0.8498 | 1.0453 | 1.2183 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0007 | 0.9406 | 0.0 | 0.9514 | 1.036 | 1.0431 |
| PegasusForConditionalGeneration | 32 | 0.9989 | 0.9796 | 0.0 | 0.9805 | 1.0147 | 1.0125 |
| DistilBertForMaskedLM | 128 | 0.9992 | 0.9526 | 0.0 | 0.8018 | 1.0138 | 1.0348 |
| DebertaV2ForMaskedLM | 1 | 0.8576 | 0.7249 | 0.0 | 0.0 | 0.9912 | 0.8262 |
| Speech2Text2ForCausalLM | 256 | 0.9977 | 0.9254 | 0.6517 | 0.938 | 0.9877 | 1.0233 |
| PegasusForCausalLM | 32 | 0.9992 | 0.9545 | 0.7327 | 0.9495 | 0.9701 | 0.9813 |
| BlenderbotSmallForCausalLM | 64 | 1.0013 | 0.9106 | 0.6829 | 0.9182 | 0.9566 | 0.9898 |
| DebertaV2ForQuestionAnswering | 2 | 0.8835 | 0.823 | 0.0 | 0.615 | 0.9218 | 0.8877 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.9566 | 0.0 | 0.9664 | 0.0 | 1.0215 |
| AlbertForMaskedLM | 4 | 1.0005 | 1.0002 | 0.0 | 1.2272 | 0.0 | 1.2543 |
| BlenderbotForCausalLM | 4 | 1.0027 | 0.982 | 0.0 | 0.9476 | 0.0 | 1.0151 |
| AllenaiLongformerBase | 4 | 0.9919 | 0.9482 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
| BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1
| pass | pass | pass | pass | pass | pass | | OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | T5Small | 1 | pass | pass | pass | pass | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | CamemBert | 1 | pass | pass | pass | pass | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | 
PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | pass | | PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | XLNetLMHeadModel | 8 | 4.9133 | 18.528 | nan | nan | 140.3607 | 165.5727 | | DebertaV2ForQuestionAnswering | 2 | 8.5752 | 16.7432 | nan | 93.5133 | 133.3495 | 51.0009 | | DebertaV2ForMaskedLM | 1 | 8.1444 | 16.6538 | nan | nan | 133.2589 | 50.3365 | | DebertaForQuestionAnswering | 8 | 5.3365 | 11.0417 | 34.444 | 68.2005 | 84.1908 | 37.13 | | DebertaForMaskedLM | 4 | 5.1095 | 11.2145 | 33.809 | 70.8747 | 82.4886 | 35.6257 | | XGLMForCausalLM | 8 | 2.8026 | 11.1554 | 21.8596 | 159.3747 | 63.5585 | 61.4488 | | MobileBertForQuestionAnswering | 128 | 8.4091 | 24.8163 | nan | nan | 57.1695 | 53.536 | | MobileBertForMaskedLM | 64 | 8.3886 | 25.2537 | 43.2542 | nan | 54.2294 | 52.429 | | M2M100ForConditionalGeneration | 16 | 3.6757 | 14.363 | nan | 168.4692 | 52.8055 | 52.9221 | | MT5ForConditionalGeneration | 16 | 4.0456 | 12.0199 | 19.0778 | 107.9131 | 51.7139 | 51.4011 | | 
BartForConditionalGeneration | 2 | 3.5101 | 14.0366 | nan | 198.3638 | 45.4712 | 43.2865 | | PegasusForConditionalGeneration | 32 | 3.2457 | 13.6786 | nan | 180.6329 | 45.0319 | 42.2502 | | MBartForConditionalGeneration | 2 | 3.4484 | 14.0277 | nan | 219.8213 | 44.2247 | 42.9868 | | YituTechConvBert | 16 | 2.3618 | 8.9164 | nan | 103.8401 | 38.9364 | 36.8212 | | MegatronBertForQuestionAnswering | 8 | 3.6142 | 11.5704 | nan | 163.3396 | 34.4369 | 32.3922 | | MegatronBertForCausalLM | 4 | 3.4075 | 12.7888 | 18.1665 | 161.6199 | 34.1526 | 34.1658 | | T5ForConditionalGeneration | 4 | 2.5205 | 8.4284 | 12.6567 | 69.4065 | 31.0328 | 30.5372 | | T5Small | 4 | 2.5426 | 8.2327 | 12.5775 | 69.4541 | 31.0241 | 30.2018 | | BlenderbotSmallForConditionalGeneration | 64 | 2.2089 | 9.0573 | nan | 130.9751 | 30.8269 | 29.3206 | | LayoutLMForSequenceClassification | 16 | 2.0584 | 6.6662 | 9.3261 | 77.2303 | 28.4223 | 27.8045 | | PLBartForConditionalGeneration | 4 | 1.7299 | 7.4733 | nan | 103.2542 | 26.9571 | 26.2395 | | ElectraForCausalLM | 32 | 1.7274 | 5.8208 | nan | 80.3266 | 26.8482 | 24.8624 | | GoogleFnet | 16 | 0.9604 | 2.9829 | nan | 44.6825 | 26.0839 | 19.9142 | | PegasusForCausalLM | 32 | 1.3554 | 5.3358 | 8.3169 | 76.8852 | 22.4432 | 21.3552 | | LayoutLMForMaskedLM | 16 | 2.0767 | 6.4448 | nan | 83.0991 | 21.8355 | 20.7364 | | MBartForCausalLM | 4 | 1.3389 | 5.3999 | 7.8492 | 78.8735 | 21.7657 | 21.1573 | | RobertaForCausalLM | 16 | 1.6187 | 5.8948 | nan | 78.0324 | 21.2395 | 19.4797 | | BertForMaskedLM | 16 | 1.6042 | 5.8516 | nan | 78.036 | 20.867 | 20.1706 | | ElectraForQuestionAnswering | 64 | 1.6451 | 5.8363 | nan | 76.7865 | 20.8658 | 20.194 | | BartForCausalLM | 4 | 1.3442 | 5.2385 | 8.0547 | 72.7312 | 20.3218 | 19.5193 | | BertForQuestionAnswering | 16 | 1.6207 | 5.7105 | 8.7352 | 74.5645 | 20.2054 | 19.5519 | | CamemBert | 16 | 1.5831 | 5.7774 | nan | 81.612 | 19.8455 | 19.1469 | | RobertaForQuestionAnswering | 16 | 1.6022 | 5.9769 | 8.5713 | 77.0205 | 19.7102 | 
18.4792 | | OPTForCausalLM | 2 | 1.4093 | 5.4058 | nan | 72.6409 | 18.2135 | 17.0529 | | GPT2ForSequenceClassification | 4 | 1.668 | 5.5672 | nan | 61.9558 | 17.6863 | 16.8422 | | AlbertForQuestionAnswering | 4 | 1.4344 | 5.3077 | nan | 103.9129 | 15.9294 | 15.822 | | BlenderbotSmallForCausalLM | 64 | 0.9099 | 3.7277 | 5.3199 | 53.9813 | 14.3902 | 14.2705 | | Speech2Text2ForCausalLM | 256 | 0.7958 | 2.7837 | 4.5201 | 36.3997 | 14.2945 | 12.8505 | | DistillGPT2 | 16 | 0.8644 | 3.0062 | nan | 34.9172 | 13.8024 | 13.3757 | | PLBartForCausalLM | 8 | 0.7532 | 2.7727 | nan | 43.2618 | 13.367 | 12.7412 | | DistilBertForMaskedLM | 128 | 0.7357 | 3.027 | nan | 43.7429 | 11.5263 | 11.4659 | | DistilBertForQuestionAnswering | 256 | 0.8083 | 2.939 | nan | 41.0508 | 10.6695 | 10.5544 | | BlenderbotForCausalLM | 4 | 2.4764 | 10.2241 | nan | 154.6407 | nan | 38.2494 | | TrOCRForCausalLM | 32 | 1.2769 | 5.2062 | nan | 76.4058 | nan | 19.6307 | | AlbertForMaskedLM | 4 | 1.4496 | 5.4194 | nan | 105.4382 | nan | 15.8385 | | AllenaiLongformerBase | 4 | 6.3525 | 13.9844 | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 1.0 | 0.9092 | nan | 1.1724 | 1.0595 | 1.1588 | | XLNetLMHeadModel | 8 | 1.0 | 0.9323 | nan | nan | 0.9946 | 0.9946 | | GoogleFnet | 16 | 0.9224 | 0.9224 | nan | 1.4614 | 0.9608 | 1.2768 | | PLBartForConditionalGeneration | 4 | 0.9999 | 0.9344 | nan | 1.274 | 0.9316 | 1.2234 | | OPTForCausalLM | 2 | 
1.0001 | 0.9258 | nan | 1.0746 | 0.9068 | 1.1143 | | YituTechConvBert | 16 | 0.9966 | 0.9341 | nan | 0.9891 | 0.894 | 0.9822 | | DistillGPT2 | 16 | 1.0 | 0.8855 | nan | 1.055 | 0.8939 | 1.0108 | | M2M100ForConditionalGeneration | 16 | 0.9896 | 0.9328 | nan | 1.0062 | 0.874 | 1.0131 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | 0.8646 | 1.4307 | | PegasusForConditionalGeneration | 32 | 0.9981 | 0.9529 | nan | 1.1152 | 0.8637 | 1.0262 | | PLBartForCausalLM | 8 | 1.0 | 0.8896 | nan | 1.0988 | 0.8367 | 1.0581 | | XGLMForCausalLM | 8 | 0.9848 | 0.9267 | 0.3971 | 0.9742 | 0.8157 | 0.9642 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8129 | 1.1049 | | T5Small | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8129 | 1.1049 | | ElectraForCausalLM | 32 | 0.9983 | 0.883 | nan | 0.844 | 0.7929 | 0.9036 | | MBartForConditionalGeneration | 2 | 1.0 | 0.8931 | nan | 0.9681 | 0.7896 | 0.9837 | | PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | 0.9964 | 0.7774 | 0.9692 | | MT5ForConditionalGeneration | 16 | 1.0014 | 0.8793 | 0.4387 | 0.9365 | 0.7748 | 0.9324 | | BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.7734 | 0.958 | | MegatronBertForQuestionAnswering | 8 | 1.0 | 0.9223 | nan | 1.0616 | 0.7709 | 1.0379 | | MegatronBertForCausalLM | 4 | 1.0 | 0.9018 | 0.3475 | 0.9999 | 0.7673 | 1.0153 | | MBartForCausalLM | 4 | 1.0 | 0.9122 | 0.3642 | 1.0011 | 0.7326 | 0.9478 | | RobertaForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0274 | | BertForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0273 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.7189 | 1.0294 | | BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.7149 | 0.9466 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.7147 | 0.8647 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 1.1607 | 0.7054 | 1.0297 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 
| 0.8975 | nan | 1.0067 | 0.6977 | 0.946 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | 0.695 | 0.9772 | | BertForMaskedLM | 16 | 1.0 | 0.9408 | nan | 0.9928 | 0.6945 | 0.9772 | | CamemBert | 16 | 1.0 | 0.9388 | nan | 0.987 | 0.6942 | 0.9746 | | RobertaForCausalLM | 16 | 1.0 | 0.9405 | nan | 0.9926 | 0.6942 | 0.9771 | | Speech2Text2ForCausalLM | 256 | 0.9545 | 0.8398 | 0.3515 | 0.9068 | 0.675 | 0.9168 | | DistilBertForQuestionAnswering | 256 | 1.0 | 0.9602 | nan | 1.1897 | 0.6589 | 0.9118 | | DistilBertForMaskedLM | 128 | 1.0 | 0.8847 | nan | 0.8827 | 0.6509 | 0.9194 | | DebertaV2ForMaskedLM | 1 | 1.0 | 0.9651 | nan | nan | 0.5682 | 0.9491 | | MobileBertForMaskedLM | 64 | 1.0 | 0.906 | 0.3175 | nan | 0.4951 | 0.6649 | | DebertaV2ForQuestionAnswering | 2 | 0.9842 | 0.9842 | nan | 0.9842 | 0.4735 | 0.984 | | MobileBertForQuestionAnswering | 128 | 1.0 | 0.9909 | nan | nan | 0.4145 | 0.535 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | 0.9719 | 0.3862 | 1.0347 | | DebertaForQuestionAnswering | 8 | 0.9637 | 1.042 | 0.3072 | 1.1342 | 0.2902 | 1.1339 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | nan | 1.3563 | | TrOCRForCausalLM | 32 | 1.0 | 0.8787 | nan | 0.9998 | nan | 0.9239 | | BlenderbotForCausalLM | 4 | 1.0001 | 0.8057 | nan | 0.8218 | nan | 0.8509 | | AllenaiLongformerBase | 4 | 0.999 | 0.9336 | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForQuestionAnswering | 4 | 381.0621 | 380.7001 | nan 
| 308.3589 | 302.6602 | 303.529 | | XLNetLMHeadModel | 8 | 370.2743 | 385.6607 | nan | nan | 221.4327 | 221.8029 | | PegasusForConditionalGeneration | 32 | 176.0702 | 179.8749 | nan | 179.5462 | 173.9754 | 174.133 | | MegatronBertForQuestionAnswering | 8 | 172.389 | 173.6514 | nan | 157.5663 | 141.8874 | 143.2513 | | BartForConditionalGeneration | 2 | 150.2528 | 151.9465 | nan | 337.206 | 137.5308 | 138.2737 | | MBartForConditionalGeneration | 2 | 150.0773 | 152.1918 | nan | 146.5697 | 137.0879 | 138.0503 | | YituTechConvBert | 16 | 155.7316 | 160.7233 | nan | 154.8109 | 132.8576 | 132.4264 | | MobileBertForQuestionAnswering | 128 | 138.813 | 150.0288 | nan | nan | 126.2709 | 129.5014 | | DistilBertForQuestionAnswering | 256 | 144.4451 | 144.3264 | nan | 183.2155 | 124.8539 | 125.2722 | | MobileBertForMaskedLM | 64 | 147.0804 | 162.6361 | 189.1973 | nan | 122.4293 | 129.8878 | | DistilBertForMaskedLM | 128 | 122.1832 | 128.1681 | nan | 152.3372 | 120.5268 | 118.0147 | | CamemBert | 16 | 135.612 | 139.9899 | nan | 129.6175 | 115.9806 | 115.3196 | | BlenderbotSmallForConditionalGeneration | 64 | 118.9571 | 126.732 | nan | 125.2687 | 115.2782 | 114.2554 | | LayoutLMForMaskedLM | 16 | 136.9831 | 141.2707 | nan | 130.4235 | 115.2269 | 114.5503 | | DebertaV2ForQuestionAnswering | 2 | 128.9744 | 128.0294 | nan | 171.7675 | 114.5363 | 119.2593 | | BertForMaskedLM | 16 | 134.2807 | 138.5652 | nan | 126.6595 | 114.439 | 113.7245 | | BartForCausalLM | 4 | 123.4809 | 127.6346 | 163.8735 | 123.3734 | 114.1614 | 112.9907 | | MBartForCausalLM | 4 | 123.3906 | 127.7123 | 163.8207 | 123.4173 | 113.8281 | 112.6908 | | RobertaForCausalLM | 16 | 142.618 | 146.8223 | nan | 134.9863 | 113.0583 | 111.7282 | | M2M100ForConditionalGeneration | 16 | 118.9566 | 122.3482 | nan | 121.2193 | 107.3355 | 111.9611 | | PLBartForConditionalGeneration | 4 | 121.3271 | 126.4093 | nan | 127.3512 | 104.8957 | 104.4312 | | PLBartForCausalLM | 8 | 118.2104 | 124.9725 | nan | 122.9743 | 102.2395 | 99.7083 
| | OPTForCausalLM | 2 | 169.4412 | 181.9837 | nan | 190.2532 | 93.9089 | 93.1587 | | DebertaV2ForMaskedLM | 1 | 102.2571 | 120.5816 | nan | nan | 89.51 | 106.3586 | | PegasusForCausalLM | 32 | 85.632 | 89.7047 | 116.923 | 90.0688 | 88.4626 | 87.334 | | ElectraForQuestionAnswering | 64 | 124.907 | 128.0433 | nan | 104.6928 | 87.6155 | 88.6404 | | RobertaForQuestionAnswering | 16 | 110.9832 | 112.3912 | 151.3189 | 99.4674 | 86.5785 | 87.3006 | | LayoutLMForSequenceClassification | 16 | 113.2862 | 114.4973 | 153.4511 | 101.5353 | 86.5201 | 87.6427 | | BertForQuestionAnswering | 16 | 110.7169 | 111.7699 | 151.0104 | 99.2236 | 86.0475 | 87.426 | | MegatronBertForCausalLM | 4 | 102.0101 | 103.8882 | 141.1639 | 96.1609 | 84.5642 | 85.3654 | | DistillGPT2 | 16 | 120.662 | 126.7251 | nan | 131.7885 | 83.9215 | 81.396 | | DebertaForQuestionAnswering | 8 | 85.377 | 83.9413 | 120.1883 | 96.3057 | 78.2501 | 67.294 | | ElectraForCausalLM | 32 | 106.0284 | 113.4764 | nan | 104.1134 | 75.0757 | 72.9751 | | T5Small | 4 | 103.8784 | 110.9635 | 144.838 | 95.0095 | 73.7565 | 73.7334 | | T5ForConditionalGeneration | 4 | 104.0685 | 111.3613 | 143.8528 | 94.3038 | 73.7056 | 73.657 | | GoogleFnet | 16 | 101.6946 | 101.8389 | nan | 66.0199 | 70.3726 | 65.2825 | | XGLMForCausalLM | 8 | 81.5566 | 85.3631 | 108.825 | 259.4441 | 70.1131 | 74.2184 | | BlenderbotSmallForCausalLM | 64 | 64.3283 | 71.0869 | 94.7042 | 70.6318 | 67.7678 | 65.7482 | | Speech2Text2ForCausalLM | 256 | 64.0117 | 69.0208 | 98.0895 | 68.0707 | 64.9593 | 62.4312 | | MT5ForConditionalGeneration | 16 | 95.4964 | 95.923 | 97.4354 | 85.9538 | 58.2288 | 63.0262 | | GPT2ForSequenceClassification | 4 | 102.1108 | 104.5619 | nan | 147.3611 | 57.8115 | 57.8568 | | DebertaForMaskedLM | 4 | 68.2283 | 76.7415 | 83.8754 | 95.6909 | 56.2423 | 58.4202 | | AlbertForMaskedLM | 4 | 383.9167 | 384.2168 | nan | 312.7285 | nan | 307.0336 | | TrOCRForCausalLM | 32 | 166.3862 | 174.5364 | nan | 172.9165 | nan | 162.9686 | | 
BlenderbotForCausalLM | 4 | 92.9056 | 94.8711 | nan | 98.4173 | nan | 92.1339 | | AllenaiLongformerBase | 4 | 250.3272 | 262.2739 | nan | nan | nan | nan | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | ghostnet_100 | 128 | 0.999 | 0.9731 | 0.8238 | 1.2975 | 1.864 | 1.826 | | lcnet_050 | 128 | 0.9567 | 0.9499 | 0.7656 | 1.3559 | 1.6511 | 1.6282 | | regnety_002 | 128 | 0.9788 | 1.0007 | 0.941 | 0.9624 | 1.5068 | 1.3168 | | hrnet_w18 | 128 | 0.9998 | 0.9982 | 0.0 | 1.2803 | 1.4188 | 1.3686 | | dla102 | 128 | 0.9999 | 1.0004 | 0.0 | 1.2832 | 1.3829 | 1.3695 | | volo_d1_224 | 64 | 0.9998 | 0.9957 | 0.8024 | 0.0 | 1.3805 | 1.3599 | | res2net50_14w_8s | 128 | 0.9998 | 0.9994 | 0.0 | 1.25 | 1.3566 | 1.3242 | | coat_lite_mini | 128 | 0.9998 | 1.0 | 0.8461 | 1.0934 | 1.3522 | 1.354 | | xcit_large_24_p8_224 | 5 | 1.0024 | 0.9958 | 0.794 | 0.0 | 1.3403 | 1.2985 | | mobilenetv3_large_100 | 128 | 0.9655 | 0.9624 | 0.7518 | 1.2561 | 1.3345 | 1.3488 | | inception_v3 | 128 | 1.0 | 0.9985 | 0.0 | 1.1257 | 1.3282 | 1.3091 | | gluon_inception_v3 | 128 | 1.0 | 0.9985 | 0.0 | 1.1287 | 1.328 | 1.3083 | | adv_inception_v3 | 128 | 0.9999 | 0.9988 | 0.0 | 1.1293 | 1.3265 | 1.3076 | | mobilenetv2_100 | 128 | 0.9656 | 0.9643 | 0.7054 | 1.2842 | 1.3263 | 1.352 | | crossvit_9_240 | 128 | 0.9997 | 0.9976 | 0.7589 | 1.0373 | 1.3243 | 1.3003 | | resnest101e | 64 | 0.9997 | 1.003 | 0.0 | 1.1688 | 1.3138 | 1.2695 | | res2next50 | 128 | 0.9998 | 1.0008 | 0.0 | 1.1813 | 1.3104 | 1.2741 | | fbnetv3_b | 128 | 0.9639 | 0.9621 | 0.7601 | 1.2392 | 1.2815 | 1.2973 | | gmixer_24_224 | 128 | 1.0 | 0.8345 | 0.0 | 1.0824 | 1.2773 | 1.2637 | | botnet26t_256 | 128 | 0.9855 | 0.9838 | 0.7891 | 0.0 | 1.2764 | 1.2731 | | eca_botnext26ts_256 | 128 | 0.9865 | 0.7725 | 0.0 | 0.0 | 1.2723 | 1.25 | | sebotnet33ts_256 | 64 | 
0.976 | 0.8073 | 0.0 | 0.0 | 1.2702 | 1.2759 | | mnasnet_100 | 128 | 0.9669 | 0.9613 | 0.7844 | 1.2547 | 1.2683 | 1.2843 | | selecsls42b | 128 | 0.9998 | 0.9988 | 0.8153 | 1.2152 | 1.2676 | 1.2533 | | eca_halonext26ts | 128 | 0.9871 | 0.7788 | 0.0 | 0.0 | 1.262 | 1.2471 | | tf_efficientnet_b0 | 128 | 0.9771 | 0.7838 | 0.0 | 1.163 | 1.2604 | 1.2677 | | fbnetc_100 | 128 | 0.9657 | 0.9625 | 0.7891 | 1.2458 | 1.2456 | 1.2599 | | jx_nest_base | 32 | 0.9997 | 0.9946 | 0.7347 | 0.0 | 1.2382 | 1.2247 | | spnasnet_100 | 128 | 0.9613 | 0.9573 | 0.7722 | 1.2207 | 1.2359 | 1.2558 | | ese_vovnet19b_dw | 128 | 0.9788 | 0.9777 | 0.7445 | 1.1489 | 1.2358 | 1.2468 | | cspdarknet53 | 64 | 0.9579 | 0.9487 | 0.7353 | 1.1737 | 1.2351 | 1.2463 | | res2net101_26w_4s | 64 | 0.9997 | 0.9962 | 0.7711 | 1.0951 | 1.2293 | 1.1861 | | convit_base | 64 | 0.9996 | 0.9988 | 0.0 | 0.0 | 1.2291 | 1.2255 | | gmlp_s16_224 | 128 | 0.9998 | 0.9992 | 0.0 | 1.0939 | 1.2138 | 1.2043 | | rexnet_100 | 128 | 0.9727 | 0.817 | 0.0 | 1.1624 | 1.213 | 1.2184 | | cait_m36_384 | 4 | 0.9997 | 0.9969 | 0.0 | 0.0 | 1.2118 | 1.1918 | | pnasnet5large | 16 | 0.9995 | 0.9979 | 0.0 | 1.0881 | 1.2082 | 1.1903 | | tinynet_a | 128 | 0.965 | 0.7751 | 0.6112 | 1.1478 | 1.1893 | 1.2001 | | dpn107 | 32 | 0.9577 | 0.9512 | 0.7787 | 1.0255 | 1.1879 | 1.2 | | tf_mixnet_l | 128 | 0.9851 | 0.8897 | 0.0 | 1.0921 | 1.1876 | 1.1866 | | pit_b_224 | 64 | 1.0002 | 0.9994 | 0.0 | 1.032 | 1.1866 | 1.1753 | | dm_nfnet_f0 | 128 | 0.999 | 0.9996 | 0.0 | 1.1298 | 1.1865 | 1.1573 | | twins_pcpvt_base | 64 | 0.9999 | 0.9967 | 0.7479 | 0.0 | 1.1823 | 1.1446 | | mobilevit_s | 64 | 0.9797 | 0.7608 | 0.0 | 0.0 | 1.174 | 1.1674 | | mixnet_l | 128 | 0.9849 | 0.8858 | 0.0 | 1.0989 | 1.1739 | 1.1739 | | poolformer_m36 | 64 | 0.9998 | 0.9995 | 0.0 | 0.0 | 1.1678 | 1.1499 | | repvgg_a2 | 128 | 0.9642 | 0.9629 | 0.8257 | 1.1364 | 1.1643 | 1.1687 | | nfnet_l0 | 128 | 1.0001 | 0.7886 | 0.0 | 1.1053 | 1.1407 | 1.1163 | | swin_base_patch4_window7_224 | 64 | 
0.9999 | 0.9782 | 0.0 | 0.0 | 1.1261 | 1.1174 | | beit_base_patch16_224 | 64 | 0.9997 | 0.9819 | 0.0 | 0.0 | 1.1109 | 1.1012 | | swsl_resnext101_32x16d | 32 | 1.0 | 0.998 | 0.0 | 1.1088 | 1.1074 | 1.0718 | | deit_base_distilled_patch16_224 | 64 | 0.9999 | 0.9976 | 0.768 | 0.9809 | 1.0966 | 1.0825 | | gluon_xception65 | 32 | 0.9998 | 0.9977 | 0.0 | 1.0806 | 1.0875 | 1.0738 | | vit_base_patch16_224 | 64 | 0.9996 | 0.9989 | 0.7675 | 0.9509 | 1.0868 | 1.0725 | | gernet_l | 128 | 0.9745 | 0.9734 | 0.8235 | 1.0996 | 1.0762 | 1.0701 | | convmixer_768_32 | 32 | 0.9998 | 0.9997 | 0.0 | 0.0 | 1.0759 | 1.0733 | | mixer_b16_224 | 128 | 0.9999 | 1.0003 | 0.0 | 0.8936 | 1.0713 | 1.0671 | | convnext_base | 64 | 0.9999 | 0.9986 | 0.0 | 0.0 | 1.0504 | 1.0425 | | visformer_small | 128 | 0.9995 | 1.0028 | 0.7971 | 0.0 | 1.0419 | 1.0097 | | resmlp_12_224 | 128 | 0.9998 | 1.0007 | 0.693 | 1.2118 | 0.9662 | 0.9574 | | tnt_s_patch16_224 | 128 | 0.9997 | 0.9996 | 0.0 | 0.0 | 0.0 | 1.5101 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+-----------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+-----------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | 
pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | 
cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+-----------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | 
inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 6.2683 | 26.9601 | nan | 579.0035 | 111.7642 | 107.5845 | | xcit_large_24_p8_224 | 5 | 3.1614 | 15.4606 | 28.0716 | nan | 80.8469 | 78.6596 | | twins_pcpvt_base | 64 | 2.3856 | 11.7287 | 19.6587 | nan | 80.1243 | 79.2982 | | swin_base_patch4_window7_224 | 64 | 2.9264 | 11.5604 | nan | nan | 78.2376 | 76.9332 | | mobilevit_s | 64 | 1.8328 | 6.8901 | nan | nan | 76.1345 | 75.9391 | | pnasnet5large | 16 | 4.8208 | 19.5122 | nan | 312.7596 | 73.1293 | 69.7013 | | cait_m36_384 | 4 | 3.0343 | 16.4133 | nan | nan | 63.6422 | 61.9082 | | dm_nfnet_f0 | 128 | 2.4259 | 6.7671 | nan | 145.676 | 61.1531 | 60.7601 | | resnest101e | 64 | 3.4159 | 14.103 | nan | 256.8959 | 59.4706 | 55.9626 | | coat_lite_mini | 128 | 1.1014 | 4.4788 | 6.8209 | 94.2328 | 58.4709 | 57.2774 | | jx_nest_base | 32 | 1.866 | 8.3707 | 13.5289 | nan | 54.4776 | 51.9212 | | res2net101_26w_4s | 64 | 3.1178 | 14.821 | 24.5046 | 226.0595 | 53.7823 | 51.0539 | | eca_halonext26ts | 128 | 1.6077 | 5.3327 | nan | nan | 51.0663 | 49.902 | | res2net50_14w_8s | 128 | 2.7183 | 13.37 | nan | 241.8898 | 50.6823 | 47.3728 | | poolformer_m36 | 64 | 1.7112 | 6.8685 | nan | nan | 49.4674 | 46.4426 | | nfnet_l0 | 128 | 1.9786 | 6.6595 | nan | 124.626 | 47.6526 | 47.0715 | | convnext_base | 64 | 1.4474 | 6.311 | nan | nan | 45.2544 | 44.4096 | | dpn107 | 32 | 4.1871 | 12.8309 | 33.1581 | 164.2582 | 42.2358 | 40.7844 | | sebotnet33ts_256 | 64 | 1.8083 | 5.7849 | nan | nan | 41.101 | 40.2023 | | gmlp_s16_224 | 128 | 1.0545 | 5.8571 | nan | 135.1055 | 41.0194 | 38.713 | | volo_d1_224 | 64 | 1.3339 | 6.8002 | 10.4039 | nan | 37.6356 | 35.6949 | | crossvit_9_240 | 128 | 1.5022 | 7.453 | 10.9489 | 149.1611 | 37.1214 | 36.2455 | | fbnetv3_b | 128 | 3.329 | 10.4089 | 25.7749 | 212.284 | 36.2751 | 34.8795 | | gluon_xception65 | 32 | 1.9365 | 
9.6297 | nan | 149.6763 | 35.9795 | 34.4707 | | eca_botnext26ts_256 | 128 | 1.624 | 4.8632 | nan | nan | 34.3314 | 33.8889 | | ghostnet_100 | 128 | 2.9347 | 8.9369 | 13.4829 | 156.4558 | 34.1391 | 31.3817 | | gluon_inception_v3 | 128 | 1.6436 | 7.5665 | nan | 140.917 | 32.8972 | 31.9926 | | adv_inception_v3 | 128 | 1.8156 | 8.1552 | nan | 138.4923 | 32.8858 | 32.2781 | | inception_v3 | 128 | 1.6504 | 7.6764 | nan | 141.2337 | 32.5814 | 31.8517 | | tf_mixnet_l | 128 | 3.6565 | 9.8979 | nan | 149.9258 | 31.4848 | 30.3593 | | gmixer_24_224 | 128 | 1.1657 | 6.7255 | nan | 123.3353 | 31.2934 | 29.6624 | | dla102 | 128 | 1.8584 | 8.4464 | nan | 181.3092 | 30.9555 | 30.1857 | | swsl_resnext101_32x16d | 32 | 1.8026 | 8.0942 | nan | 122.4968 | 30.8154 | 29.1369 | | botnet26t_256 | 128 | 1.5283 | 4.2934 | 8.525 | nan | 30.7814 | 30.4834 | | mixnet_l | 128 | 3.2278 | 9.584 | nan | 152.1047 | 29.9884 | 28.5336 | | convit_base | 64 | 1.0419 | 5.0956 | nan | nan | 28.5307 | 28.1692 | | res2next50 | 128 | 1.5817 | 7.3202 | nan | 153.0636 | 28.4967 | 27.0097 | | rexnet_100 | 128 | 2.1039 | 6.7524 | nan | 143.3873 | 27.072 | 25.5942 | | tinynet_a | 128 | 2.157 | 7.6812 | 17.7048 | 140.3354 | 26.5694 | 24.8459 | | cspdarknet53 | 64 | 2.4497 | 7.1477 | 16.5508 | 122.0615 | 23.2021 | 22.2239 | | tf_efficientnet_b0 | 128 | 1.9532 | 6.2471 | nan | 130.0669 | 23.0042 | 22.0777 | | convmixer_768_32 | 32 | 1.1698 | 5.3942 | nan | nan | 22.5838 | 21.889 | | mixer_b16_224 | 128 | 0.5692 | 2.7645 | nan | 68.0774 | 22.5721 | 21.3314 | | fbnetc_100 | 128 | 2.1252 | 6.3376 | 15.315 | 109.0254 | 22.2202 | 21.6629 | | spnasnet_100 | 128 | 2.1843 | 5.977 | 14.9016 | 108.704 | 22.0857 | 21.3022 | | visformer_small | 128 | 1.0145 | 3.7715 | 5.8684 | nan | 21.9038 | 21.651 | | resmlp_12_224 | 128 | 0.6772 | 2.496 | 4.3912 | 33.2005 | 21.8384 | 20.6889 | | beit_base_patch16_224 | 64 | 1.3046 | 5.3753 | nan | nan | 21.0007 | 19.6698 | | pit_b_224 | 64 | 1.0574 | 4.3352 | nan | 92.8815 | 20.622 | 20.2853 
| | mobilenetv3_large_100 | 128 | 1.6362 | 5.1464 | 12.2332 | 119.7278 | 20.6072 | 19.6927 | | deit_base_distilled_patch16_224 | 64 | 0.9602 | 3.9283 | 6.063 | 73.3675 | 20.0424 | 19.836 | | mobilenetv2_100 | 128 | 1.8215 | 5.0965 | 12.459 | 96.0753 | 19.8006 | 18.4573 | | repvgg_a2 | 128 | 2.1323 | 5.6922 | 13.9472 | 154.4386 | 19.6926 | 17.8387 | | vit_base_patch16_224 | 64 | 0.925 | 3.8644 | 5.9622 | 72.6839 | 19.3793 | 18.6428 | | mnasnet_100 | 128 | 1.6891 | 4.9789 | 11.7105 | 89.6359 | 18.9961 | 18.0795 | | gernet_l | 128 | 2.134 | 5.6772 | 13.5891 | 90.8813 | 18.2191 | 18.0568 | | regnety_002 | 128 | 1.6591 | 4.998 | 11.4839 | 93.4756 | 18.1425 | 17.3151 | | selecsls42b | 128 | 0.8558 | 3.3434 | 5.1382 | 76.2455 | 15.8904 | 15.5108 | | lcnet_050 | 128 | 1.0892 | 3.1678 | 6.9717 | 65.8407 | 13.4306 | 12.8914 | | ese_vovnet19b_dw | 128 | 1.037 | 2.7949 | 6.0759 | 55.5317 | 12.9398 | 12.1509 | | tnt_s_patch16_224 | 128 | 1.7277 | 9.0841 | nan | nan | nan | 34.344 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 1.6177 | 1.5612 | 1.6333 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.9898 | 1.351 | 1.5843 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2619 | 1.4738 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.2059 | 1.3819 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1917 | 1.1773 | 1.3424 | | mobilenetv2_100 | 128 | 
0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1752 | 1.2828 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | nan | 1.1377 | 1.2737 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1376 | 1.253 | | nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.7757 | 1.1264 | 1.3578 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1133 | 1.1802 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.0689 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 1.2304 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.9927 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 1.2337 | 0.9925 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3132 | nan | 0.9882 | 1.0887 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 0.9853 | 1.1265 | | mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 1.4726 | 0.985 | 1.0539 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9837 | 1.0658 | | gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 1.0153 | 0.9766 | 0.9828 | | tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9765 | 1.1445 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | 0.3296 | nan | 0.9653 | 1.0595 | | dla102 | 128 | 0.9831 | 0.917 | nan | 0.953 | 0.9633 | 1.0419 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 0.952 | 1.0925 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9468 | 1.1098 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.7593 | 0.9435 | 1.0967 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.942 | 0.988 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9408 | 1.0412 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9382 | 0.993 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8647 | 
0.938 | 1.0123 | | jx_nest_base | 32 | 1.0002 | 0.8966 | 0.2863 | nan | 0.9348 | 1.0603 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9325 | 0.9919 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9285 | 1.0154 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.9152 | 0.9655 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0636 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9078 | 1.0156 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9069 | 1.0515 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9065 | 1.0615 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.894 | 0.9057 | 0.9838 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9049 | 0.9968 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.994 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.899 | 1.0046 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | 0.9946 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8821 | 1.0206 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8617 | 1.0396 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8525 | 1.0752 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.8485 | 1.0335 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.8189 | 0.9416 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | 1.3763 | 0.8169 | 0.8253 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.9856 | 0.8154 | 1.0235 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9926 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6571 | 0.7684 | 0.9902 | | convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.7449 | 0.9008 | | crossvit_9_240 | 128 | 
0.9884 | 0.8657 | 0.282 | 1.1222 | 0.6745 | 0.9137 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8633 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 365.1303 | 364.9603 | nan | nan | 339.6241 | 339.9769 | | hrnet_w18 | 128 | 416.6414 | 417.0749 | nan | 325.5794 | 293.4123 | 304.6642 | | convnext_base | 64 | 264.3881 | 264.2965 | nan | nan | 251.4371 | 253.2752 | | pnasnet5large | 16 | 289.5484 | 290.2584 | nan | 266.1299 | 239.6638 | 242.836 | | tf_mixnet_l | 128 | 257.1195 | 284.6568 | nan | 231.8593 | 213.3293 | 213.4241 | | swin_base_patch4_window7_224 | 64 | 237.3151 | 242.5659 | nan | nan | 210.8349 | 212.2517 | | mixnet_l | 128 | 247.4171 | 274.9882 | nan | 221.8192 | 207.536 | 207.563 | | swsl_resnext101_32x16d | 32 | 220.2486 | 220.0858 | nan | 198.729 | 198.7487 | 205.7158 | | dla102 | 128 | 269.7674 | 269.521 | nan | 210.0277 | 194.8993 | 196.8271 | | cait_m36_384 | 4 | 217.2077 | 217.5973 | nan | nan | 179.0249 | 182.1862 | | resnest101e | 64 | 231.2814 | 229.7217 | nan | 197.7077 | 175.5583 | 181.9622 | | dm_nfnet_f0 | 128 | 206.66 | 206.1479 | nan | 182.2271 | 173.6151 | 178.3225 | | adv_inception_v3 | 128 | 226.6913 | 227.1105 | nan | 200.7131 | 170.941 | 173.4838 | | gluon_inception_v3 | 128 | 226.805 | 227.2904 | nan | 200.9767 | 170.9397 | 173.481 | | inception_v3 | 128 | 226.6725 | 227.4098 | nan | 201.5024 | 170.8864 | 173.0719 | | res2net50_14w_8s | 128 | 229.7531 | 229.7416 | nan | 183.9076 | 169.3487 | 173.1945 
| | gluon_xception65 | 32 | 182.7965 | 182.9972 | nan | 168.8328 | 168.0438 | 169.9006 | | convit_base | 64 | 196.6777 | 196.8292 | nan | nan | 160.1783 | 160.4924 | | res2next50 | 128 | 207.0426 | 206.7881 | nan | 175.3308 | 157.8492 | 162.4265 | | dpn107 | 32 | 191.2928 | 192.6421 | 235.2287 | 178.647 | 154.4195 | 152.5006 | | nfnet_l0 | 128 | 176.1074 | 223.0219 | nan | 159.1106 | 154.1959 | 157.925 | | gernet_l | 128 | 165.3186 | 165.5466 | 193.6199 | 146.4025 | 149.7248 | 150.1512 | | poolformer_m36 | 64 | 174.8947 | 174.905 | nan | nan | 149.561 | 151.7432 | | mixer_b16_224 | 128 | 158.6156 | 158.5951 | nan | 177.3443 | 148.4551 | 148.7484 | | coat_lite_mini | 128 | 191.8001 | 192.017 | 226.9607 | 175.8233 | 141.9534 | 141.6216 | | pit_b_224 | 64 | 158.6372 | 158.7146 | nan | 153.5259 | 133.7019 | 134.9164 | | eca_halonext26ts | 128 | 169.6009 | 215.0092 | nan | nan | 132.6744 | 134.304 | | eca_botnext26ts_256 | 128 | 163.5103 | 208.5904 | nan | nan | 126.8163 | 129.1704 | | gmlp_s16_224 | 128 | 152.4353 | 152.4772 | nan | 139.2103 | 125.5889 | 126.5656 | | res2net101_26w_4s | 64 | 151.8672 | 152.0562 | 197.1798 | 138.4531 | 123.743 | 128.0206 | | visformer_small | 128 | 128.3211 | 128.0332 | 161.1383 | nan | 123.1583 | 127.1352 | | fbnetv3_b | 128 | 163.0223 | 163.1596 | 206.8387 | 126.7841 | 122.4473 | 120.9354 | | botnet26t_256 | 128 | 152.3141 | 152.6978 | 190.652 | nan | 117.6053 | 117.9056 | | twins_pcpvt_base | 64 | 137.2127 | 137.5325 | 183.6672 | nan | 116.4314 | 119.7911 | | beit_base_patch16_224 | 64 | 128.6832 | 130.9169 | nan | nan | 116.0603 | 116.8948 | | gmixer_24_224 | 128 | 146.4594 | 175.6919 | nan | 135.4565 | 114.837 | 116.038 | | volo_d1_224 | 64 | 153.5804 | 154.3027 | 191.586 | nan | 111.5635 | 112.9781 | | vit_base_patch16_224 | 64 | 119.2789 | 119.2739 | 155.4273 | 125.4143 | 110.1428 | 111.4774 | | deit_base_distilled_patch16_224 | 64 | 120.2167 | 120.2838 | 156.4202 | 122.2946 | 109.9789 | 111.1764 | | repvgg_a2 | 128 | 127.195 | 
127.2904 | 146.7974 | 107.9718 | 105.4384 | 105.0461 | | tf_efficientnet_b0 | 128 | 134.1169 | 167.3006 | nan | 112.6721 | 103.9003 | 103.2745 | | xcit_large_24_p8_224 | 5 | 143.8072 | 137.7382 | 173.5692 | nan | 102.1287 | 105.0771 | | cspdarknet53 | 64 | 130.3963 | 131.836 | 170.0396 | 106.687 | 101.1089 | 100.2578 | | jx_nest_base | 32 | 121.5669 | 121.8681 | 165.2221 | nan | 98.1567 | 98.9035 | | mobilevit_s | 64 | 117.1886 | 151.0731 | nan | nan | 97.622 | 98.3894 | | fbnetc_100 | 128 | 123.5892 | 123.9363 | 151.3575 | 95.8447 | 95.8537 | 94.6847 | | rexnet_100 | 128 | 119.5406 | 142.3627 | nan | 99.8386 | 95.781 | 95.3363 | | tinynet_a | 128 | 110.2282 | 137.1885 | 174.0617 | 92.6913 | 89.3978 | 88.5731 | | sebotnet33ts_256 | 64 | 114.8106 | 138.6134 | nan | nan | 87.9148 | 87.6185 | | spnasnet_100 | 128 | 106.1148 | 106.4514 | 132.2379 | 83.5446 | 82.5028 | 81.1533 | | ese_vovnet19b_dw | 128 | 99.7564 | 99.9162 | 131.2749 | 84.949 | 78.9242 | 78.2463 | | mnasnet_100 | 128 | 98.793 | 99.4038 | 121.8091 | 76.0604 | 75.3529 | 74.3362 | | crossvit_9_240 | 128 | 98.5614 | 98.5277 | 129.8604 | 94.848 | 74.6124 | 75.7136 | | resmlp_12_224 | 128 | 71.2857 | 71.2428 | 102.8526 | 58.7635 | 73.8413 | 74.5376 | | mobilenetv2_100 | 128 | 97.8844 | 98.1522 | 134.2159 | 73.6031 | 71.2521 | 69.966 | | selecsls42b | 128 | 89.6861 | 89.8367 | 109.9625 | 73.883 | 70.7139 | 71.5361 | | mobilenetv3_large_100 | 128 | 85.715 | 85.9328 | 110.4215 | 65.8274 | 62.0091 | 61.3764 | | ghostnet_100 | 128 | 115.036 | 118.0784 | 139.8326 | 88.4444 | 61.7509 | 62.8577 | | regnety_002 | 128 | 52.943 | 51.7039 | 61.1441 | 54.7368 | 35.1244 | 41.1028 | | lcnet_050 | 128 | 38.3741 | 38.6366 | 48.0503 | 27.0833 | 22.2135 | 22.5235 | | tnt_s_patch16_224 | 128 | 471.0077 | 470.8215 | nan | nan | nan | 311.8163 | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/2lCfOCA.png)
bench_logs/huggingface_float32.png : ![](https://i.imgur.com/eARN5Jm.png)
bench_logs/torchbench_float32.png : ![](https://i.imgur.com/eCv0rh1.png)

anijain2305 commented 1 year ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
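As a minimal sketch of that filtering step, models that fail the accuracy check can be dropped before any speedup is computed, with speedup taken as native PyTorch latency divided by backend latency. The record layout, the field names, and the choice to count `pass_due_to_skip` as passing are assumptions made for illustration, not the dashboard's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    accuracy: str       # status string such as "pass", "fail_accuracy", "fail_to_run"
    eager_ms: float     # native PyTorch latency (hypothetical field)
    compiled_ms: float  # backend latency (hypothetical field)


# Assumption: skipped accuracy checks are treated as passing.
PASSING = {"pass", "pass_due_to_skip"}


def speedups(results):
    """Return {model: speedup} for models that pass the accuracy check."""
    return {
        r.name: r.eager_ms / r.compiled_ms
        for r in results
        if r.accuracy in PASSING and r.compiled_ms > 0
    }


# Illustrative numbers only, not taken from the report.
example = [
    Result("model_a", "pass", 100.0, 50.0),
    Result("model_b", "fail_accuracy", 100.0, 10.0),
    Result("model_c", "pass_due_to_skip", 30.0, 60.0),
]
print(speedups(example))
```

Here `model_b` is excluded even though it is the fastest, because a speedup on a numerically incorrect model is not meaningful.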

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 53/54 | 100%, 45/45 | 100%, 61/61 |
|       aot_eager        | 96%, 52/54 | 98%, 44/45  | 98%, 60/61  |
|     aot_cudagraphs     | 81%, 44/54 | 76%, 34/45  | 92%, 56/61  |
|    nvprims_nvfuser     | 56%, 30/54 |  7%, 3/45   | 54%, 33/61  |
|        inductor        | 81%, 44/54 | 87%, 39/45  | 89%, 54/61  |
| inductor_no_cudagraphs | 87%, 47/54 | 91%, 41/45  | 89%, 54/61  |
+------------------------+------------+-------------+-------------+
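Each cell in the pass rate table pairs a rounded percentage with the raw count of passing models. A small sketch of how such a cell could be formatted is shown below; the helper name is hypothetical and the rounding rule (nearest whole percent) is inferred from the table values rather than from the dashboard code.

```python
def pass_rate(passed: int, total: int) -> str:
    """Format a pass rate cell as '<percent>%, <passed>/<total>'."""
    return f"{passed / total:.0%}, {passed}/{total}"


# Values taken from the table above.
print(pass_rate(44, 54))  # inductor on torchbench
print(pass_rate(3, 45))   # nvprims_nvfuser on huggingface
```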

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.00x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.24x    |    1.01x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.09x    |    1.09x    |
|        inductor        |   1.67x    |    1.62x    |    1.19x    |
| inductor_no_cudagraphs |   1.28x    |    1.53x    |    1.16x    |
+------------------------+------------+-------------+-------------+
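The geometric mean is the exponential of the mean of the log speedups, so a single 0.0 entry (which this report uses to mark a failed performance run) would collapse the result to zero. The sketch below skips non-positive and NaN entries before averaging; whether the dashboard excludes them in exactly this way is an assumption.

```python
import math


def geomean_speedup(values):
    """Geometric mean of positive speedups; 0.0 and NaN entries are skipped."""
    valid = [v for v in values if v > 0 and not math.isnan(v)]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(v) for v in valid) / len(valid))


# Illustrative numbers only: the 0.0 entry is ignored.
print(geomean_speedup([2.0, 8.0, 0.0]))
```

Using the geometric rather than the arithmetic mean keeps one very large speedup from dominating the suite-level number.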

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.94    |    3.32     |    2.25     |
|       aot_eager        |    7.13    |    11.72    |    9.33     |
|     aot_cudagraphs     |   10.71    |    20.92    |    16.35    |
|    nvprims_nvfuser     |   60.33    |    96.80    |   141.90    |
|        inductor        |   75.58    |    43.75    |    76.79    |
| inductor_no_cudagraphs |   72.08    |    36.82    |    74.83    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.84x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.41x    |    0.37x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.07x    |    0.87x    |
|        inductor        |   0.78x    |    0.92x    |    0.88x    |
| inductor_no_cudagraphs |   0.92x    |    1.07x    |    1.03x    |
+------------------------+------------+-------------+-------------+
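One way to read "higher is better" is as the ratio of native PyTorch peak memory to backend peak memory, so that values below 1.0x mean the backend uses more memory than eager. The sketch below encodes that reading; the definition and the function name are assumptions, and the inputs could come from something like `torch.cuda.max_memory_allocated()` measured after each run.

```python
def compression_ratio(eager_peak_bytes: float, compiled_peak_bytes: float) -> float:
    """Assumed definition: eager peak memory divided by backend peak memory."""
    return eager_peak_bytes / compiled_peak_bytes


# Illustrative numbers only: the backend peaks at 2.5 GB against 1 GB in eager.
print(compression_ratio(1.0e9, 2.5e9))
```

Under this reading, the 0.41x reported for aot_cudagraphs on torchbench would correspond to roughly 2.4x the eager peak memory.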

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the two most recent reports that actually run the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Previous report name: /data/home/anijain/cluster/cron_logs/day_323_19_11_22_performance_amp_108

Passrate diff

~~~
+------------------------+-------------+------------+------------+
| compiler               | suite       | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
| inductor               | torchbench  | 81%, 44/54 | 81%, 44/54 |
| inductor               | huggingface | 83%, 38/46 | 87%, 39/45 |
| inductor               | timm_models | 89%, 54/61 | 89%, 54/61 |
| inductor_no_cudagraphs | torchbench  | 87%, 47/54 | 85%, 46/54 |
| inductor_no_cudagraphs | huggingface | 85%, 39/46 | 91%, 41/45 |
| inductor_no_cudagraphs | timm_models | 89%, 54/61 | 89%, 54/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff

~~~
+------------------------+-------------+------------+-----------+
| compiler               | suite       | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
| inductor               | torchbench  | 1.64x      | 1.66x     |
| inductor               | huggingface | 1.64x      | 1.62x     |
| inductor               | timm_models | 1.17x      | 1.17x     |
| inductor_no_cudagraphs | torchbench  | 1.28x      | 1.29x     |
| inductor_no_cudagraphs | huggingface | 1.56x      | 1.54x     |
| inductor_no_cudagraphs | timm_models | 1.15x      | 1.15x     |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:

- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings

~~~
+-------------+--------------------------------+---------------+------------------------+
| suite       | name                           | inductor      | inductor_no_cudagraphs |
+-------------+--------------------------------+---------------+------------------------+
| torchbench  | tacotron2                      | fail_to_run   | pass                   |
| torchbench  | functorch_dp_cifar10           | fail_to_run   | fail_to_run            |
| torchbench  | hf_Longformer                  | fail_to_run   | fail_to_run            |
| torchbench  | timm_efficientdet              | fail_to_run   | fail_to_run            |
| torchbench  | hf_BigBird                     | fail_to_run   | fail_to_run            |
| torchbench  | moco                           | fail_to_run   | fail_to_run            |
| torchbench  | vision_maskrcnn                | fail_to_run   | 0.0000                 |
| torchbench  | mobilenet_v3_large             | fail_accuracy | fail_accuracy          |
| torchbench  | tts_angular                    | fail_accuracy | fail_accuracy          |
| huggingface | DebertaV2ForQuestionAnswering  | fail_to_run   | pass                   |
| huggingface | AllenaiLongformerBase          | fail_to_run   | fail_to_run            |
| huggingface | PLBartForConditionalGeneration | fail_to_run   | fail_to_run            |
| huggingface | YituTechConvBert               | fail_to_run   | fail_to_run            |
| huggingface | MBartForConditionalGeneration  | fail_to_run   | fail_to_run            |
| timm_models | eca_halonext26ts               | fail_to_run   | fail_accuracy          |
| timm_models | ese_vovnet19b_dw               | fail_accuracy | fail_accuracy          |
| timm_models | ghostnet_100                   | fail_accuracy | fail_accuracy          |
| timm_models | gluon_xception65               | fail_accuracy | fail_accuracy          |
| timm_models | resnest101e                    | fail_accuracy | fail_accuracy          |
| timm_models | hrnet_w18                      | fail_accuracy | fail_accuracy          |
| timm_models | spnasnet_100                   | fail_accuracy | fail_accuracy          |
+-------------+--------------------------------+---------------+------------------------+
~~~

Performance speedup warnings

~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | timm_vovnet                   | 1.1115   | 0.9018                 |
| torchbench  | nvidia_deeprecommender        | 0.9269   | 0.9616                 |
| torchbench  | timm_regnet                   | 0.9225   | 0.818                  |
| torchbench  | resnet50                      | 0.9126   | 0.7539                 |
| torchbench  | yolov3                        | 0.8549   | 0.837                  |
| torchbench  | mobilenet_v2                  | 0.8515   | 0.8052                 |
| torchbench  | functorch_dp_cifar10          | 0.0      | 0.0                    |
| torchbench  | hf_BigBird                    | 0.0      | 0.0                    |
| torchbench  | hf_Longformer                 | 0.0      | 0.0                    |
| torchbench  | hf_GPT2_large                 | 0.0      | 1.8554                 |
| torchbench  | tacotron2                     | 0.0      | 0.8596                 |
| torchbench  | dlrm                          | 0.0      | 0.8849                 |
| torchbench  | moco                          | 0.0      | 0.0                    |
| huggingface | DebertaV2ForMaskedLM          | 1.2557   | 0.8788                 |
| huggingface | DebertaV2ForQuestionAnswering | 1.1239   | 0.9054                 |
| huggingface | AllenaiLongformerBase         | 0.0      | 0.0                    |
| huggingface | BlenderbotForCausalLM         | 0.0      | 1.273                  |
| huggingface | YituTechConvBert              | 0.0      | 0.0                    |
| timm_models | mobilenetv3_large_100         | 1.1363   | 0.9466                 |
| timm_models | selecsls42b                   | 0.8812   | 0.8548                 |
| timm_models | repvgg_a2                     | 0.8603   | 0.7584                 |
| timm_models | cspdarknet53                  | 0.8258   | 0.8258                 |
| timm_models | dla102                        | 0.8177   | 0.8404                 |
| timm_models | res2net101_26w_4s             | 0.8035   | 0.7442                 |
| timm_models | gernet_l                      | 0.7987   | 0.7642                 |
| timm_models | tf_efficientnet_b0            | 0.7967   | 0.9185                 |
| timm_models | mobilevit_s                   | 0.7962   | 0.7603                 |
| timm_models | tinynet_a                     | 0.7777   | 0.8033                 |
| timm_models | resnest101e                   | 0.7502   | 0.7386                 |
| timm_models | visformer_small               | 0.7343   | 0.751                  |
| timm_models | convmixer_768_32              | 0.7244   | 0.7222                 |
| timm_models | dpn107                        | 0.7165   | 0.7794                 |
| timm_models | sebotnet33ts_256              | 0.7106   | 0.737                  |
| timm_models | gluon_xception65              | 0.7097   | 0.6396                 |
| timm_models | ese_vovnet19b_dw              | 0.7089   | 0.654                  |
| timm_models | swsl_resnext101_32x16d        | 0.6821   | 0.6621                 |
| timm_models | eca_botnext26ts_256           | 0.6772   | 0.589                  |
| timm_models | rexnet_100                    | 0.665    | 0.6766                 |
| timm_models | res2net50_14w_8s              | 0.6495   | 0.6665                 |
| timm_models | mobilenetv2_100               | 0.6446   | 0.718                  |
| timm_models | res2next50                    | 0.6058   | 0.6478                 |
| timm_models | botnet26t_256                 | 0.5853   | 0.7098                 |
| timm_models | eca_halonext26ts              | 0.0      | 0.0                    |
+-------------+-------------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+-------------+-------------------------------+-----------+------------------------+
| suite       | name                          | inductor  | inductor_no_cudagraphs |
+-------------+-------------------------------+-----------+------------------------+
| torchbench  | yolov3                        | 1081.5878 | 1078.1258              |
| torchbench  | densenet121                   | 399.0099  | 396.1912               |
| torchbench  | timm_efficientdet             | 284.9827  | 286.8212               |
| torchbench  | mobilenet_v3_large            | 166.8705  | 165.2243               |
| torchbench  | timm_efficientnet             | 158.3583  | 155.2494               |
| torchbench  | hf_T5_large                   | 121.5188  | 119.3749               |
| torchbench  | mnasnet1_0                    | 120.0039  | 117.2006               |
| huggingface | XLNetLMHeadModel              | 216.2468  | 167.644                |
| huggingface | DebertaV2ForQuestionAnswering | 153.0711  | 64.1896                |
| huggingface | DebertaV2ForMaskedLM          | 150.6938  | 59.0165                |
| timm_models | hrnet_w18                     | 215.1034  | 206.6616               |
| timm_models | res2net50_14w_8s              | 201.8613  | 197.5206               |
| timm_models | pnasnet5large                 | 198.1756  | 193.747                |
| timm_models | ghostnet_100                  | 198.0537  | 196.0626               |
| timm_models | res2net101_26w_4s             | 138.2807  | 136.6565               |
| timm_models | dpn107                        | 136.4089  | 132.843                |
| timm_models | rexnet_100                    | 125.0278  | 122.9583               |
| timm_models | twins_pcpvt_base              | 123.7006  | 122.246                |
+-------------+-------------------------------+-----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+---------------------------------+----------+------------------------+
| suite       | name                            | inductor | inductor_no_cudagraphs |
+-------------+---------------------------------+----------+------------------------+
| torchbench  | mobilenet_v2                    | 0.885    | 1.0922                 |
| torchbench  | timm_vision_transformer_large   | 0.879    | 0.9541                 |
| torchbench  | hf_Bert                         | 0.8735   | 0.942                  |
| torchbench  | hf_T5_large                     | 0.8541   | 0.8541                 |
| torchbench  | fastNLP_Bert                    | 0.8521   | 1.0681                 |
| torchbench  | Background_Matting              | 0.845    | 1.0426                 |
| torchbench  | hf_DistilBert                   | 0.8384   | 0.9054                 |
| torchbench  | timm_regnet                     | 0.836    | 1.0095                 |
| torchbench  | yolov3                          | 0.8316   | 0.9828                 |
| torchbench  | hf_Bart                         | 0.8224   | 1.0097                 |
| torchbench  | shufflenet_v2_x1_0              | 0.8123   | 0.9457                 |
| torchbench  | resnet152                       | 0.8058   | 0.9398                 |
| torchbench  | alexnet                         | 0.7973   | 1.0079                 |
| torchbench  | pytorch_unet                    | 0.7877   | 0.7907                 |
| torchbench  | pytorch_stargan                 | 0.7783   | 0.8847                 |
| torchbench  | vgg16                           | 0.7633   | 1.0588                 |
| torchbench  | drq                             | 0.752    | 0.9256                 |
| torchbench  | soft_actor_critic               | 0.7295   | 1.0368                 |
| torchbench  | timm_resnest                    | 0.7215   | 0.9569                 |
| torchbench  | timm_vision_transformer         | 0.7151   | 0.7249                 |
| torchbench  | timm_vovnet                     | 0.6882   | 0.8809                 |
| torchbench  | resnet50                        | 0.6745   | 0.8696                 |
| torchbench  | mnasnet1_0                      | 0.659    | 0.7667                 |
| torchbench  | mobilenet_v3_large              | 0.6583   | 0.8111                 |
| torchbench  | resnext50_32x4d                 | 0.651    | 0.7706                 |
| torchbench  | squeezenet1_1                   | 0.6336   | 0.7041                 |
| torchbench  | hf_Reformer                     | 0.5851   | 1.0017                 |
| torchbench  | lennard_jones                   | 0.5641   | 0.9993                 |
| torchbench  | nvidia_deeprecommender          | 0.5596   | 0.5596                 |
| torchbench  | resnet18                        | 0.5498   | 0.618                  |
| torchbench  | densenet121                     | 0.5388   | 0.6133                 |
| torchbench  | LearningToPaint                 | 0.4882   | 0.6195                 |
| torchbench  | pytorch_struct                  | 0.4235   | 0.4353                 |
| torchbench  | dcgan                           | 0.2123   | 0.2137                 |
| torchbench  | dlrm                            | nan      | 0.8155                 |
| torchbench  | tacotron2                       | nan      | 0.4112                 |
| huggingface | DistilBertForMaskedLM           | 0.8716   | 0.9439                 |
| huggingface | Speech2Text2ForCausalLM         | 0.8672   | 0.9793                 |
| huggingface | ElectraForCausalLM              | 0.856    | 0.9327                 |
| huggingface | BlenderbotSmallForCausalLM      | 0.846    | 0.9426                 |
| huggingface | M2M100ForConditionalGeneration  | 0.8269   | 1.0237                 |
| huggingface | XGLMForCausalLM                 | 0.8055   | 0.9902                 |
| huggingface | MobileBertForMaskedLM           | 0.6698   | 0.9649                 |
| huggingface | DebertaV2ForMaskedLM            | 0.6117   | 0.9912                 |
| huggingface | MobileBertForQuestionAnswering  | 0.5988   | 0.8126                 |
| huggingface | DebertaV2ForQuestionAnswering   | 0.5266   | 0.9885                 |
| huggingface | DebertaForMaskedLM              | 0.409    | 1.0674                 |
| huggingface | DebertaForQuestionAnswering     | 0.3071   | 1.1614                 |
| timm_models | mobilenetv2_100                 | 0.8962   | 1.1046                 |
| timm_models | vit_base_patch16_224            | 0.8916   | 0.8968                 |
| timm_models | deit_base_distilled_patch16_224 | 0.8911   | 0.8962                 |
| timm_models | mixnet_l                        | 0.8815   | 0.98                   |
| timm_models | eca_botnext26ts_256             | 0.8765   | 1.1944                 |
| timm_models | dla102                          | 0.8723   | 1.0162                 |
| timm_models | fbnetv3_b                       | 0.8648   | 1.0056                 |
| timm_models | gluon_inception_v3              | 0.8599   | 0.9862                 |
| timm_models | inception_v3                    | 0.8599   | 0.9862                 |
| timm_models | adv_inception_v3                | 0.8599   | 0.9862                 |
| timm_models | swsl_resnext101_32x16d          | 0.852    | 0.9728                 |
| timm_models | dpn107                          | 0.8455   | 0.9441                 |
| timm_models | gluon_xception65                | 0.8442   | 0.965                  |
| timm_models | cspdarknet53                    | 0.8368   | 0.9122                 |
| timm_models | crossvit_9_240                  | 0.8175   | 1.1003                 |
| timm_models | res2net101_26w_4s               | 0.8146   | 0.9442                 |
| timm_models | resmlp_12_224                   | 0.8092   | 0.8239                 |
| timm_models | ese_vovnet19b_dw                | 0.8041   | 1.0134                 |
| timm_models | convnext_base                   | 0.8022   | 1.0059                 |
| timm_models | selecsls42b                     | 0.7927   | 0.9534                 |
| timm_models | spnasnet_100                    | 0.787    | 0.9294                 |
| timm_models | coat_lite_mini                  | 0.7834   | 1.0066                 |
| timm_models | mnasnet_100                     | 0.7727   | 0.9234                 |
| timm_models | res2net50_14w_8s                | 0.7713   | 0.9528                 |
| timm_models | ghostnet_100                    | 0.7706   | 1.0052                 |
| timm_models | res2next50                      | 0.7697   | 0.9414                 |
| timm_models | hrnet_w18                       | 0.7607   | 0.9414                 |
| timm_models | swin_base_patch4_window7_224    | 0.7566   | 0.9257                 |
| timm_models | mobilenetv3_large_100           | 0.75     | 0.9634                 |
| timm_models | sebotnet33ts_256                | 0.7318   | 0.8133                 |
| timm_models | gernet_l                        | 0.7239   | 0.9336                 |
| timm_models | fbnetc_100                      | 0.7101   | 0.9306                 |
| timm_models | lcnet_050                       | 0.6955   | 0.8352                 |
| timm_models | jx_nest_base                    | 0.6668   | 0.8553                 |
| timm_models | botnet26t_256                   | 0.6615   | 0.9433                 |
| timm_models | regnety_002                     | 0.5858   | 0.8993                 |
| timm_models | repvgg_a2                       | 0.5572   | 0.8383                 |
+-------------+---------------------------------+----------+------------------------+
~~~
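The four flagging rules listed at the top of this section can be sketched as a single check per model and compiler. The thresholds come from the text above; the function name, the flag labels, and the treatment of `pass_due_to_skip` as passing are assumptions. Because comparisons against NaN are false in Python, a `nan` latency or compression ratio is not flagged by this sketch.

```python
SPEEDUP_MIN = 0.95               # flag when speedup < 0.95x
COMPILE_LATENCY_MAX_SEC = 120.0  # flag when compilation latency > 120 sec
COMPRESSION_MIN = 0.9            # flag when compression ratio < 0.9


def warnings_for(accuracy, speedup, compile_sec, compression):
    """Return the list of warning categories that apply to one result."""
    flags = []
    if accuracy not in ("pass", "pass_due_to_skip"):
        flags.append("accuracy")
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_sec > COMPILE_LATENCY_MAX_SEC:
        flags.append("compilation_latency")
    if compression < COMPRESSION_MIN:
        flags.append("compression_ratio")
    return flags


# inductor results for two torchbench models, copied from this report.
print(warnings_for("pass", 0.8549, 1081.5878, 0.8316))  # yolov3
print(warnings_for("pass", 0.9987, 0.2722, 0.9872))     # demucs
```

With these inputs yolov3 trips the speedup, latency, and memory rules, while demucs trips none, which matches its absence from the warning tables.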

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450

Performance speedup regressions

~~~
+------------------------+-------------+-------------+------------+
| compiler               | name        | prev_status | cur_status |
+------------------------+-------------+-------------+------------+
| inductor_no_cudagraphs | timm_vovnet | 0.9567      | 0.9018     |
+------------------------+-------------+-------------+------------+
~~~

No regressions found.

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450

Accuracy regressions

~~~
+------------------------+------------------+-------------+---------------+
| compiler               | name             | prev_status | cur_status    |
+------------------------+------------------+-------------+---------------+
| inductor               | ese_vovnet19b_dw | pass        | fail_accuracy |
| inductor_no_cudagraphs | ese_vovnet19b_dw | pass        | fail_accuracy |
+------------------------+------------------+-------------+---------------+
~~~

Performance speedup regressions

~~~
+------------------------+-----------------------+-------------+------------+
| compiler               | name                  | prev_status | cur_status |
+------------------------+-----------------------+-------------+------------+
| inductor_no_cudagraphs | mobilenetv3_large_100 | 1.0811      | 0.9466     |
+------------------------+-----------------------+-------------+------------+
~~~

No regressions found.
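A regression in this sense is a model that was not flagged in the previous report but is flagged in the current one. A minimal sketch for the speedup metric follows, assuming each report is available as a dict from model name to speedup; that layout and the function name are hypothetical. The first two entries use values from the tables above, and `model_x` is a made-up model that was already below the threshold and so is not reported as new.

```python
def speedup_regressions(prev, cur, threshold=0.95):
    """Return {model: (prev, cur)} for models newly below the speedup threshold."""
    regressions = {}
    for name, cur_val in cur.items():
        prev_val = prev.get(name)
        if prev_val is None:
            continue  # no previous measurement to compare against
        if prev_val >= threshold and cur_val < threshold:
            regressions[name] = (prev_val, cur_val)
    return regressions


prev_report = {"timm_vovnet": 0.9567, "mobilenetv3_large_100": 1.0811, "model_x": 0.80}
cur_report = {"timm_vovnet": 0.9018, "mobilenetv3_large_100": 0.9466, "model_x": 0.78}
print(speedup_regressions(prev_report, cur_report))
```

The same comparison would apply to the other metrics by swapping in their thresholds and, for accuracy, comparing status strings instead of numbers.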

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0 | 0.9302 | 2.4678 | 0.7339 | 5.5242 | 1.1328 | | timm_efficientdet | 1 | 0.9846 | 0.8114 | 2.1017 | 0.0 | 4.0831 | 1.4561 | | BERT_pytorch | 16 | 1.0083 | 0.8391 | 1.86 | 0.7823 | 3.4676 | 2.3282 | | timm_vision_transformer | 8 | 1.0047 | 0.8555 | 1.7587 | 0.6173 | 3.0601 | 1.5581 | | dcgan | 32 | 0.9728 | 0.915 | 1.6608 | 0.7037 | 2.8671 | 1.0287 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.995 | 0.9731 | 1.7533 | 0.0 | 2.8276 | 1.6783 | | resnet18 | 16 | 1.0029 | 0.9984 | 1.6882 | 0.7945 | 2.6766 | 1.2039 | | hf_T5_large | 2 | 1.0189 | 0.8643 | 0.0 | 0.0 | 2.5773 | 2.2391 | | resnext50_32x4d | 8 | 1.0028 | 0.958 | 2.114 | 0.7405 | 2.4029 | 1.0188 | | hf_Bert | 4 | 1.0314 | 0.865 | 0.9467 | 0.0 | 2.3852 | 1.9013 | | hf_Albert | 8 | 1.0016 | 0.9634 | 0.7735 | 0.0 | 2.375 | 2.3219 | | mobilenet_v3_large | 32 | 1.0024 | 1.005 | 1.4854 | 0.7682 | 2.3336 | 1.1748 | | drq | 1 | 0.9877 | 0.8145 | 1.9555 | 0.5975 | 2.295 | 1.1238 | | squeezenet1_1 | 32 | 0.9909 | 0.9575 | 1.5026 | 0.7231 | 2.1334 | 1.2083 | | pytorch_struct | 200 | 0.9822 | 0.7444 | 1.0154 | 0.5962 | 2.0761 | 1.2691 | | lennard_jones | 1000 | 0.9462 | 0.7718 | 1.2346 | 0.4668 | 2.029 | 1.0468 | | hf_GPT2 | 4 | 1.0218 | 0.9871 | 0.8141 | 0.2886 | 1.9308 | 1.9085 | | hf_T5 | 8 | 0.9998 | 0.9242 | 0.0 | 1.3518 | 1.8787 | 1.8788 | | mnasnet1_0 | 32 | 0.9989 | 1.02 | 1.2585 | 0.758 | 1.799 | 1.0958 | | LearningToPaint | 96 | 0.9974 | 1.0187 | 1.1686 | 0.8113 | 1.7417 | 1.2325 | | hf_Bart | 4 | 1.012 | 0.8402 | 0.8827 | 0.0 | 1.7409 | 1.7361 | | attention_is_all_you_need_pytorch | 256 | 1.0069 | 
0.9143 | 0.8324 | 0.0 | 1.7194 | 1.618 | | timm_efficientnet | 32 | 0.9604 | 0.812 | 1.091 | 0.6798 | 1.6622 | 0.9967 | | soft_actor_critic | 256 | 0.9787 | 0.7374 | 1.2598 | 0.5241 | 1.6324 | 1.0082 | | shufflenet_v2_x1_0 | 128 | 1.0017 | 1.019 | 0.9822 | 0.8463 | 1.6018 | 1.2354 | | speech_transformer | 32 | 0.9945 | 0.8265 | 1.7449 | 0.6323 | 1.5632 | 1.5838 | | fastNLP_Bert | 6 | 0.9966 | 0.9068 | 0.7649 | 0.0 | 1.527 | 1.4698 | | hf_DistilBert | 8 | 0.9996 | 0.9722 | 0.7422 | 0.3653 | 1.5051 | 1.4781 | | pytorch_stargan | 16 | 0.9996 | 1.0929 | 1.0438 | 0.0 | 1.4714 | 1.4161 | | resnet152 | 32 | 1.0026 | 1.0006 | 1.261 | 0.0 | 1.3204 | 1.0102 | | timm_nfnet | 128 | 0.9984 | 1.0 | 0.8781 | 0.9181 | 1.3145 | 1.2651 | | pytorch_unet | 1 | 0.9989 | 0.2116 | 0.0 | 0.0 | 1.3074 | 1.3086 | | vgg16 | 64 | 0.9993 | 0.9971 | 0.8574 | 0.973 | 1.2664 | 1.2572 | | Super_SloMo | 6 | 0.9996 | 0.1768 | 0.0 | 0.0 | 1.2505 | 1.2107 | | alexnet | 128 | 0.9986 | 0.997 | 0.8141 | 0.9214 | 1.2155 | 1.2127 | | hf_Reformer | 4 | 0.9972 | 0.9987 | 0.9915 | 0.6439 | 1.1776 | 1.1787 | | Background_Matting | 4 | 0.9993 | 0.1454 | 0.0 | 0.0 | 1.1448 | 1.129 | | timm_resnest | 32 | 1.0031 | 1.0216 | 0.8368 | 0.9602 | 1.1195 | 1.0318 | | timm_vovnet | 32 | 0.9208 | 0.8887 | 0.8644 | 0.8016 | 1.1115 | 0.9018 | | timm_vision_transformer_large | 8 | 0.9998 | 0.9896 | 0.0 | 0.0 | 1.1085 | 1.091 | | tts_angular | 64 | 0.9859 | 0.9255 | 0.9759 | 0.9356 | 1.0287 | 1.0083 | | demucs | 4 | 1.0015 | 1.0008 | 1.0007 | 1.0021 | 0.9987 | 1.0 | | nvidia_deeprecommender | 256 | 0.9984 | 0.9957 | 0.6965 | 1.0054 | 0.9269 | 0.9616 | | timm_regnet | 32 | 0.9796 | 0.9412 | 0.896 | 0.7767 | 0.9225 | 0.818 | | resnet50 | 32 | 1.0009 | 1.0189 | 1.0379 | 0.8034 | 0.9126 | 0.7539 | | yolov3 | 16 | 0.9992 | 0.9899 | 0.8036 | 0.0 | 0.8549 | 0.837 | | mobilenet_v2 | 96 | 0.9996 | 0.9892 | 0.7583 | 1.0136 | 0.8515 | 0.8052 | | functorch_dp_cifar10 | 64 | 1.0009 | 0.9512 | 2.4209 | 0.0 | 0.0 | 0.0 | | hf_BigBird | 2 | 
0.9504 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 2 | 0.9101 | 0.8436 | 0.8578 | 0.0 | 0.0 | 0.0 | | hf_GPT2_large | 4 | 1.0001 | 0.9912 | 0.0 | 0.0 | 0.0 | 1.8554 | | tacotron2 | 64 | 0.959 | 0.7583 | 0.9702 | 0.5987 | 0.0 | 0.8596 | | dlrm | 1024 | 0.8932 | 0.6227 | 0.0 | 1.3 | 0.0 | 0.8849 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | 
fail_to_run | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | Background_Matting | 4 | pass | pass | 0.0000 | fail_to_run | pass | pass | | Super_SloMo | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct 
| 200 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | fail_to_run | 0.0000 | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+ | yolov3 | 16 | 1.8273 | 7.6398 | 11.362 | nan | 1081.5878 | 1078.1258 | | densenet121 | 4 | 2.441 | 13.4489 | 20.3409 | 217.1022 | 399.0099 | 396.1912 | | timm_efficientdet | 1 | 10.9214 | 30.1375 | 63.188 | nan | 284.9827 | 286.8212 | | mobilenet_v3_large | 32 | 1.0437 | 5.2551 | 7.6924 | 117.2624 | 166.8705 | 165.2243 | | timm_efficientnet | 32 | 2.041 | 7.5551 | 
15.5239 | 148.7464 | 158.3583 | 155.2494 | | hf_T5_large | 2 | 14.0168 | 42.5852 | nan | nan | 121.5188 | 119.3749 | | mnasnet1_0 | 32 | 0.9715 | 4.7465 | 6.8288 | 85.8785 | 120.0039 | 117.2006 | | resnet152 | 32 | 2.7888 | 14.5996 | 22.9957 | nan | 106.4084 | 105.896 | | resnext50_32x4d | 8 | 1.0219 | 4.9602 | 7.2674 | 80.3186 | 104.4695 | 105.2352 | | timm_vovnet | 32 | 1.5816 | 4.7004 | 9.8411 | 70.8964 | 93.4487 | 90.9119 | | timm_regnet | 32 | 2.5571 | 8.791 | 19.187 | 139.5193 | 83.8983 | 82.3155 | | mobilenet_v2 | 96 | 0.9608 | 4.9318 | 7.4591 | 106.7438 | 78.8435 | 77.9521 | | timm_resnest | 32 | 0.6662 | 2.7201 | 4.0911 | 67.5584 | 77.9964 | 77.4739 | | shufflenet_v2_x1_0 | 128 | 1.1044 | 5.5217 | 8.1326 | 98.7303 | 77.8131 | 77.2906 | | resnet50 | 32 | 1.0301 | 4.9614 | 7.3213 | 95.9103 | 74.411 | 73.4624 | | timm_vision_transformer_large | 8 | 3.3837 | 16.7663 | nan | nan | 73.7646 | 72.1081 | | timm_nfnet | 128 | 2.3274 | 7.7683 | 11.1274 | 147.8799 | 67.011 | 66.5028 | | squeezenet1_1 | 32 | 0.2725 | 1.0373 | 1.47 | 7.3094 | 54.0241 | 53.6842 | | resnet18 | 16 | 0.469 | 1.962 | 2.8316 | 40.5992 | 41.4066 | 41.6762 | | timm_vision_transformer | 8 | 1.0445 | 4.9818 | 6.9098 | 80.8256 | 35.3121 | 34.2473 | | LearningToPaint | 96 | 0.4992 | 2.2076 | 3.0647 | 47.3229 | 35.2344 | 34.433 | | hf_Bart | 4 | 2.1606 | 9.72 | 14.5466 | nan | 34.0958 | 33.3086 | | BERT_pytorch | 16 | 1.7704 | 8.2648 | 12.3438 | 93.3742 | 33.3889 | 32.7231 | | attention_is_all_you_need_pytorch | 256 | 1.5787 | 7.8682 | 11.9263 | nan | 32.3753 | 32.6749 | | Background_Matting | 4 | 1.0465 | 9.2644 | nan | nan | 30.8537 | 30.3927 | | hf_T5 | 8 | 2.7155 | 9.4465 | nan | 80.7397 | 30.4877 | 30.0156 | | fastNLP_Bert | 6 | 1.8964 | 7.8122 | 12.1393 | nan | 29.0438 | 27.6841 | | speech_transformer | 32 | 2.1398 | 9.961 | 32.177 | 148.1387 | 28.4001 | 27.0739 | | pytorch_stargan | 16 | 0.4641 | 2.184 | 3.0211 | nan | 27.4648 | 27.3693 | | Super_SloMo | 6 | 1.199 | 8.3685 | nan | nan | 
22.1998 | 21.7221 | | pytorch_struct | 200 | 0.2904 | 0.9223 | 1.5215 | 7.9212 | 22.158 | 18.6069 | | hf_Bert | 4 | 1.9421 | 7.5756 | 10.8553 | nan | 21.6645 | 21.265 | | hf_Albert | 8 | 1.6801 | 7.1705 | 10.7068 | nan | 21.2884 | 19.6078 | | hf_GPT2 | 4 | 1.8991 | 7.1433 | 9.8688 | 81.1445 | 20.9995 | 20.2853 | | hf_Reformer | 4 | 1.7864 | 3.1798 | 5.639 | 17.2924 | 18.7321 | 16.0599 | | hf_DistilBert | 8 | 0.8626 | 3.7495 | 5.9124 | 53.5578 | 14.1518 | 13.8453 | | pytorch_unet | 1 | 0.5166 | 3.3134 | nan | nan | 12.5718 | 12.2117 | | dcgan | 32 | 0.1827 | 0.4548 | 0.696 | 5.238 | 9.7675 | 10.1175 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4798 | 2.152 | 2.9983 | nan | 9.4555 | 9.334 | | drq | 1 | 0.3356 | 0.6746 | 1.0911 | 6.6652 | 4.4133 | 3.7543 | | vgg16 | 64 | 0.1946 | 0.7133 | 1.0727 | 5.838 | 4.2978 | 4.2006 | | alexnet | 128 | 0.1768 | 0.5258 | 0.7602 | 5.1354 | 3.7845 | 3.7644 | | nvidia_deeprecommender | 256 | 0.2093 | 0.4879 | 0.7969 | 6.0597 | 3.5854 | 3.281 | | soft_actor_critic | 256 | 0.2297 | 0.3894 | 0.5733 | 3.0022 | 3.468 | 2.8107 | | lennard_jones | 1000 | 0.1605 | 0.3695 | 0.5525 | 3.1463 | 2.1803 | 1.9285 | | tts_angular | 64 | 0.178 | 0.2461 | 0.3744 | 1.5921 | 2.0295 | 1.6553 | | demucs | 4 | 0.3541 | 0.3606 | 0.3584 | 0.3553 | 0.2722 | 0.2675 | | hf_GPT2_large | 4 | 6.371 | 22.1741 | nan | nan | nan | 55.3746 | | tacotron2 | 64 | 5.4001 | 20.6947 | 34.8686 | 91.1978 | nan | 45.9722 | | dlrm | 1024 | 0.3018 | 0.7139 | nan | 4.9871 | nan | 2.9867 | | hf_Longformer | 2 | 6.7829 | 15.9839 | 57.6063 | nan | nan | nan | | functorch_dp_cifar10 | 64 | 0.3377 | 1.5183 | 2.2249 | nan | nan | nan | | hf_BigBird | 2 | 4.2254 | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+-----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| hf_Albert | 8 | 1.0001 | 0.936 | 0.3267 | nan | 1.1576 | 1.4693 |
| speech_transformer | 32 | 0.9991 | 0.9812 | 0.3344 | 1.1938 | 1.0901 | 1.0966 |
| attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3513 | nan | 1.024 | 1.176 |
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 |
| timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3084 | nan | 0.9837 | 1.1225 |
| timm_efficientnet | 32 | 0.988 | 0.7698 | 0.272 | 0.4638 | 0.9696 | 1.2228 |
| hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.3799 | 1.1204 | 0.9649 | 1.1241 |
| BERT_pytorch | 16 | 1.0003 | 0.8825 | 0.3995 | 1.1102 | 0.9558 | 1.1347 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8754 | 0.4232 | nan | 0.9506 | 1.0489 |
| Super_SloMo | 6 | 1.0024 | 0.8284 | nan | nan | 0.9361 | 1.2946 |
| timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3556 | 0.4815 | 0.9296 | 1.0969 |
| hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.0304 | 0.928 | 1.247 |
| mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 0.885 | 1.0922 |
| timm_vision_transformer_large | 8 | 0.9973 | 0.8358 | nan | nan | 0.879 | 0.9541 |
| hf_Bert | 4 | 1.0 | 0.8759 | 0.3902 | nan | 0.8735 | 0.942 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 |
| fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3701 | nan | 0.8521 | 1.0681 |
| Background_Matting | 4 | 1.0142 | 0.6522 | nan | nan | 0.845 | 1.0426 |
| hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3414 | 1.0708 | 0.8384 | 0.9054 |
| timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.8027 | 0.836 | 1.0095 |
| yolov3 | 16 | 0.9908 | 0.8381 | 0.3537 | nan | 0.8316 | 0.9828 |
| hf_Bart | 4 | 1.0002 | 0.8307 | 0.3635 | nan | 0.8224 | 1.0097 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8123 | 0.9457 |
| resnet152 | 32 | 0.9937 | 0.8956 | 0.3632 | nan | 0.8058 | 0.9398 |
| alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.775 | 0.7973 | 1.0079 |
| pytorch_unet | 1 | 0.9968 | 0.7229 | nan | nan | 0.7877 | 0.7907 |
| pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4252 | nan | 0.7783 | 0.8847 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.734 | 0.7633 | 1.0588 |
| drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 |
| soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 |
| timm_resnest | 32 | 0.9868 | 0.8711 | 0.3482 | 0.8451 | 0.7215 | 0.9569 |
| timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3925 | 1.0881 | 0.7151 | 0.7249 |
| timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3409 | 0.7755 | 0.6882 | 0.8809 |
| resnet50 | 32 | 0.9907 | 0.8629 | 0.3561 | 0.7806 | 0.6745 | 0.8696 |
| mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3408 | 0.8226 | 0.659 | 0.7667 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3448 | 0.7921 | 0.6583 | 0.8111 |
| resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3883 | 0.81 | 0.651 | 0.7706 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.6336 | 0.7041 |
| hf_Reformer | 4 | 0.9996 | 0.9996 | 0.6037 | 0.9999 | 0.5851 | 1.0017 |
| lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.5641 | 0.9993 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5124 | 0.5596 | 0.5596 | 0.5596 |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.3949 | 0.7314 | 0.5498 | 0.618 |
| densenet121 | 4 | 0.9857 | 0.8678 | 0.3673 | 0.8452 | 0.5388 | 0.6133 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | 0.3826 | 0.6701 | 0.4882 | 0.6195 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.4994 | 0.7838 | 0.2123 | 0.2137 |
| hf_GPT2_large | 4 | 0.9956 | 0.8732 | nan | nan | nan | 1.1499 |
| dlrm | 1024 | 0.8152 | 0.8152 | nan | 0.8149 | nan | 0.8155 |
| tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3993 | nan | 0.4112 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4447 | nan | nan | nan |
| hf_Longformer | 2 | 0.9996 | 0.9671 | 0.3491 | nan | nan | nan |
| hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)

~~~
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| timm_vision_transformer_large | 8 | 184.3384 | 186.0032 | nan | nan | 165.9352 | 168.5332 |
| Background_Matting | 4 | 139.0433 | 917.887 | nan | nan | 116.7568 | 118.0694 |
| timm_nfnet | 128 | 131.9506 | 131.0927 | 148.9979 | 142.8306 | 99.9446 | 103.5297 |
| hf_T5 | 8 | 174.3383 | 188.1655 | nan | 128.7709 | 92.5117 | 92.7181 |
| hf_T5_large | 2 | 216.2853 | 306.5104 | nan | nan | 89.2826 | 105.3627 |
| yolov3 | 16 | 68.5194 | 69.1487 | 85.3105 | nan | 80.0936 | 81.7454 |
| timm_regnet | 32 | 79.4902 | 77.1119 | 80.885 | 94.2201 | 79.7873 | 89.5314 |
| resnet152 | 32 | 92.4518 | 90.9384 | 73.3421 | nan | 75.1369 | 94.0712 |
| hf_Reformer | 4 | 82.3919 | 82.1177 | 82.8159 | 128.2003 | 69.7857 | 69.6975 |
| Super_SloMo | 6 | 79.3415 | 449.9924 | nan | nan | 64.0002 | 65.3772 |
| mobilenet_v2 | 96 | 48.7951 | 49.3057 | 64.5048 | 48.213 | 57.3396 | 60.7803 |
| demucs | 4 | 57.4962 | 57.0747 | 57.0698 | 57.091 | 56.9909 | 56.7642 |
| vgg16 | 64 | 66.2 | 66.4226 | 77.3446 | 67.9556 | 52.1958 | 52.5522 |
| timm_efficientdet | 1 | 166.503 | 199.5607 | 77.6237 | nan | 42.143 | 115.926 |
| speech_transformer | 32 | 60.4146 | 73.4872 | 35.3571 | 95.5694 | 40.4909 | 39.962 |
| resnet50 | 32 | 33.701 | 32.6268 | 32.4506 | 42.3242 | 37.9673 | 45.7459 |
| fastNLP_Bert | 6 | 55.8858 | 61.5206 | 73.1328 | nan | 36.7466 | 37.9818 |
| attention_is_all_you_need_pytorch | 256 | 57.7901 | 57.7253 | 63.218 | nan | 34.1445 | 35.7949 |
| hf_Bart | 4 | 56.4984 | 67.7482 | 65.3655 | nan | 33.3104 | 34.6983 |
| pytorch_unet | 1 | 40.0249 | 188.3346 | nan | nan | 30.5131 | 30.5153 |
| timm_vovnet | 32 | 34.7199 | 35.8529 | 37.1062 | 40.4252 | 28.8343 | 36.2962 |
| timm_efficientnet | 32 | 49.1599 | 58.4416 | 43.48 | 69.9236 | 28.7442 | 47.9693 |
| hf_Albert | 8 | 68.1173 | 71.6649 | 88.2224 | nan | 28.7225 | 29.4822 |
| shufflenet_v2_x1_0 | 128 | 42.7814 | 39.8541 | 41.6548 | 48.2403 | 25.7288 | 35.5371 |
| hf_GPT2 | 4 | 48.2348 | 50.0202 | 60.1574 | 169.614 | 25.5152 | 26.0699 |
| timm_resnest | 32 | 24.4646 | 23.9736 | 29.4298 | 25.6658 | 22.3506 | 24.1433 |
| hf_Bert | 4 | 46.588 | 48.0311 | 44.0548 | nan | 21.1318 | 23.2645 |
| hf_DistilBert | 8 | 31.005 | 31.9476 | 41.8434 | 85.0678 | 20.6099 | 21.0299 |
| mnasnet1_0 | 32 | 29.8106 | 28.6122 | 23.2229 | 39.2367 | 16.7794 | 29.5575 |
| BERT_pytorch | 16 | 54.0681 | 77.198 | 35.2578 | 70.0859 | 16.5077 | 28.7943 |
| mobilenet_v3_large | 32 | 35.5275 | 35.5137 | 24.0692 | 48.0279 | 15.7605 | 31.8654 |
| densenet121 | 4 | 74.0396 | 86.5869 | 30.4237 | 104.829 | 14.0717 | 68.9599 |
| resnext50_32x4d | 8 | 29.3687 | 30.3407 | 15.7595 | 39.7632 | 13.6005 | 29.7496 |
| nvidia_deeprecommender | 256 | 10.3845 | 10.4175 | 14.8816 | 10.3235 | 11.1769 | 10.7778 |
| pytorch_stargan | 16 | 16.42 | 14.8375 | 15.5371 | nan | 10.9711 | 11.4537 |
| timm_vision_transformer | 8 | 29.3933 | 34.6065 | 16.7629 | 55.901 | 9.9727 | 20.2941 |
| LearningToPaint | 96 | 14.8576 | 15.5604 | 12.8151 | 19.8477 | 8.7847 | 12.4702 |
| alexnet | 128 | 9.8162 | 9.8625 | 12.0091 | 10.6591 | 8.0683 | 8.1126 |
| squeezenet1_1 | 32 | 15.169 | 15.7335 | 10.1859 | 21.0879 | 7.2571 | 12.759 |
| tts_angular | 64 | 6.8552 | 6.8133 | 6.9105 | 6.7231 | 6.5238 | 6.7145 |
| pytorch_CycleGAN_and_pix2pix | 1 | 17.9417 | 18.7875 | 10.1507 | nan | 6.4785 | 10.9287 |
| resnet18 | 16 | 12.9764 | 12.9778 | 8.5235 | 16.7994 | 4.9879 | 11.019 |
| drq | 1 | 3.9352 | 4.9373 | 1.9997 | 6.7135 | 2.2959 | 4.3117 |
| pytorch_struct | 200 | 4.6224 | 6.1477 | 4.5137 | 7.6927 | 2.2635 | 3.781 |
| dcgan | 32 | 3.1692 | 3.3928 | 1.9023 | 4.4949 | 1.1112 | 3.4339 |
| soft_actor_critic | 256 | 1.4218 | 1.9493 | 1.1529 | 2.8136 | 0.9382 | 1.682 |
| lennard_jones | 1000 | 1.5009 | 1.933 | 1.2045 | 3.2339 | 0.7595 | 1.4837 |
| tacotron2 | 64 | 3175.2675 | 4612.754 | 3120.0873 | 4964.9054 | nan | 3592.183 |
| dlrm | 1024 | 245.4324 | 262.1893 | nan | 142.1094 | nan | 224.3392 |
| hf_GPT2_large | 4 | 209.8637 | 211.2127 | nan | nan | nan | 112.8585 |
| hf_Longformer | 2 | 129.7769 | 140.4929 | 137.713 | nan | nan | nan |
| functorch_dp_cifar10 | 64 | 14.1315 | 16.8764 | 5.9547 | nan | nan | nan |
| hf_BigBird | 2 | 201.3618 | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
~~~
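
A minimal sketch of reading one of these ASCII tables programmatically, in case it helps when comparing backends. It assumes Python and uses three rows copied from the torchbench "Absolute latency (ms)" table above. `parse_table` and `fastest_backend` are hypothetical helper names, not part of the benchmark scripts, and treating `nan` as "no measurement" is an assumption.

```python
import math

# Header and three rows copied from the "Absolute latency (ms)" table above.
TABLE = """\
+--------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+--------+
| timm_nfnet | 128 | 131.9506 | 131.0927 | 148.9979 | 142.8306 | 99.9446 | 103.5297 |
| hf_T5 | 8 | 174.3383 | 188.1655 | nan | 128.7709 | 92.5117 | 92.7181 |
| yolov3 | 16 | 68.5194 | 69.1487 | 85.3105 | nan | 80.0936 | 81.7454 |
+--------+
"""


def parse_table(text):
    """Turn an ASCII table into a list of dicts keyed by column name."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip the +---+ border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def fastest_backend(record):
    """Return the backend with the lowest latency, ignoring nan entries."""
    timings = {k: float(v) for k, v in record.items() if k not in ("name", "bs")}
    valid = {k: v for k, v in timings.items() if not math.isnan(v)}
    return min(valid, key=valid.get)


records = parse_table(TABLE)
fastest = {r["name"]: fastest_backend(r) for r in records}
print(fastest)
```

The same parser should work on the other tables in this comment, since they share the `| name | bs | ... |` layout; the accuracy tables hold strings such as `pass` and `fail_to_run`, so `fastest_backend` does not apply to them.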

huggingface suite with amp precision

Performance speedup

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| MobileBertForMaskedLM | 64 | 1.0172 | 0.8428 | 1.2754 | 0.0 | 2.867 | 1.8242 |
| MT5ForConditionalGeneration | 16 | 1.0238 | 0.8637 | 1.0921 | 0.8707 | 2.4015 | 2.0624 |
| OPTForCausalLM | 2 | 1.0002 | 0.9269 | 0.0 | 0.7891 | 2.3982 | 2.376 |
| GPT2ForSequenceClassification | 4 | 1.0006 | 0.9764 | 0.0 | 0.5019 | 2.2782 | 2.2559 |
| MobileBertForQuestionAnswering | 128 | 1.0209 | 0.839 | 1.0262 | 0.0 | 2.1885 | 1.7942 |
| ElectraForQuestionAnswering | 64 | 1.0009 | 0.9785 | 0.7673 | 0.0 | 2.1089 | 2.0536 |
| XLNetLMHeadModel | 8 | 1.0006 | 0.9726 | 0.0 | 0.0 | 1.8995 | 1.8984 |
| M2M100ForConditionalGeneration | 16 | 1.0373 | 0.8296 | 0.8839 | 0.6673 | 1.8733 | 1.4902 |
| LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9807 | 0.774 | 0.0 | 1.8323 | 1.7923 |
| RobertaForQuestionAnswering | 16 | 0.9995 | 0.9797 | 0.7713 | 0.0 | 1.8251 | 1.782 |
| ElectraForCausalLM | 32 | 1.0002 | 0.9422 | 0.7138 | 0.0 | 1.818 | 1.8223 |
| BertForQuestionAnswering | 16 | 1.0006 | 0.9693 | 0.7632 | 0.0 | 1.8156 | 1.7731 |
| XGLMForCausalLM | 8 | 1.0122 | 0.8239 | 1.0802 | 0.0 | 1.7426 | 1.5832 |
| RobertaForCausalLM | 16 | 1.0005 | 0.9703 | 0.7604 | 0.0 | 1.6991 | 1.6829 |
| DistillGPT2 | 16 | 0.9994 | 0.9693 | 0.7611 | 0.7419 | 1.6838 | 1.7125 |
| AlbertForQuestionAnswering | 4 | 0.9996 | 0.8858 | 0.0 | 0.0 | 1.6635 | 1.6592 |
| PLBartForConditionalGeneration | 4 | 0.9997 | 0.9547 | 0.7383 | 0.0 | 1.661 | 1.6496 |
| MegatronBertForCausalLM | 4 | 1.0668 | 0.8443 | 0.7575 | 0.0 | 1.6452 | 1.5529 |
| AlbertForMaskedLM | 4 | 1.0 | 0.8851 | 0.0 | 0.0 | 1.6421 | 1.6464 |
| MegatronBertForQuestionAnswering | 8 | 0.9995 | 0.9705 | 0.772 | 0.0 | 1.624 | 1.5896 |
| LayoutLMForMaskedLM | 16 | 1.0004 | 0.9712 | 0.756 | 0.0 | 1.619 | 1.6064 |
| T5Small | 4 | 1.0031 | 0.9186 | 0.7538 | 1.1424 | 1.6073 | 1.5733 |
| BertForMaskedLM | 16 | 1.0004 | 0.9704 | 0.744 | 0.0 | 1.6 | 1.5836 |
| T5ForConditionalGeneration | 4 | 1.0052 | 0.9183 | 0.7544 | 1.1339 | 1.5988 | 1.5778 |
| PLBartForCausalLM | 8 | 1.0003 | 0.9675 | 0.7594 | 0.9813 | 1.5498 | 1.6034 |
| CamemBert | 16 | 1.0003 | 0.9718 | 0.7619 | 0.0 | 1.5313 | 1.5186 |
| DistilBertForQuestionAnswering | 256 | 0.9995 | 0.9947 | 0.7573 | 0.6657 | 1.504 | 1.488 |
| MBartForConditionalGeneration | 2 | 1.0084 | 0.9571 | 0.0 | 0.7001 | 1.4696 | 1.4359 |
| BartForConditionalGeneration | 2 | 1.0035 | 0.9663 | 0.0 | 0.0 | 1.4549 | 1.4182 |
| MBartForCausalLM | 4 | 0.9999 | 0.9693 | 0.7575 | 0.9847 | 1.4336 | 1.4314 |
| BartForCausalLM | 4 | 1.0004 | 0.9687 | 0.7574 | 0.0 | 1.4222 | 1.4234 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0079 | 0.9265 | 0.7249 | 0.0 | 1.4057 | 1.4173 |
| Speech2Text2ForCausalLM | 256 | 0.9977 | 0.9444 | 0.6862 | 0.9008 | 1.3326 | 1.3646 |
| PegasusForConditionalGeneration | 32 | 1.0048 | 0.9507 | 0.0 | 0.8069 | 1.2697 | 1.2592 |
| DebertaV2ForMaskedLM | 1 | 0.868 | 0.6783 | 0.7712 | 0.0 | 1.2557 | 0.8788 |
| DebertaForMaskedLM | 4 | 0.8797 | 0.7105 | 0.784 | 0.0 | 1.251 | 1.1281 |
| TrOCRForCausalLM | 32 | 0.9998 | 0.9637 | 0.0 | 0.0 | 1.2498 | 1.2592 |
| DistilBertForMaskedLM | 128 | 0.9996 | 0.9599 | 0.7132 | 0.64 | 1.2187 | 1.2348 |
| BlenderbotSmallForCausalLM | 64 | 1.0029 | 0.9272 | 0.716 | 0.0 | 1.2132 | 1.2298 |
| PegasusForCausalLM | 32 | 0.9986 | 0.9512 | 0.749 | 0.8531 | 1.1518 | 1.1635 |
| DebertaForQuestionAnswering | 8 | 0.9503 | 0.8574 | 0.7233 | 0.0 | 1.1421 | 1.19 |
| DebertaV2ForQuestionAnswering | 2 | 0.8692 | 0.6868 | 0.0 | 0.0 | 1.1239 | 0.9054 |
| AllenaiLongformerBase | 4 | 0.9691 | 0.9098 | 0.8731 | 0.0 | 0.0 | 0.0 |
| BlenderbotForCausalLM | 4 | 1.013 | 0.0 | 0.0 | 0.0 | 0.0 | 1.273 |
| YituTechConvBert | 16 | 1.0005 | 0.964 | 0.7918 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
| BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| YituTechConvBert | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| XLNetLMHeadModel | 8 | 5.224 | 22.0777 | nan | nan | 216.2468 | 167.644 |
| DebertaV2ForQuestionAnswering | 2 | 9.655 | 20.2486 | nan | nan | 153.0711 | 64.1896 |
| DebertaV2ForMaskedLM | 1 | 9.5887 | 20.5068 | 80.8221 | nan | 150.6938 | 59.0165 |
| DebertaForMaskedLM | 4 | 5.5328 | 11.9226 | 36.5087 | nan | 93.4509 | 40.0414 |
| DebertaForQuestionAnswering | 8 | 5.7383 | 11.9815 | 36.0804 | nan | 92.3831 | 42.707 |
| XGLMForCausalLM | 8 | 3.3218 | 14.9412 | 27.9777 | nan | 85.0242 | 77.0203 |
| MobileBertForQuestionAnswering | 128 | 9.8575 | 34.4008 | 60.7497 | nan | 74.3992 | 71.5844 |
| MobileBertForMaskedLM | 64 | 9.8286 | 34.4483 | 59.4176 | nan | 71.8654 | 70.2831 |
| M2M100ForConditionalGeneration | 16 | 4.3036 | 19.3947 | 27.955 | 220.8664 | 71.7648 | 63.3635 |
| MT5ForConditionalGeneration | 16 | 4.0886 | 14.386 | 22.5168 | 124.7076 | 57.6118 | 55.4413 |
| PegasusForConditionalGeneration | 32 | 3.9432 | 18.2365 | nan | 219.1433 | 55.8775 | 52.2933 |
| BartForConditionalGeneration | 2 | 4.2243 | 18.7556 | nan | nan | 53.6659 | 51.3154 |
| MBartForConditionalGeneration | 2 | 4.233 | 20.1238 | nan | 265.0846 | 53.4475 | 51.5024 |
| MegatronBertForCausalLM | 4 | 4.272 | 15.666 | 23.3223 | nan | 43.8001 | 41.0606 |
| MegatronBertForQuestionAnswering | 8 | 4.2745 | 15.5955 | 23.5915 | nan | 42.8919 | 41.033 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.6312 | 12.3895 | 19.3711 | nan | 36.4544 | 35.4209 |
| T5ForConditionalGeneration | 4 | 2.7484 | 9.5253 | 14.5452 | 83.4032 | 32.5059 | 32.023 |
| T5Small | 4 | 2.7231 | 9.7114 | 14.4059 | 82.3036 | 32.077 | 30.903 |
| PLBartForConditionalGeneration | 4 | 2.2358 | 9.9227 | 14.7933 | nan | 31.651 | 30.5279 |
| LayoutLMForSequenceClassification | 16 | 2.3115 | 8.1351 | 12.1635 | nan | 29.2389 | 28.2887 |
| MBartForCausalLM | 4 | 1.6313 | 7.0971 | 10.4293 | 100.9122 | 28.4796 | 24.5636 |
| ElectraForCausalLM | 32 | 2.0104 | 7.8555 | 11.6293 | nan | 28.2834 | 26.4569 |
| PegasusForCausalLM | 32 | 1.6158 | 7.361 | 10.9142 | 95.1718 | 27.1041 | 25.0386 |
| BartForCausalLM | 4 | 1.5916 | 7.6316 | 10.4626 | nan | 24.7322 | 23.5403 |
| BertForQuestionAnswering | 16 | 2.0983 | 7.8678 | 11.3591 | nan | 24.2243 | 21.7931 |
| LayoutLMForMaskedLM | 16 | 2.3489 | 8.2826 | 11.9613 | nan | 23.7582 | 23.038 |
| TrOCRForCausalLM | 32 | 1.5749 | 7.5131 | nan | nan | 23.7117 | 22.7447 |
| ElectraForQuestionAnswering | 64 | 1.9639 | 7.8916 | 11.295 | nan | 23.2869 | 22.2904 |
| BertForMaskedLM | 16 | 2.0806 | 7.9419 | 11.4649 | nan | 23.1641 | 22.512 |
| RobertaForCausalLM | 16 | 2.016 | 8.0397 | 11.5991 | nan | 22.9733 | 22.2264 |
| OPTForCausalLM | 2 | 1.7077 | 7.2925 | nan | 91.242 | 22.7236 | 21.933 |
| CamemBert | 16 | 2.0379 | 7.8568 | 12.0353 | nan | 21.9828 | 21.543 |
| RobertaForQuestionAnswering | 16 | 2.0114 | 7.84 | 11.217 | nan | 21.4239 | 20.2816 |
| GPT2ForSequenceClassification | 4 | 1.9288 | 7.1257 | nan | 80.6258 | 20.5107 | 20.4958 |
| AlbertForMaskedLM | 4 | 1.8142 | 7.4352 | nan | nan | 19.9407 | 19.451 |
| BlenderbotSmallForCausalLM | 64 | 1.1 | 4.7792 | 6.9159 | nan | 19.7024 | 16.9786 |
| AlbertForQuestionAnswering | 4 | 1.7807 | 7.3831 | nan | nan | 19.3929 | 18.9838 |
| Speech2Text2ForCausalLM | 256 | 0.9456 | 3.7079 | 5.7525 | 50.3002 | 15.7233 | 14.1791 |
| PLBartForCausalLM | 8 | 0.9091 | 3.8723 | 5.5849 | 64.9709 | 15.1055 | 14.7824 |
| DistillGPT2 | 16 | 1.033 | 3.7567 | 5.4787 | 42.7721 | 14.1705 | 13.9065 |
| DistilBertForMaskedLM | 128 | 0.8335 | 3.8694 | 6.2291 | 55.7445 | 13.2 | 12.9901 |
| DistilBertForQuestionAnswering | 256 | 0.8815 | 3.8345 | 6.1488 | 53.0661 | 12.7084 | 12.1655 |
| BlenderbotForCausalLM | 4 | 2.9908 | nan | nan | nan | nan | 44.2403 |
| AllenaiLongformerBase | 4 | 6.913 | 15.5639 | 58.3806 | nan | nan | nan |
| YituTechConvBert | 16 | 2.8676 | 11.7096 | 18.2866 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9997 | 0.9183 | nan | 1.2641 | 1.2906 | 1.345 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.1305 | 1.559 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0992 | 1.5169 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | nan | 1.2229 | 1.0775 | 1.1712 |
| MBartForCausalLM | 4 | 1.0 | 0.8998 | 0.3747 | 1.3748 | 1.0747 | 1.1342 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0568 | 1.1144 |
| XLNetLMHeadModel | 8 | 0.9999 | 0.9214 | nan | nan | 1.0303 | 1.0303 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 1.017 | 1.0704 |
| MBartForConditionalGeneration | 2 | 1.0 | 0.9035 | nan | 1.3227 | 1.0148 | 1.2186 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 1.0044 | 1.0277 |
| PegasusForConditionalGeneration | 32 | 0.9979 | 0.9502 | nan | 1.2072 | 1.0039 | 1.1394 |
| RobertaForQuestionAnswering | 16 | 1.004 | 0.9315 | 0.3619 | nan | 1.0036 | 1.0618 |
| BertForQuestionAnswering | 16 | 1.004 | 0.9312 | 0.3618 | nan | 1.0029 | 1.0617 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9976 | 1.1976 |
| DistilBertForQuestionAnswering | 256 | 1.0112 | 0.9568 | 0.3185 | 1.1483 | 0.9806 | 1.0864 |
| DistillGPT2 | 16 | 1.0 | 0.8673 | 0.3597 | 1.1412 | 0.9755 | 1.0618 |
| PegasusForCausalLM | 32 | 0.9749 | 0.8906 | 0.4175 | 1.1321 | 0.9708 | 1.0342 |
| PLBartForConditionalGeneration | 4 | 0.9997 | 0.9325 | 0.3747 | nan | 0.9651 | 1.0848 |
| T5Small | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9635 | 1.1856 |
| T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9635 | 1.1856 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9593 | 1.1105 |
| MegatronBertForQuestionAnswering | 8 | 1.0006 | 0.9101 | 0.3721 | nan | 0.9562 | 1.0239 |
| BertForMaskedLM | 16 | 1.0001 | 0.9237 | 0.3656 | nan | 0.9481 | 0.985 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9481 | 0.9848 |
| RobertaForCausalLM | 16 | 1.0 | 0.9237 | 0.3654 | nan | 0.9475 | 0.9847 |
| CamemBert | 16 | 1.0 | 0.9212 | 0.3657 | nan | 0.9446 | 0.983 |
| TrOCRForCausalLM | 32 | 0.9998 | 0.8789 | nan | nan | 0.9345 | 1.0129 |
| MT5ForConditionalGeneration | 16 | 1.0015 | 0.864 | 0.4151 | 1.0159 | 0.9203 | 1.0032 |
| PLBartForCausalLM | 8 | 0.9999 | 0.8707 | 0.3624 | 1.0907 | 0.9166 | 0.989 |
| MegatronBertForCausalLM | 4 | 1.0 | 0.8798 | 0.3875 | nan | 0.9121 | 1.0221 |
| DistilBertForMaskedLM | 128 | 1.0 | 0.8497 | 0.3516 | 1.0867 | 0.8716 | 0.9439 |
| Speech2Text2ForCausalLM | 256 | 0.9668 | 0.8156 | 0.3505 | 1.0447 | 0.8672 | 0.9793 |
| ElectraForCausalLM | 32 | 0.9977 | 0.8464 | 0.3928 | nan | 0.856 | 0.9327 |
| BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.846 | 0.9426 |
| M2M100ForConditionalGeneration | 16 | 0.9836 | 0.8898 | 0.4205 | 1.0731 | 0.8269 | 1.0237 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9164 | 0.4336 | nan | 0.8055 | 0.9902 |
| MobileBertForMaskedLM | 64 | 0.9999 | 0.8791 | 0.3355 | nan | 0.6698 | 0.9649 |
| DebertaV2ForMaskedLM | 1 | 0.9982 | 0.9412 | 0.4918 | nan | 0.6117 | 0.9912 |
| MobileBertForQuestionAnswering | 128 | 1.0159 | 1.0063 | 0.306 | nan | 0.5988 | 0.8126 |
| DebertaV2ForQuestionAnswering | 2 | 0.9796 | 0.9795 | nan | nan | 0.5266 | 0.9885 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9818 | 0.3623 | nan | 0.409 | 1.0674 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3251 | nan | 0.3071 | 1.1614 |
| BlenderbotForCausalLM | 4 | 1.0002 | nan | nan | nan | nan | 0.9343 |
| YituTechConvBert | 16 | 0.9954 | 0.9173 | 0.3774 | nan | nan | nan |
| AllenaiLongformerBase | 4 | 0.9984 | 0.9145 | 0.3334 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)

~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForMaskedLM | 4 | 266.6672 | 301.2178 | nan | nan | 162.9907 | 162.097 |
| AlbertForQuestionAnswering | 4 | 264.4228 | 298.4664 | nan | nan | 159.3545 | 159.7092 |
| XLNetLMHeadModel | 8 | 273.507 | 283.8525 | nan | nan | 144.9947 | 144.784 |
| PegasusForConditionalGeneration | 32 | 139.598 | 148.5094 | nan | 174.6435 | 111.1776 | 112.0399 |
| TrOCRForCausalLM | 32 | 137.3162 | 142.2869 | nan | nan | 109.7695 | 108.7989 |
| DebertaV2ForQuestionAnswering | 2 | 119.7334 | 149.6162 | nan | nan | 95.1496 | 115.7682 |
| BartForConditionalGeneration | 2 | 135.7047 | 140.8285 | nan | nan | 93.8398 | 95.6646 |
| MBartForConditionalGeneration | 2 | 142.2548 | 150.984 | nan | 194.3102 | 93.0683 | 94.7072 |
| DebertaV2ForMaskedLM | 1 | 116.1149 | 148.5287 | 132.4551 | nan | 92.2488 | 119.5881 |
| MegatronBertForQuestionAnswering | 8 | 141.1115 | 145.5138 | 182.8103 | nan | 86.7958 | 88.9224 |
| MobileBertForQuestionAnswering | 128 | 180.0205 | 212.008 | 182.8584 | nan | 83.5276 | 103.7096 |
| BartForCausalLM | 4 | 112.3487 | 116.1482 | 148.0807 | nan | 78.949 | 78.8976 |
| BlenderbotSmallForConditionalGeneration | 64 | 108.5086 | 119.3032 | 149.8081 | nan | 78.4262 | 78.0181 |
| MBartForCausalLM | 4 | 112.315 | 115.8489 | 148.1491 | 114.2095 | 78.2914 | 78.1707 |
| CamemBert | 16 | 117.9363 | 121.3666 | 154.739 | nan | 77.0025 | 77.6412 |
| PLBartForCausalLM | 8 | 113.0894 | 116.9907 | 145.6649 | 115.412 | 71.2069 | 70.4094 |
| M2M100ForConditionalGeneration | 16 | 104.0388 | 133.0011 | 122.625 | 165.0048 | 70.4981 | 76.8761 |
| PLBartForConditionalGeneration | 4 | 116.3677 | 121.3082 | 157.8205 | nan | 70.0145 | 70.3461 |
| LayoutLMForMaskedLM | 16 | 112.0531 | 115.2025 | 148.2108 | nan | 69.2203 | 69.8024 |
| OPTForCausalLM | 2 | 165.8182 | 175.6777 | nan | 207.4444 | 69.2036 | 69.6743 |
| DistilBertForMaskedLM | 128 | 84.1324 | 87.6218 | 117.8586 | 131.8567 | 68.9763 | 68.1623 |
| DistilBertForQuestionAnswering | 256 | 102.8303 | 103.1863 | 135.7297 | 154.4368 | 68.6008 | 69.1879 |
| BertForMaskedLM | 16 | 109.4428 | 112.8382 | 147.0064 | nan | 68.4496 | 69.1289 |
| RobertaForCausalLM | 16 | 114.3624 | 117.7799 | 150.1846 | nan | 67.4179 | 67.9044 |
| DebertaForQuestionAnswering | 8 | 78.8276 | 87.5262 | 103.9828 | nan | 65.6983 | 64.2877 |
| MobileBertForMaskedLM | 64 | 183.0081 | 214.5384 | 145.3202 | nan | 64.5862 | 104.0614 |
| T5ForConditionalGeneration | 4 | 101.3611 | 110.2064 | 135.5143 | 89.2278 | 63.4372 | 64.3929 |
| T5Small | 4 | 101.3559 | 109.9097 | 135.622 | 88.4278 | 63.2083 | 64.1614 |
| DistillGPT2 | 16 | 105.6508 | 108.9545 | 139.0135 | 142.3237 | 62.7705 | 61.7346 |
| PegasusForCausalLM | 32 | 69.0361 | 72.1004 | 91.7748 | 80.5801 | 59.6869 | 59.6248 |
| ElectraForQuestionAnswering | 64 | 114.2791 | 117.0499 | 149.0911 | nan | 54.3661 | 55.7052 |
| MegatronBertForCausalLM | 4 | 91.4363 | 112.5058 | 114.6257 | nan | 54.1584 | 56.2138 |
| LayoutLMForSequenceClassification | 16 | 96.9989 | 98.9337 | 125.4088 | nan | 53.0309 | 54.2576 |
| XGLMForCausalLM | 8 | 86.3079 | 123.8077 | 93.8768 | nan | 52.5341 | 58.5766 |
| BertForQuestionAnswering | 16 | 94.6861 | 97.5635 | 123.882 | nan | 52.0759 | 53.4104 |
| RobertaForQuestionAnswering | 16 | 95.0488 | 97.1044 | 123.1695 | nan | 52.0551 | 53.2644 |
| DebertaForMaskedLM | 4 | 72.368 | 86.7339 | 79.9128 | nan | 51.2241 | 56.8585 |
| BlenderbotSmallForCausalLM | 64 | 58.7419 | 63.451 | 81.6412 | nan | 48.4367 | 48.0251 |
| ElectraForCausalLM | 32 | 87.1263 | 92.5441 | 121.9039 | nan | 47.9259 | 47.8058 |
| MT5ForConditionalGeneration | 16 | 92.7973 | 126.2801 | 86.0555 | 109.0445 | 40.4142 | 47.4144 |
| Speech2Text2ForCausalLM | 256 | 53.098 | 55.9321 | 77.0944 | 59.0975 | 39.9336 | 38.8178 |
| GPT2ForSequenceClassification | 4 | 90.3985 | 92.8005 | nan | 180.9187 | 39.8452 | 40.8307 |
| BlenderbotForCausalLM | 4 | 90.9647 | nan | nan | nan | nan | 79.3652 |
| AllenaiLongformerBase | 4 | 203.0404 | 216.3057 | 225.0195 | nan | nan | nan |
| YituTechConvBert | 16 | 133.5468 | 137.6252 | 168.5028 | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
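
For reading the speedup and absolute latency tables together: the exact formula behind "Performance speedup" is not stated here, but it appears to be roughly the eager latency divided by the backend latency. The sketch below (Python, hypothetical helper names, values copied from the huggingface tables above) checks that assumption on three models. The ratios land within about 1% of the reported inductor speedups but do not match exactly, so treat the relation as approximate.

```python
# (eager_ms, inductor_ms, reported_inductor_speedup), copied from the
# huggingface "Absolute latency (ms)" and "Performance speedup" tables.
ROWS = {
    "AlbertForMaskedLM": (266.6672, 162.9907, 1.6421),
    "XLNetLMHeadModel": (273.507, 144.9947, 1.8995),
    "OPTForCausalLM": (165.8182, 69.2036, 2.3982),
}


def latency_ratio(eager_ms, backend_ms):
    """Assumed definition: how many times faster the backend is than eager."""
    return eager_ms / backend_ms


def relative_gap(estimate, reported):
    """Relative difference between the latency ratio and the reported speedup."""
    return abs(estimate - reported) / reported


gaps = {
    name: relative_gap(latency_ratio(eager, backend), reported)
    for name, (eager, backend, reported) in ROWS.items()
}

for name, gap in gaps.items():
    print(f"{name}: latency ratio differs from reported speedup by {gap:.2%}")
```

A `0.0` in the speedup table lines up with `fail_to_run` or `nan` entries elsewhere for the same model and backend, so it is probably best read as "no result" rather than as a measured slowdown.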

timm_models suite with amp precision

Performance speedup

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| xcit_large_24_p8_224 | 5 | 0.9992 | 0.0 | 0.0 | 0.0 | 2.1732 | 1.7691 |
| tnt_s_patch16_224 | 128 | 0.9996 | 0.9977 | 0.0 | 0.0 | 1.9912 | 1.9635 |
| twins_pcpvt_base | 64 | 1.0047 | 0.9257 | 0.9173 | 0.0 | 1.8779 | 1.6991 |
| coat_lite_mini | 128 | 1.0 | 0.9955 | 0.845 | 1.1029 | 1.7773 | 1.7415 |
| ghostnet_100 | 128 | 1.0035 | 0.9772 | 0.9245 | 1.0151 | 1.6406 | 1.4004 |
| regnety_002 | 128 | 0.9791 | 0.9418 | 1.2097 | 0.8577 | 1.5987 | 1.2205 |
| volo_d1_224 | 64 | 0.9997 | 0.9937 | 0.8414 | 0.0 | 1.5918 | 1.5623 |
| lcnet_050 | 128 | 0.9671 | 0.9567 | 0.8591 | 1.0289 | 1.5626 | 1.2974 |
| gmixer_24_224 | 128 | 1.0 | 0.8797 | 0.7211 | 0.9243 | 1.5302 | 1.4527 |
| gmlp_s16_224 | 128 | 0.9997 | 0.9949 | 0.7846 | 1.0107 | 1.53 | 1.5053 |
| cait_m36_384 | 4 | 1.0002 | 0.8685 | 0.0 | 0.0 | 1.4806 | 1.4218 |
| swin_base_patch4_window7_224 | 64 | 0.9997 | 0.9605 | 0.0 | 0.0 | 1.4474 | 1.4445 |
| jx_nest_base | 32 | 0.9999 | 0.9922 | 0.801 | 0.0 | 1.3927 | 1.3565 |
| convit_base | 64 | 1.0001 | 0.9964 | 0.833 | 1.2225 | 1.354 | 1.3119 |
| crossvit_9_240 | 128 | 0.9998 | 0.9948 | 0.8382 | 0.9172 | 1.3271 | 1.3068 |
| dm_nfnet_f0 | 128 | 0.9981 | 0.9982 | 0.8791 | 0.9195 | 1.3111 | 1.264 |
| pit_b_224 | 64 | 0.9996 | 0.9953 | 0.8213 | 0.9724 | 1.2785 | 1.2727 |
| nfnet_l0 | 128 | 0.9997 | 0.8096 | 0.7134 | 0.8499 | 1.2778 | 1.2113 |
| deit_base_distilled_patch16_224 | 64 | 1.0 | 0.9918 | 0.7933 | 0.9749 | 1.2778 | 1.2626 |
| mixer_b16_224 | 128 | 0.9999 | 0.9976 | 0.8 | 0.8982 | 1.2757 | 1.2734 |
| beit_base_patch16_224 | 64 | 0.9999 | 0.9775 | 0.0 | 0.0 | 1.2148 | 1.2036 |
| hrnet_w18 | 128 | 1.0035 | 1.0221 | 0.8639 | 0.0 | 1.2072 | 1.0929 |
| convnext_base | 64 | 0.9991 | 0.9952 | 0.7971 | 0.0 | 1.1781 | 1.2086 |
| resmlp_12_224 | 128 | 1.0001 | 0.9991 | 0.7818 | 1.4831 | 1.1599 | 1.1373 |
| adv_inception_v3 | 128 | 0.9999 | 0.9962 | 0.8518 | 1.1412 | 1.1468 | 1.0951 |
| mobilenetv3_large_100 | 128 | 0.9551 | 0.9448 | 0.7939 | 0.9699 | 1.1363 | 0.9466 |
| inception_v3 | 128 | 0.9999 | 0.9963 | 0.8526 | 1.1428 | 1.1362 | 1.0852 |
| vit_base_patch16_224 | 64 | 0.9998 | 0.9935 | 0.8344 | 0.9086 | 1.1237 | 1.1168 |
| gluon_inception_v3 | 128 | 0.9999 | 0.9961 | 0.8534 | 1.1413 | 1.1151 | 1.0804 |
| mnasnet_100 | 128 | 0.9523 | 0.9446 | 0.7882 | 1.2067 | 1.096 | 1.0426 |
| pnasnet5large | 16 | 1.0047 | 1.0375 | 0.8537 | 0.0 | 1.088 | 1.0617 |
| fbnetv3_b | 128 | 0.9544 | 0.9416 | 0.772 | 0.0 | 1.072 | 0.9818 |
| fbnetc_100 | 128 | 0.9528 | 0.9436 | 0.7917 | 1.1206 | 1.0525 | 1.0603 |
| tf_mixnet_l | 128 | 0.9801 | 0.9092 | 0.7942 | 0.0 | 1.0362 | 1.0409 |
| mixnet_l | 128 | 0.9798 | 0.9049 | 0.7933 | 0.0 | 1.0326 | 0.9865 |
| poolformer_m36 | 64 | 0.9997 | 0.9981 | 0.8006 | 0.0 | 0.9662 | 0.9631 |
| spnasnet_100 | 128 | 0.9465 | 0.938 | 0.7744 | 1.0957 | 0.9529 | 0.9637 |
| selecsls42b | 128 | 0.9998 | 0.9952 | 0.8408 | 1.2827 | 0.8812 | 0.8548 |
| repvgg_a2 | 128 | 0.943 | 0.934 | 0.7978 | 1.0687 | 0.8603 | 0.7584 |
| cspdarknet53 | 64 | 0.9415 | 0.934 | 0.7557 | 1.1421 | 0.8258 | 0.8258 |
| dla102 | 128 | 1.0002 | 0.9959 | 0.8369 | 1.3129 | 0.8177 | 0.8404 |
| res2net101_26w_4s | 64 | 1.0014 | 1.0043 | 0.9689 | 0.0 | 0.8035 | 0.7442 |
| gernet_l | 128 | 0.9473 | 0.9382 | 0.7673 | 1.059 | 0.7987 | 0.7642 |
| tf_efficientnet_b0 | 128 | 0.9659 | 0.8075 | 0.6657 | 0.9502 | 0.7967 | 0.9185 |
| mobilevit_s | 64 | 0.9733 | 0.8145 | 0.6561 | 0.0 | 0.7962 | 0.7603 |
| tinynet_a | 128 | 0.964 | 0.8047 | 0.6567 | 0.7894 | 0.7777 | 0.8033 |
| resnest101e | 64 | 0.9999 | 0.9905 | 0.81 | 0.0 | 0.7502 | 0.7386 |
| visformer_small | 128 | 0.9997 | 1.0013 | 0.8413 | 0.0 | 0.7343 | 0.751 |
| convmixer_768_32 | 32 | 0.9999 | 0.9981 | 0.9223 | 0.0 | 0.7244 | 0.7222 |
| dpn107 | 32 | 0.9393 | 0.9265 | 0.7522 | 0.0 | 0.7165 | 0.7794 |
| sebotnet33ts_256 | 64 | 0.9666 | 0.8366 | 0.6789 | 0.9618 | 0.7106 | 0.737 |
| gluon_xception65 | 32 | 0.9998 | 0.9886 | 0.7535 | 0.0 | 0.7097 | 0.6396 |
| ese_vovnet19b_dw | 128 | 0.9704 | 0.9651 | 0.768 | 1.1299 | 0.7089 | 0.654 |
| swsl_resnext101_32x16d | 32 | 0.9994 | 0.9805 | 0.8046 | 0.0 | 0.6821 | 0.6621 |
| eca_botnext26ts_256 | 128 | 0.9811 | 0.8115 | 0.6719 | 1.0716 | 0.6772 | 0.589 |
| rexnet_100 | 128 | 0.9644 | 0.8508 | 0.6891 | 0.0 | 0.665 | 0.6766 |
| res2net50_14w_8s | 128 | 0.9997 | 0.9924 | 0.8259 | 0.996 | 0.6495 | 0.6665 |
| mobilenetv2_100 | 128 | 0.9515 | 0.941 | 0.7203 | 1.1239 | 0.6446 | 0.718 |
| res2next50 | 128 | 0.9996 | 0.9951 | 0.8327 | 1.1454 | 0.6058 | 0.6478 |
| botnet26t_256 | 128 | 0.9798 | 0.9752 | 0.812 | 1.2676 | 0.5853 | 0.7098 |
| eca_halonext26ts | 128 | 0.9814 | 0.817 | 0.6744 | 0.0 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+---------------------------------+----+-------+-------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+-------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass
| pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | cait_m36_384 | 2 | pass | pass | pass | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | 
pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_accuracy | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | ghostnet_100 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+-------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 6.9025 | 33.0579 | 60.2099 | nan | 215.1034 | 206.6616 | | res2net50_14w_8s | 128 | 3.0912 | 16.0997 | 26.7883 | 319.3918 | 201.8613 | 197.5206 | | pnasnet5large | 16 | 5.3927 | 24.6172 | 43.7671 | nan | 198.1756 | 193.747 | | ghostnet_100 | 128 | 3.5568 | 10.6747 | 15.9748 | 187.7565 | 198.0537 | 196.0626 | | res2net101_26w_4s | 64 | 3.5108 | 17.8358 | 29.9033 | nan | 138.2807 | 136.6565 | | dpn107 | 32 | 4.3786 | 14.7237 | 36.0684 | nan | 136.4089 | 132.843 | | rexnet_100 | 128 | 2.167 | 8.3658 | 16.8117 | nan | 125.0278 | 122.9583 | | twins_pcpvt_base | 64 | 2.9504 | 16.1083 | 26.884 | nan | 123.7006 | 122.246 | | mobilevit_s | 64 | 2.0451 | 8.2578 | 15.1197 | nan | 118.9846 | 115.8305 | | fbnetv3_b | 128 | 3.6058 | 12.4731 | 29.5357 | nan | 117.5429 | 115.776 | | resnest101e | 64 | 3.7426 | 17.4793 | 28.6484 | nan | 117.1992 | 113.7224 | | tinynet_a | 128 | 2.4031 | 8.6858 | 19.3787 | 197.1436 | 103.7281 | 101.1372 | | tf_mixnet_l | 128 | 3.9472 | 11.8465 | 23.8233 | nan | 103.2427 | 102.1706 | | mixnet_l | 128 | 3.4485 | 11.4102 | 23.2883 | nan | 102.9893 | 102.1985 | | gluon_inception_v3 | 128 | 1.7929 | 9.2786 | 14.1569 | 179.2817 | 102.3433 | 99.5532 | | inception_v3 | 128 | 1.8325 | 9.2358 | 14.0367 | 177.9035 | 101.5529 | 99.5341 | | adv_inception_v3 | 128 | 1.8293 | 9.279 | 14.5093 | 176.0145 | 101.4158 | 99.3248 | | fbnetc_100 | 128 | 2.2896 | 7.2624 | 16.9827 | 138.7759 | 94.7515 | 92.406 | | dla102 | 128 | 2.0499 | 10.2882 | 16.1183 | 241.5684 | 92.4918 | 90.5909 | | poolformer_m36 | 64 | 1.8547 | 8.0496 | 12.7548 | nan | 88.2767 | 83.0664 | | cspdarknet53 | 64 | 
2.5482 | 8.0815 | 18.4731 | 144.6914 | 87.9499 | 85.0861 | | xcit_large_24_p8_224 | 5 | 3.5944 | nan | nan | nan | 87.9028 | 84.3662 | | spnasnet_100 | 128 | 2.2331 | 7.1325 | 16.5544 | 132.553 | 86.7695 | 86.2104 | | mobilenetv3_large_100 | 128 | 1.8515 | 6.2412 | 14.4158 | 140.9101 | 86.7668 | 83.5082 | | res2next50 | 128 | 1.7546 | 8.9165 | 13.8726 | 196.1626 | 84.636 | 82.6092 | | tf_efficientnet_b0 | 128 | 2.0764 | 7.4781 | 16.1982 | 178.9155 | 82.0382 | 82.007 | | swin_base_patch4_window7_224 | 64 | 3.5769 | 13.9632 | nan | nan | 81.1022 | 80.4512 | | sebotnet33ts_256 | 64 | 1.8766 | 6.8582 | 13.4794 | 149.7413 | 79.552 | 77.8855 | | gluon_xception65 | 32 | 2.2556 | 11.9086 | 18.9004 | nan | 76.9651 | 75.4354 | | mobilenetv2_100 | 128 | 1.8602 | 6.0419 | 13.6261 | 118.5689 | 76.5033 | 73.5315 | | cait_m36_384 | 4 | 3.8255 | 21.6944 | nan | nan | 74.7266 | 71.2268 | | mnasnet_100 | 128 | 1.9679 | 5.8334 | 13.1976 | 109.7497 | 74.6696 | 72.811 | | swsl_resnext101_32x16d | 32 | 2.0368 | 10.0322 | 15.6742 | nan | 72.9619 | 70.5155 | | coat_lite_mini | 128 | 1.4675 | 5.7557 | 9.2748 | 111.9494 | 72.8503 | 71.8567 | | regnety_002 | 128 | 1.8139 | 6.1618 | 13.111 | 113.1994 | 70.5119 | 68.7016 | | convnext_base | 64 | 1.5367 | 8.3718 | 12.83 | nan | 70.0315 | 69.3219 | | jx_nest_base | 32 | 2.0517 | 10.123 | 16.5257 | nan | 68.9408 | 66.0433 | | dm_nfnet_f0 | 128 | 2.3975 | 7.8914 | 11.3909 | 152.8902 | 68.6093 | 66.7874 | | eca_botnext26ts_256 | 128 | 1.614 | 5.4996 | 10.2832 | 120.6915 | 60.335 | 59.1024 | | visformer_small | 128 | 1.082 | 4.4605 | 6.5775 | nan | 59.7669 | 58.7316 | | botnet26t_256 | 128 | 1.546 | 4.7717 | 9.6319 | 94.8821 | 59.0056 | 56.7091 | | ese_vovnet19b_dw | 128 | 1.1426 | 3.3358 | 6.9782 | 69.1309 | 54.3945 | 54.0674 | | selecsls42b | 128 | 0.9462 | 4.0418 | 6.2598 | 90.6665 | 53.3247 | 52.5178 | | lcnet_050 | 128 | 1.229 | 3.7922 | 8.0026 | 79.7835 | 52.7064 | 51.7962 | | gernet_l | 128 | 2.2216 | 6.6769 | 15.3177 | 114.6091 | 52.1835 | 
50.8323 | | nfnet_l0 | 128 | 2.1312 | 7.7538 | 11.7078 | 138.6591 | 50.7336 | 49.6317 | | gmlp_s16_224 | 128 | 1.3811 | 8.0617 | 13.2055 | 163.4016 | 46.3586 | 44.851 | | volo_d1_224 | 64 | 1.5183 | 8.4164 | 13.531 | nan | 43.3297 | 41.133 | | crossvit_9_240 | 128 | 1.9169 | 9.3167 | 13.9829 | 166.5674 | 42.9576 | 40.9861 | | tnt_s_patch16_224 | 128 | 2.0293 | 11.6225 | nan | nan | 42.1645 | 40.4392 | | repvgg_a2 | 128 | 2.2484 | 6.6367 | 15.3925 | 188.5064 | 38.156 | 36.8932 | | gmixer_24_224 | 128 | 1.7058 | 8.963 | 14.203 | 164.7094 | 38.1238 | 37.5083 | | convmixer_768_32 | 32 | 1.3625 | 6.8719 | 10.4391 | nan | 34.5438 | 32.4748 | | convit_base | 64 | 1.2567 | 6.496 | 10.0278 | 133.4378 | 31.6187 | 30.5882 | | mixer_b16_224 | 128 | 0.7737 | 4.0347 | 6.5575 | 80.0698 | 26.8377 | 25.3742 | | deit_base_distilled_patch16_224 | 64 | 1.1007 | 5.1472 | 8.1334 | 78.8433 | 25.421 | 23.8176 | | pit_b_224 | 64 | 1.2465 | 5.7622 | 8.8034 | 98.9408 | 25.3237 | 24.3391 | | resmlp_12_224 | 128 | 0.7749 | 3.4345 | 5.1382 | 55.1757 | 25.1654 | 23.934 | | vit_base_patch16_224 | 64 | 1.1513 | 5.0766 | 7.777 | 81.517 | 24.6081 | 23.0678 | | beit_base_patch16_224 | 64 | 1.4033 | 6.286 | nan | nan | 23.289 | 22.0794 | | eca_halonext26ts | 128 | 1.6694 | 5.6142 | 11.1122 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3054 | 0.5979 | 1.3138 | 1.3772 | | gmlp_s16_224 | 128 | 0.9938 | 0.9715 | 0.3561 | 1.3557 | 1.284 | 1.2997 | | 
tinynet_a | 128 | 0.9889 | 0.7884 | 0.2764 | 0.4726 | 1.1635 | 1.4912 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0842 | 1.1492 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3633 | nan | 1.0576 | 1.2943 | | convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0528 | 1.1534 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.045 | 1.3028 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.038 | 1.1389 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.2852 | nan | 1.0012 | 1.2582 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 1.0004 | 1.0447 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9907 | 1.2271 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 0.9796 | 0.9842 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3402 | nan | 0.9745 | 1.0806 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 0.9567 | 1.1357 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2875 | nan | 0.9484 | 1.057 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9464 | 0.9678 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 0.9295 | 1.0969 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.9289 | 0.9803 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.9288 | 0.9904 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.548 | 0.9184 | 1.2283 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2684 | 0.3766 | 0.9135 | 1.123 | | mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.9088 | 0.9818 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9066 | 0.9846 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 0.8962 | 1.1046 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3594 | 1.222 | 0.8916 | 0.8968 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8911 | 0.8962 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2716 | nan | 0.8815 | 0.98 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2669 | 0.476 | 0.8765 | 1.1944 | | dla102 | 128 | 0.9694 | 0.912 | 
0.3362 | 0.9309 | 0.8723 | 1.0162 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.8648 | 1.0056 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8599 | 0.9862 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3343 | 0.8578 | 0.8599 | 0.9862 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8599 | 0.9862 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3675 | nan | 0.852 | 0.9728 | | dpn107 | 32 | 0.997 | 0.9097 | 0.3531 | nan | 0.8455 | 0.9441 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8442 | 0.965 | | cspdarknet53 | 64 | 0.9913 | 0.8405 | 0.3241 | 0.8382 | 0.8368 | 0.9122 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.8175 | 1.1003 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8146 | 0.9442 | | resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.8092 | 0.8239 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.8041 | 1.0134 | | convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.8022 | 1.0059 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8765 | 0.7927 | 0.9534 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.787 | 0.9294 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.7834 | 1.0066 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.7727 | 0.9234 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.7713 | 0.9528 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.7706 | 1.0052 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.7697 | 0.9414 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.7607 | 0.9414 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.7566 | 0.9257 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.75 | 0.9634 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5513 | 0.7318 | 0.8133 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3443 | 0.8161 | 0.7239 | 0.9336 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 
0.3307 | 0.7468 | 0.7101 | 0.9306 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3361 | 0.8188 | 0.6955 | 0.8352 | | jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6668 | 0.8553 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.6615 | 0.9433 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.5858 | 0.8993 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5572 | 0.8383 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.2673 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 296.9765 | 297.1603 | 321.8622 | nan | 409.5828 | 410.7983 | | hrnet_w18 | 128 | 295.5132 | 291.0483 | 346.6461 | nan | 250.1402 | 277.3807 | | res2net50_14w_8s | 128 | 145.9952 | 146.8704 | 181.4983 | 146.559 | 238.0419 | 220.5676 | | res2next50 | 128 | 138.6761 | 138.7374 | 166.446 | 120.8854 | 228.2046 | 213.1311 | | dla102 | 128 | 178.4143 | 179.3673 | 213.3889 | 135.9678 | 218.0316 | 211.944 | | resnest101e | 64 | 164.6524 | 165.4732 | 200.4742 | nan | 217.0688 | 220.5576 | | pnasnet5large | 16 | 218.8047 | 213.7001 | 258.9435 | nan | 206.1977 | 212.3902 | | tf_mixnet_l | 128 | 195.2779 | 210.2809 | 240.7185 | nan | 184.8941 | 183.8969 | | tnt_s_patch16_224 | 128 | 363.6223 | 364.1922 | nan | nan | 182.4028 | 184.8961 | | botnet26t_256 | 128 | 105.9182 | 106.3846 | 127.9109 | 81.7839 | 177.3305 | 146.0054 | | mixnet_l | 128 | 186.4642 | 202.0404 | 230.4516 | nan | 177.2894 | 185.4535 | | swsl_resnext101_32x16d | 32 | 118.5749 | 
120.1427 | 146.4451 | nan | 172.5356 | 178.1725 | | eca_botnext26ts_256 | 128 | 111.8215 | 135.419 | 163.7273 | 102.4869 | 162.1614 | 186.1446 | | res2net101_26w_4s | 64 | 120.7101 | 121.5923 | 126.7719 | nan | 155.3518 | 166.9582 | | poolformer_m36 | 64 | 148.6099 | 149.1346 | 185.6509 | nan | 153.9228 | 154.4681 | | dpn107 | 32 | 113.6253 | 116.1699 | 143.1648 | nan | 151.7821 | 146.6549 | | gluon_inception_v3 | 128 | 161.2865 | 161.8983 | 189.1671 | 141.3041 | 144.7496 | 149.2473 | | inception_v3 | 128 | 160.4804 | 161.3953 | 188.6667 | 140.528 | 141.4372 | 148.1992 | | adv_inception_v3 | 128 | 161.2758 | 162.0325 | 189.5145 | 141.2303 | 140.6534 | 147.1422 | | gluon_xception65 | 32 | 98.0623 | 98.7591 | 129.7575 | nan | 137.7818 | 152.8539 | | convit_base | 64 | 181.2855 | 182.0474 | 217.5169 | 148.3333 | 133.9952 | 138.2148 | | visformer_small | 128 | 98.1327 | 98.1227 | 116.7138 | nan | 133.4591 | 130.0812 | | rexnet_100 | 128 | 91.0253 | 103.1762 | 127.5198 | nan | 131.9366 | 129.7322 | | pit_b_224 | 64 | 154.8091 | 155.5064 | 188.2582 | 159.0666 | 121.0151 | 121.624 | | sebotnet33ts_256 | 64 | 83.2208 | 95.9341 | 118.5186 | 83.4849 | 113.1779 | 109.1405 | | cait_m36_384 | 4 | 166.6571 | 191.1706 | nan | nan | 112.5695 | 116.6718 | | fbnetv3_b | 128 | 121.0003 | 122.4612 | 149.4198 | nan | 112.3339 | 119.4149 | | beit_base_patch16_224 | 64 | 135.0471 | 138.0991 | nan | nan | 111.1395 | 112.1189 | | mobilevit_s | 64 | 89.8711 | 107.3113 | 133.5833 | nan | 110.0205 | 114.9353 | | cspdarknet53 | 64 | 95.9054 | 96.7228 | 119.8504 | 79.1033 | 109.5513 | 109.2283 | | tf_efficientnet_b0 | 128 | 90.5524 | 108.3244 | 131.4491 | 92.0823 | 109.5291 | 95.187 | | vit_base_patch16_224 | 64 | 120.6883 | 121.3169 | 144.401 | 132.5006 | 107.3128 | 107.9332 | | convnext_base | 64 | 121.5635 | 121.8172 | 152.1672 | nan | 103.1619 | 100.2979 | | swin_base_patch4_window7_224 | 64 | 147.5325 | 153.0581 | nan | nan | 101.9103 | 102.06 | | dm_nfnet_f0 | 128 | 131.9499 | 131.2511 | 
148.8575 | 142.7754 | 99.9614 | 103.6037 | | mobilenetv2_100 | 128 | 67.5966 | 68.3734 | 89.3445 | 57.2549 | 99.8321 | 89.5722 | | tinynet_a | 128 | 75.2571 | 89.8741 | 110.9047 | 98.0392 | 95.0159 | 92.2273 | | gernet_l | 128 | 79.5899 | 80.4346 | 98.7119 | 71.2605 | 94.7628 | 98.7336 | | mixer_b16_224 | 128 | 118.6015 | 118.6567 | 148.2001 | 131.7561 | 93.4257 | 93.309 | | ese_vovnet19b_dw | 128 | 67.8858 | 68.2027 | 85.9244 | 58.2631 | 92.9527 | 100.7197 | | gmlp_s16_224 | 128 | 136.0235 | 136.6572 | 173.5453 | 134.7496 | 89.0955 | 90.3276 | | repvgg_a2 | 128 | 79.6247 | 80.4721 | 94.258 | 70.3004 | 87.5278 | 99.1579 | | jx_nest_base | 32 | 118.8945 | 119.782 | 148.3153 | nan | 85.3918 | 87.5894 | | volo_d1_224 | 64 | 134.7602 | 135.1353 | 159.9781 | nan | 84.4106 | 86.0502 | | nfnet_l0 | 128 | 106.3555 | 131.1556 | 148.8421 | 124.3022 | 83.0714 | 87.6159 | | crossvit_9_240 | 128 | 109.1145 | 109.8358 | 130.5255 | 119.1668 | 82.4151 | 83.4044 | | fbnetc_100 | 128 | 87.9214 | 88.7111 | 105.8794 | 74.6735 | 79.6513 | 79.0491 | | gmixer_24_224 | 128 | 119.6921 | 136.268 | 166.5746 | 129.914 | 78.5273 | 82.446 | | spnasnet_100 | 128 | 76.5813 | 77.1983 | 93.6047 | 66.1525 | 76.0301 | 75.1079 | | deit_base_distilled_patch16_224 | 64 | 94.1639 | 94.8626 | 118.8019 | 96.5203 | 73.8381 | 74.556 | | selecsls42b | 128 | 62.7072 | 63.0109 | 74.7565 | 48.946 | 71.4269 | 73.3899 | | twins_pcpvt_base | 64 | 124.6161 | 136.5308 | 138.1412 | nan | 69.4763 | 83.9345 | | coat_lite_mini | 128 | 115.9536 | 116.3039 | 137.1607 | 105.177 | 65.3158 | 66.603 | | xcit_large_24_p8_224 | 5 | 132.0175 | nan | nan | nan | 61.4166 | 86.7336 | | mnasnet_100 | 128 | 70.241 | 70.6809 | 84.8726 | 55.3151 | 61.0413 | 64.135 | | ghostnet_100 | 128 | 95.4073 | 98.3601 | 110.6314 | 95.2827 | 59.8192 | 69.7108 | | resmlp_12_224 | 128 | 68.1313 | 68.2086 | 87.2355 | 45.9318 | 58.7959 | 59.9866 | | mobilenetv3_large_100 | 128 | 65.9659 | 66.5786 | 82.6238 | 64.8675 | 55.4929 | 67.2547 | | regnety_002 | 
128 | 54.1266 | 55.5164 | 47.0275 | 61.9508 | 34.9572 | 43.7325 | | lcnet_050 | 128 | 34.1881 | 36.8471 | 39.7607 | 32.0272 | 21.2499 | 27.3596 | | eca_halonext26ts | 128 | 115.7669 | 139.3693 | 168.9013 | nan | nan | nan | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

bench_logs/torchbench_amp.png : ![](https://i.imgur.com/mATn2Ux.png)
bench_logs/huggingface_amp.png : ![](https://i.imgur.com/SeyCDw8.png)
bench_logs/timm_models_amp.png : ![](https://i.imgur.com/S3RKjCP.png)

anijain2305 commented 1 year ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 55/56 | 100%, 46/46 | 100%, 61/61 |
|       aot_eager        | 95%, 53/56 | 100%, 46/46 | 100%, 61/61 |
|     aot_cudagraphs     | 75%, 42/56 | 37%, 17/46  | 46%, 28/61  |
|    nvprims_nvfuser     | 77%, 43/56 | 61%, 28/46  | 67%, 41/61  |
|        inductor        | 84%, 47/56 | 85%, 39/46  | 95%, 58/61  |
| inductor_no_cudagraphs | 89%, 50/56 | 93%, 43/46  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
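
Each pass rate cell pairs a rounded percentage with the raw count. A minimal sketch of how such a cell could be produced, assuming each model has a status string like those in the accuracy tables (`pass`, `fail_to_run`, `fail_accuracy`). The `pass_rate` helper is hypothetical and is not taken from the benchmark scripts:

```python
def pass_rate(statuses):
    """Format a pass rate as '<percent>%, <passed>/<total>'.

    `statuses` is a list of per-model accuracy results. Only the exact
    string "pass" counts as passing (an assumption of this sketch).
    """
    total = len(statuses)
    passed = sum(1 for status in statuses if status == "pass")
    percent = round(100 * passed / total) if total else 0
    return f"{percent}%, {passed}/{total}"


# 55 passing models out of 56, matching the eager/torchbench cell above.
print(pass_rate(["pass"] * 55 + ["fail_to_run"]))  # prints "98%, 55/56"
```

Under this reading, a model that fails to run counts against the pass rate in the same way as one that fails the accuracy check.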

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.01x    |    1.00x    |    1.00x    |
|       aot_eager        |   1.02x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.12x    |    1.00x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.04x    |    1.14x    |
|        inductor        |   1.47x    |    1.23x    |    1.23x    |
| inductor_no_cudagraphs |   1.24x    |    1.21x    |    1.23x    |
+------------------------+------------+-------------+-------------+
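
A rough sketch of how a geometric mean speedup could be computed from per-model results. It assumes that per-model speedup is eager latency divided by backend latency, and that models failing the accuracy check are dropped first, as the summary describes. Skipping `0.0` speedups (failed performance runs) is an additional assumption of this sketch, and `geomean_speedup` is a hypothetical helper, not the dashboard's code:

```python
import math


def geomean_speedup(rows):
    """Geometric mean of speedups over models that pass accuracy.

    `rows` is a list of (accuracy_status, speedup) pairs. Rows that fail
    accuracy are dropped. Rows with a 0.0 speedup are also dropped, since
    a geometric mean is undefined for zero (assumption of this sketch).
    """
    kept = [s for status, s in rows if status == "pass" and s > 0.0]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


# Illustrative values only, not taken from the report.
rows = [("pass", 2.0), ("pass", 0.5), ("pass", 0.0), ("fail_accuracy", 4.0)]
print(f"{geomean_speedup(rows):.2f}x")  # prints "1.00x"
```

The geometric mean is used rather than the arithmetic mean because speedups are ratios: a 2x speedup and a 0.5x slowdown cancel out to 1.00x.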

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    1.74    |    2.79     |    2.02     |
|       aot_eager        |    5.90    |    8.93     |    7.71     |
|     aot_cudagraphs     |    8.52    |    16.14    |    13.43    |
|    nvprims_nvfuser     |   59.77    |    86.65    |   139.56    |
|        inductor        |   32.89    |    37.33    |    37.13    |
| inductor_no_cudagraphs |   32.84    |    31.43    |    36.02    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.98x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.92x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.90x    |    1.01x    |    0.95x    |
|        inductor        |   0.83x    |    0.74x    |    0.97x    |
| inductor_no_cudagraphs |   0.99x    |    1.00x    |    1.09x    |
+------------------------+------------+-------------+-------------+
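
One way to read these numbers, assuming the ratio is defined as eager peak memory divided by the backend's peak memory (the report does not spell out the formula, so this definition is an assumption): values above 1.0 mean the backend used less memory than eager. The helper names below are hypothetical:

```python
WARN_THRESHOLD = 0.9  # the report flags compression ratios below 0.9


def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Eager peak memory divided by backend peak memory (assumed definition)."""
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return eager_peak_bytes / backend_peak_bytes


def is_flagged(ratio, threshold=WARN_THRESHOLD):
    """True when the ratio falls below the warning threshold."""
    return ratio < threshold


# Illustrative values only: eager peaks at 8 GiB, the backend at 10 GiB.
ratio = compression_ratio(8 * 1024**3, 10 * 1024**3)
print(f"{ratio:.2f}x", is_flagged(ratio))  # prints "0.80x True"
```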

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the two most recent reports that actually ran the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Previous report name: /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_float32_565

Passrate diff
~~~
+------------------------+-------------+------------+------------+
| compiler               | suite       | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
| inductor               | torchbench  | 84%, 47/56 | 84%, 47/56 |
| inductor               | huggingface | 87%, 40/46 | 85%, 39/46 |
| inductor               | timm_models | 93%, 57/61 | 95%, 58/61 |
| inductor_no_cudagraphs | torchbench  | 89%, 50/56 | 89%, 50/56 |
| inductor_no_cudagraphs | huggingface | 93%, 43/46 | 93%, 43/46 |
| inductor_no_cudagraphs | timm_models | 93%, 57/61 | 95%, 58/61 |
+------------------------+-------------+------------+------------+
~~~
Geometric mean speedup diff
~~~
+------------------------+-------------+------------+-----------+
| compiler               | suite       | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
| inductor               | torchbench  | 1.47x      | 1.46x     |
| inductor               | huggingface | 1.23x      | 1.23x     |
| inductor               | timm_models | 1.23x      | 1.23x     |
| inductor_no_cudagraphs | torchbench  | 1.23x      | 1.23x     |
| inductor_no_cudagraphs | huggingface | 1.21x      | 1.22x     |
| inductor_no_cudagraphs | timm_models | 1.23x      | 1.23x     |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+---------------------------------+---------------+------------------------+
| suite       | name                            | inductor      | inductor_no_cudagraphs |
+-------------+---------------------------------+---------------+------------------------+
| torchbench  | tacotron2                       | fail_to_run   | pass                   |
| torchbench  | functorch_dp_cifar10            | fail_to_run   | fail_to_run            |
| torchbench  | hf_Longformer                   | fail_to_run   | fail_to_run            |
| torchbench  | hf_BigBird                      | fail_to_run   | fail_to_run            |
| torchbench  | moco                            | fail_to_run   | fail_to_run            |
| torchbench  | resnet50_quantized_qat          | fail_accuracy | fail_accuracy          |
| torchbench  | mobilenet_v2_quantized_qat      | fail_accuracy | fail_accuracy          |
| torchbench  | vision_maskrcnn                 | 0.0000        | 0.0000                 |
| huggingface | DebertaV2ForQuestionAnswering   | fail_to_run   | pass                   |
| huggingface | PLBartForConditionalGeneration  | fail_to_run   | fail_to_run            |
| huggingface | MBartForConditionalGeneration   | fail_to_run   | fail_to_run            |
| huggingface | AllenaiLongformerBase           | fail_to_run   | fail_to_run            |
| timm_models | deit_base_distilled_patch16_224 | pass          | fail_accuracy          |
| timm_models | fbnetv3_b                       | fail_accuracy | fail_accuracy          |
| timm_models | resnest101e                     | fail_accuracy | fail_accuracy          |
+-------------+---------------------------------+---------------+------------------------+
~~~
Performance speedup warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | lennard_jones                 | 1.7428   | 0.9349                 |
| torchbench  | soft_actor_critic             | 1.3854   | 0.9046                 |
| torchbench  | nvidia_deeprecommender        | 0.8349   | 0.8834                 |
| torchbench  | hf_GPT2_large                 | 0.0      | 1.4698                 |
| torchbench  | hf_T5                         | 0.0      | 1.5265                 |
| torchbench  | tacotron2                     | 0.0      | 0.8986                 |
| torchbench  | functorch_dp_cifar10          | 0.0      | 0.0                    |
| torchbench  | hf_Longformer                 | 0.0      | 0.0                    |
| torchbench  | hf_BigBird                    | 0.0      | 0.0                    |
| torchbench  | moco                          | 0.0      | 0.0                    |
| huggingface | DebertaV2ForQuestionAnswering | 1.0002   | 0.9042                 |
| huggingface | DebertaV2ForMaskedLM          | 0.9894   | 0.8249                 |
| huggingface | TrOCRForCausalLM              | 0.0      | 1.0262                 |
| huggingface | AlbertForMaskedLM             | 0.0      | 1.2511                 |
| huggingface | BlenderbotForCausalLM         | 0.0      | 1.0129                 |
| huggingface | AllenaiLongformerBase         | 0.0      | 0.0                    |
| timm_models | tnt_s_patch16_224             | 0.0      | 1.5086                 |
+-------------+-------------------------------+----------+------------------------+
~~~
Compilation latency (sec) warnings
~~~
+-------------+-------------------------------+----------+------------------------+
| suite       | name                          | inductor | inductor_no_cudagraphs |
+-------------+-------------------------------+----------+------------------------+
| torchbench  | yolov3                        | 363.3223 | 363.5352               |
| torchbench  | timm_efficientdet             | 122.1681 | 119.8486               |
| huggingface | XLNetLMHeadModel              | 159.6013 | 164.8345               |
| huggingface | DebertaV2ForQuestionAnswering | 135.5197 | 49.4771                |
| huggingface | DebertaV2ForMaskedLM          | 134.523  | 50.052                 |
+-------------+-------------------------------+----------+------------------------+
~~~
Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0023                 |
| torchbench  | mobilenet_v3_large                      | 0.8675   | 0.896                  |
| torchbench  | hf_T5_large                             | 0.8643   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8621   | 1.031                  |
| torchbench  | densenet121                             | 0.857    | 1.0006                 |
| torchbench  | resnet50                                | 0.8564   | 0.9343                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | pytorch_unet                            | 0.8484   | 1.0138                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.1229                 |
| torchbench  | hf_Bart                                 | 0.8325   | 1.1284                 |
| torchbench  | resnext50_32x4d                         | 0.8303   | 0.8352                 |
| torchbench  | BERT_pytorch                            | 0.8263   | 1.0815                 |
| torchbench  | dlrm                                    | 0.7932   | 0.8152                 |
| torchbench  | hf_Albert                               | 0.7685   | 1.2076                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7609   | 0.9526                 |
| torchbench  | timm_vision_transformer                 | 0.7507   | 0.8214                 |
| torchbench  | soft_actor_critic                       | 0.7501   | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8335                 |
| torchbench  | hf_Bert                                 | 0.7061   | 1.0275                 |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | LearningToPaint                         | 0.6881   | 0.913                  |
| torchbench  | vgg16                                   | 0.6637   | 0.9553                 |
| torchbench  | hf_DistilBert                           | 0.6595   | 0.9466                 |
| torchbench  | hf_Reformer                             | 0.577    | 1.0026                 |
| torchbench  | lennard_jones                           | 0.5647   | 0.9991                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4867   | 0.6781                 |
| torchbench  | pytorch_struct                          | 0.4213   | 0.4334                 |
| torchbench  | dcgan                                   | 0.2564   | 0.2576                 |
| huggingface | YituTechConvBert                        | 0.894    | 0.9822                 |
| huggingface | DistillGPT2                             | 0.8939   | 1.0108                 |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4307                 |
| huggingface | PegasusForConditionalGeneration         | 0.8637   | 1.0262                 |
| huggingface | M2M100ForConditionalGeneration          | 0.8606   | 1.0096                 |
| huggingface | PLBartForCausalLM                       | 0.8367   | 1.0581                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | T5ForConditionalGeneration              | 0.8129   | 1.1049                 |
| huggingface | T5Small                                 | 0.8129   | 1.1049                 |
| huggingface | ElectraForCausalLM                      | 0.7929   | 0.9036                 |
| huggingface | MBartForConditionalGeneration           | 0.7896   | 0.9837                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.9692                 |
| huggingface | MT5ForConditionalGeneration             | 0.7748   | 0.9324                 |
| huggingface | BartForConditionalGeneration            | 0.7734   | 0.958                  |
| huggingface | MegatronBertForQuestionAnswering        | 0.7709   | 1.0379                 |
| huggingface | MegatronBertForCausalLM                 | 0.7673   | 1.0153
| | huggingface | MBartForCausalLM | 0.7326 | 0.9478 | | huggingface | RobertaForQuestionAnswering | 0.7273 | 1.0274 | | huggingface | BertForQuestionAnswering | 0.7273 | 1.0273 | | huggingface | LayoutLMForSequenceClassification | 0.7189 | 1.0294 | | huggingface | BartForCausalLM | 0.7149 | 0.9466 | | huggingface | BlenderbotSmallForCausalLM | 0.7147 | 0.8647 | | huggingface | ElectraForQuestionAnswering | 0.7054 | 1.0297 | | huggingface | BlenderbotSmallForConditionalGeneration | 0.6977 | 0.946 | | huggingface | LayoutLMForMaskedLM | 0.695 | 0.9772 | | huggingface | BertForMaskedLM | 0.6945 | 0.9772 | | huggingface | CamemBert | 0.6942 | 0.9746 | | huggingface | RobertaForCausalLM | 0.6942 | 0.9771 | | huggingface | Speech2Text2ForCausalLM | 0.675 | 0.9168 | | huggingface | DistilBertForQuestionAnswering | 0.6589 | 0.9118 | | huggingface | DistilBertForMaskedLM | 0.6509 | 0.9194 | | huggingface | DebertaV2ForMaskedLM | 0.5682 | 0.9491 | | huggingface | MobileBertForMaskedLM | 0.4951 | 0.6649 | | huggingface | DebertaV2ForQuestionAnswering | 0.4735 | 0.984 | | huggingface | MobileBertForQuestionAnswering | 0.4145 | 0.535 | | huggingface | DebertaForMaskedLM | 0.3862 | 1.0347 | | huggingface | DebertaForQuestionAnswering | 0.2902 | 1.1339 | | huggingface | BlenderbotForCausalLM | nan | 0.8509 | | timm_models | selecsls42b | 0.899 | 1.0046 | | timm_models | swsl_resnext101_32x16d | 0.8932 | 0.9946 | | timm_models | res2net50_14w_8s | 0.8821 | 1.0206 | | timm_models | regnety_002 | 0.8617 | 1.0396 | | timm_models | botnet26t_256 | 0.8605 | 0.9622 | | timm_models | pit_b_224 | 0.8525 | 1.0752 | | timm_models | convnext_base | 0.8485 | 1.0335 | | timm_models | sebotnet33ts_256 | 0.8189 | 0.9416 | | timm_models | resmlp_12_224 | 0.8169 | 0.8253 | | timm_models | coat_lite_mini | 0.8154 | 1.0235 | | timm_models | gernet_l | 0.7928 | 0.9926 | | timm_models | repvgg_a2 | 0.7684 | 0.9902 | | timm_models | convit_base | 0.7449 | 0.9008 | | timm_models | crossvit_9_240 | 
0.6745 | 0.9137 | | timm_models | tnt_s_patch16_224 | nan | 0.8633 | +-------------+-----------------------------------------+----------+------------------------+ ~~~
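The rows in the warnings tables above are consistent with a simple per-metric threshold check. The sketch below illustrates that idea; the three cutoffs (speedup below 0.95, compilation latency above 120 seconds, compression ratio below 0.9) are assumptions inferred from the rows shown in this report, not values taken from the dashboard's source, and `flag_models` is a hypothetical helper name.

```python
import math

# Hypothetical cutoffs, inferred from the rows in the warnings tables;
# the dashboard's actual thresholds may differ.
SPEEDUP_MIN = 0.95
COMPILE_LATENCY_MAX_SEC = 120.0
COMPRESSION_RATIO_MIN = 0.9


def flag_models(rows, compilers, is_bad):
    """Return names of models for which any compiler column satisfies is_bad."""
    return [
        row["name"]
        for row in rows
        if any(is_bad(row[c]) for c in compilers)
    ]


compilers = ["inductor", "inductor_no_cudagraphs"]

# A few rows copied from the tables in this report.
memory_rows = [
    {"name": "timm_resnest", "inductor": 0.8982, "inductor_no_cudagraphs": 1.0023},
    {"name": "pytorch_stargan", "inductor": 0.9023, "inductor_no_cudagraphs": 1.0693},
    {"name": "hf_GPT2_large", "inductor": math.nan, "inductor_no_cudagraphs": 1.1831},
]
latency_rows = [
    {"name": "yolov3", "inductor": 363.3223, "inductor_no_cudagraphs": 363.5352},
    {"name": "hf_T5_large", "inductor": 112.4542, "inductor_no_cudagraphs": 110.6742},
]

# Comparisons against NaN are always False, so a missing measurement
# never triggers a latency or memory warning in this sketch.
low_memory = flag_models(memory_rows, compilers, lambda v: v < COMPRESSION_RATIO_MIN)
slow_compile = flag_models(latency_rows, compilers, lambda v: v > COMPILE_LATENCY_MAX_SEC)

print(low_memory)
print(slow_compile)
```

Under these assumed cutoffs, a failed run reported as a speedup of 0.0 would still be flagged by the speedup check, while a `nan` latency or memory value would not be, which matches how the failing models appear in the tables above.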

Recent Regressions

For each relevant compiler, we compare the two most recent reports (those that actually ran the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_float32_285
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_float32_285
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441

No regressions found.

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_float32_285
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_float32_285
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_float32_285
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_float32_285
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_float32_441

No regressions found.
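The comparison described in this section amounts to a set difference between the models flagged in two consecutive reports. The following is a minimal sketch of that idea, not the dashboard's actual code: the function names, the report names `report_a` through `report_c`, and the data layout (a list of `(report_name, {compiler: flagged_models})` pairs, oldest first) are all hypothetical.

```python
def latest_two_reports(reports, compiler):
    """Pick the two most recent reports that contain results for `compiler`.

    `reports` is a list of (report_name, {compiler: set_of_flagged_models})
    pairs ordered oldest first. Returns (previous, current) or None.
    """
    ran = [report for report in reports if compiler in report[1]]
    if len(ran) < 2:
        return None
    return ran[-2], ran[-1]


def find_regressions(previous_flagged, current_flagged):
    """Models flagged in the current report but not in the previous one."""
    return sorted(set(current_flagged) - set(previous_flagged))


# Hypothetical history: the middle report did not run inductor at all,
# so it is skipped when choosing the two reports to compare.
reports = [
    ("report_a", {"inductor": {"yolov3", "timm_efficientdet"}}),
    ("report_b", {"aot_eager": set()}),
    ("report_c", {"inductor": {"yolov3", "timm_efficientdet", "XLNetLMHeadModel"}}),
]

previous, current = latest_two_reports(reports, "inductor")
regressions = find_regressions(previous[1]["inductor"], current[1]["inductor"])
print(regressions if regressions else "No regressions found.")
```

When both reports flag the same models, as in the three suites listed above, the difference is empty and the sketch prints "No regressions found."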

torchbench suite with float32 precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| densenet121 | 4 | 1.0035 | 1.0256 | 2.3224 | 0.7859 | 5.2821 | 1.2907 |
| timm_efficientdet | 1 | 0.9834 | 0.8999 | 1.8222 | 0.7997 | 4.4603 | 1.57 |
| timm_vision_transformer | 8 | 1.0041 | 0.9453 | 1.5436 | 0.6548 | 2.6637 | 1.4486 |
| drq | 1 | 1.0129 | 0.8756 | 1.6045 | 0.7316 | 2.5029 | 1.0554 |
| BERT_pytorch | 16 | 1.0135 | 0.9018 | 1.1267 | 0.9574 | 2.2598 | 2.2102 |
| resnext50_32x4d | 8 | 0.9978 | 1.1101 | 1.3399 | 0.8014 | 2.0069 | 1.209 |
| mobilenet_v3_large | 32 | 1.0056 | 1.1173 | 1.0429 | 0.8851 | 1.9997 | 1.377 |
| resnet18 | 16 | 1.0045 | 1.1327 | 1.1648 | 0.8754 | 1.9838 | 1.2051 |
| squeezenet1_1 | 32 | 0.9966 | 1.0189 | 1.0646 | 0.8737 | 1.96 | 1.3037 |
| dcgan | 32 | 0.9755 | 1.0273 | 1.2572 | 0.7392 | 1.9451 | 1.0147 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9988 | 1.0235 | 1.3099 | 0.8566 | 1.9232 | 1.5707 |
| pytorch_struct | 200 | 0.99 | 0.7592 | 0.912 | 0.8077 | 1.8054 | 1.1302 |
| lennard_jones | 1000 | 0.9444 | 0.8476 | 1.0138 | 0.6676 | 1.7428 | 0.9349 |
| hf_T5_large | 2 | 1.0247 | 0.9105 | 0.0 | 0.0 | 1.6787 | 1.8006 |
| hf_Albert | 8 | 1.0004 | 0.9961 | 0.7517 | 1.5527 | 1.6541 | 1.6455 |
| shufflenet_v2_x1_0 | 128 | 1.0 | 1.0669 | 0.8074 | 0.8873 | 1.6461 | 1.4472 |
| hf_GPT2 | 4 | 1.0082 | 0.9803 | 0.7408 | 0.4034 | 1.5659 | 1.5006 |
| timm_resnest | 32 | 0.9989 | 1.0028 | 0.8042 | 1.1652 | 1.5175 | 1.4512 |
| mnasnet1_0 | 32 | 0.9982 | 1.0986 | 0.8977 | 0.9109 | 1.4736 | 1.2683 |
| speech_transformer | 32 | 0.9958 | 0.8866 | 1.4025 | 0.7445 | 1.4303 | 1.4342 |
| mobilenet_v2 | 96 | 0.9995 | 0.9987 | 0.7303 | 1.3334 | 1.4273 | 1.4091 |
| fastNLP_Bert | 6 | 0.9967 | 0.977 | 0.7525 | 1.1492 | 1.4172 | 1.3858 |
| soft_actor_critic | 256 | 0.9735 | 0.7767 | 1.055 | 0.6687 | 1.3854 | 0.9046 |
| timm_efficientnet | 32 | 0.9558 | 0.8153 | 0.6928 | 0.8133 | 1.3409 | 1.2024 |
| pytorch_stargan | 16 | 0.9985 | 1.0764 | 0.9341 | 0.0 | 1.2663 | 1.2267 |
| LearningToPaint | 96 | 1.0023 | 1.0607 | 0.8586 | 0.9812 | 1.2505 | 1.2038 |
| resnet152 | 32 | 1.0013 | 1.0983 | 0.7976 | 0.8947 | 1.2386 | 1.1988 |
| hf_Bart | 4 | 1.0136 | 0.9763 | 0.7456 | 0.865 | 1.2137 | 1.1954 |
| resnet50 | 32 | 0.9986 | 0.9922 | 0.76 | 0.9336 | 1.2119 | 1.168 |
| hf_Bert | 4 | 1.0337 | 1.0003 | 0.7338 | 0.8401 | 1.2089 | 1.28 |
| timm_nfnet | 128 | 0.9999 | 1.0003 | 0.0 | 1.1335 | 1.1973 | 1.1617 |
| pytorch_unet | 1 | 0.9997 | 0.2822 | 0.0 | 0.0 | 1.1918 | 1.1774 |
| hf_DistilBert | 8 | 1.0004 | 0.9394 | 0.6875 | 0.5267 | 1.1771 | 1.1804 |
| vgg16 | 64 | 0.9996 | 0.9985 | 0.8587 | 0.9974 | 1.1734 | 1.1683 |
| alexnet | 128 | 0.9991 | 0.998 | 0.8027 | 1.0014 | 1.1557 | 1.1571 |
| Super_SloMo | 6 | 0.9997 | 0.2436 | 0.0 | 0.2484 | 1.1534 | 1.1394 |
| hf_Reformer | 4 | 0.9975 | 1.0011 | 0.9889 | 0.6966 | 1.1311 | 1.1409 |
| timm_regnet | 32 | 0.9648 | 0.9629 | 0.7823 | 1.0958 | 1.129 | 1.0887 |
| Background_Matting | 4 | 1.0003 | 0.1926 | 0.0 | 0.0 | 1.0866 | 1.0763 |
| yolov3 | 16 | 0.9999 | 0.9944 | 0.7921 | 1.1517 | 1.0824 | 1.073 |
| mobilenet_v2_quantized_qat | 96 | 1.0008 | 0.98 | 0.0 | 1.4601 | 1.079 | 1.0799 |
| attention_is_all_you_need_pytorch | 256 | 0.9996 | 0.9694 | 0.7561 | 0.9549 | 1.0555 | 1.0374 |
| timm_vision_transformer_large | 8 | 1.0003 | 0.9947 | 0.0 | 0.0 | 1.046 | 1.0337 |
| timm_vovnet | 32 | 0.9093 | 0.9038 | 0.714 | 0.9037 | 1.0022 | 1.0183 |
| dlrm | 1024 | 1.5741 | 0.7423 | 0.0 | 0.6161 | 1.0011 | 1.298 |
| demucs | 4 | 0.9997 | 0.9996 | 0.9999 | 1.0002 | 0.9996 | 0.9998 |
| tts_angular | 64 | 0.9727 | 0.9582 | 0.9758 | 0.9475 | 0.9718 | 0.9992 |
| resnet50_quantized_qat | 32 | 1.0021 | 0.9702 | 0.0 | 1.1582 | 0.9657 | 0.9789 |
| nvidia_deeprecommender | 256 | 0.9986 | 0.9631 | 0.5847 | 0.9756 | 0.8349 | 0.8834 |
| hf_GPT2_large | 4 | 0.9999 | 0.9813 | 0.0 | 0.0 | 0.0 | 1.4698 |
| hf_T5 | 8 | 1.001 | 0.9523 | 0.0 | 1.1693 | 0.0 | 1.5265 |
| tacotron2 | 64 | 0.955 | 0.8542 | 0.0 | 0.75 | 0.0 | 0.8986 |
| functorch_dp_cifar10 | 64 | 0.9927 | 1.0315 | 2.0955 | 0.0 | 0.0 | 0.0 |
| hf_Longformer | 2 | 0.9221 | 0.8661 | 0.7961 | 0.0 | 0.0 | 0.0 |
| hf_BigBird | 2 | 0.9471 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| yolov3 | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | 0.0000 | pass | pass | pass |
| dlrm | 2 | pass | pass | 0.0000 | pass | pass | pass |
| timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass |
| Background_Matting | 4 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| pytorch_unet | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| resnet152 | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass |
| hf_Bart | 2 | pass | pass | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass |
| hf_Albert | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| hf_Bert | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| resnet50_quantized_qat | 2 | pass | pass | 0.0000 | pass | fail_accuracy | fail_accuracy |
| mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | 0.0000 | fail_accuracy | fail_accuracy | fail_accuracy |
| vision_maskrcnn | 2 | pass | pass | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3 | 16 | 1.6354 | 6.2371 | 9.3627 | 110.9949 | 363.3223 | 363.5352 |
| timm_efficientdet | 1 | 10.3237 | 26.0426 | 55.0578 | 441.0663 | 122.1681 | 119.8486 |
| hf_T5_large | 2 | 13.4812 | 36.8089 | nan | nan | 112.4542 | 110.6742 |
| mobilenet_v2_quantized_qat | 96 | 1.3755 | 8.5919 | nan | 177.824 | 85.4559 | 85.359 |
| resnet50_quantized_qat | 32 | 1.2534 | 8.2354 | nan | 166.5138 | 70.2192 | 70.2589 |
| timm_nfnet | 128 | 2.1975 | 6.6696 | nan | 142.0726 | 60.2962 | 59.5625 |
| timm_efficientnet | 32 | 1.8671 | 6.0866 | 13.4762 | 107.4705 | 60.2895 | 59.8081 |
| timm_resnest | 32 | 0.6103 | 2.2459 | 3.3552 | 60.0849 | 58.7028 | 57.2274 |
| timm_vision_transformer_large | 8 | 2.5999 | 12.4801 | nan | nan | 54.5499 | 53.3992 |
| mobilenet_v3_large | 32 | 0.9153 | 4.197 | 6.1182 | 95.5542 | 48.3793 | 49.6281 |
| timm_regnet | 32 | 2.3729 | 7.1901 | 16.2193 | 108.6363 | 43.6304 | 42.949 |
| densenet121 | 4 | 2.3437 | 10.8972 | 16.9486 | 147.8037 | 43.5116 | 42.7963 |
| attention_is_all_you_need_pytorch | 256 | 1.2646 | 6.3837 | 9.4798 | 93.0568 | 43.3369 | 42.549 |
| resnet152 | 32 | 2.4483 | 12.2524 | 18.497 | 161.0621 | 41.9274 | 41.0789 |
| timm_vision_transformer | 8 | 0.8614 | 3.7827 | 5.2393 | 67.0337 | 31.9489 | 32.6024 |
| hf_Bart | 4 | 1.7788 | 7.4426 | 11.0726 | 102.3463 | 30.9308 | 29.8215 |
| pytorch_stargan | 16 | 0.4318 | 1.9075 | 2.6477 | nan | 27.7113 | 26.1894 |
| BERT_pytorch | 16 | 1.556 | 6.4239 | 9.3342 | 74.0681 | 27.3944 | 26.8368 |
| fastNLP_Bert | 6 | 1.5787 | 5.932 | 8.966 | 69.1483 | 26.8007 | 25.5367 |
| speech_transformer | 32 | 1.7014 | 7.2884 | 24.8846 | 101.1197 | 25.4039 | 23.6491 |
| hf_Bert | 4 | 1.6406 | 5.728 | 8.2478 | 78.0938 | 19.5256 | 19.7035 |
| Super_SloMo | 6 | 1.1105 | 6.8699 | nan | 55.2499 | 19.4956 | 19.2889 |
| pytorch_struct | 200 | 0.2587 | 0.6783 | 1.379 | 4.6237 | 19.0146 | 21.9733 |
| shufflenet_v2_x1_0 | 128 | 0.9935 | 4.568 | 6.6067 | 87.0798 | 18.8474 | 17.9325 |
| mnasnet1_0 | 32 | 0.8458 | 3.8725 | 5.8954 | 70.5897 | 18.7227 | 18.4436 |
| hf_Albert | 8 | 1.3416 | 5.2165 | 8.0789 | 105.1684 | 18.5388 | 17.3497 |
| hf_GPT2 | 4 | 1.6172 | 5.6075 | 8.1803 | 61.1582 | 18.4405 | 17.5203 |
| timm_vovnet | 32 | 1.4367 | 4.0394 | 8.6816 | 57.1425 | 18.3505 | 17.9463 |
| hf_Reformer | 4 | 1.7176 | 2.9292 | 5.1167 | 15.2783 | 18.2104 | 15.7085 |
| resnet50 | 32 | 0.8881 | 4.0971 | 5.876 | 78.5284 | 18.1301 | 17.959 |
| resnext50_32x4d | 8 | 0.9186 | 4.0134 | 5.9733 | 66.0118 | 17.7336 | 17.0276 |
| Background_Matting | 4 | 0.7322 | 8.252 | nan | nan | 17.4055 | 17.5247 |
| mobilenet_v2 | 96 | 0.8265 | 4.0583 | 6.1482 | 94.6303 | 17.0665 | 16.3163 |
| hf_DistilBert | 8 | 0.6789 | 2.7723 | 4.9415 | 38.1696 | 12.2465 | 12.2699 |
| resnet18 | 16 | 0.4288 | 1.6261 | 2.2408 | 29.7394 | 10.7359 | 10.4617 |
| pytorch_unet | 1 | 0.4613 | 2.957 | nan | nan | 8.9936 | 8.9377 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.4144 | 1.7428 | 2.391 | 32.0044 | 8.3895 | 8.2632 |
| LearningToPaint | 96 | 0.443 | 1.6904 | 2.5138 | 39.1977 | 7.1633 | 6.8134 |
| dcgan | 32 | 0.1799 | 0.3839 | 0.5851 | 4.5186 | 6.1646 | 5.5943 |
| squeezenet1_1 | 32 | 0.2353 | 0.7117 | 1.1056 | 4.6681 | 4.969 | 4.6967 |
| drq | 1 | 0.3155 | 0.5416 | 0.7812 | 4.7837 | 3.7751 | 3.3936 |
| vgg16 | 64 | 0.1867 | 0.5198 | 0.797 | 3.1729 | 3.7561 | 3.6625 |
| soft_actor_critic | 256 | 0.2182 | 0.3237 | 0.5263 | 1.7788 | 3.5326 | 2.8779 |
| alexnet | 128 | 0.1588 | 0.3516 | 0.6004 | 3.1376 | 3.442 | 3.5139 |
| nvidia_deeprecommender | 256 | 0.1893 | 0.3724 | 0.6036 | 5.1758 | 3.3248 | 3.174 |
| dlrm | 1024 | 0.2617 | 0.5872 | nan | 3.3073 | 2.9688 | 2.7 |
| lennard_jones | 1000 | 0.1388 | 0.2588 | 0.4024 | 1.5147 | 1.978 | 1.8162 |
| tts_angular | 64 | 0.18 | 0.2282 | 0.3676 | 1.1598 | 1.9465 | 1.7284 |
| demucs | 4 | 0.3048 | 0.3075 | 0.304 | 0.3045 | 0.2132 | 0.2089 |
| tacotron2 | 64 | 5.2619 | 16.935 | nan | 49.2966 | nan | 45.8777 |
| hf_GPT2_large | 4 | 5.5142 | 18.6632 | nan | nan | nan | 45.221 |
| hf_T5 | 8 | 2.6187 | 8.1247 | nan | 68.783 | nan | 28.2063 |
| hf_Longformer | 2 | 6.431 | 14.0754 | 57.0597 | nan | nan | nan |
| functorch_dp_cifar10 | 64 | 0.2991 | 1.1756 | 1.7422 | nan | nan | nan |
| hf_BigBird | 2 | 3.6306 | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.5274 | 1.5274 |
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.226 | 1.4604 | 1.4599 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2634 | 0.988 | 1.3048 | 1.3922 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1744 | 1.2832 |
| timm_efficientdet | 1 | 1.0111 | 0.823 | 0.2891 | 1.1336 | 1.1153 | 1.1438 |
| Super_SloMo | 6 | 1.0024 | 0.902 | nan | 0.9454 | 1.1136 | 1.3409 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 1.0823 | 1.1864 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 1.0433 | 1.1066 |
| speech_transformer | 32 | 0.9974 | 0.9772 | 0.2738 | 1.1209 | 1.0376 | 1.0448 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| hf_GPT2 | 4 | 1.0 | 0.906 | 0.3702 | 1.1242 | 0.9703 | 1.1698 |
| timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.7594 | 0.9436 | 1.0969 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9404 | 1.0771 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.3341 | 0.8549 | 0.9231 | 1.1042 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9173 | 0.392 | 0.8945 | 0.9183 | 0.9986 |
| Background_Matting | 4 | 0.9998 | 0.8154 | nan | nan | 0.9107 | 1.0395 |
| resnet152 | 32 | 0.9975 | 0.9153 | 0.3424 | 0.8736 | 0.9065 | 0.9672 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 |
| timm_resnest | 32 | 0.9935 | 0.8793 | 0.3235 | 0.7926 | 0.8982 | 1.0023 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8098 | 0.8675 | 0.896 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8643 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | nan | 0.8621 | 1.031 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.857 | 1.0006 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8564 | 0.9343 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8531 | 0.8659 |
| pytorch_unet | 1 | 0.9985 | 0.8222 | nan | nan | 0.8484 | 1.0138 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 1.2124 | 0.8354 | 1.1229 |
| hf_Bart | 4 | 1.0 | 0.8779 | 0.3388 | 1.0865 | 0.8325 | 1.1284 |
| resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3595 | 0.8196 | 0.8303 | 0.8352 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | 0.3503 | 1.1286 | 0.8263 | 1.0815 |
| dlrm | 1024 | 0.8149 | 0.8149 | nan | 0.8147 | 0.7932 | 0.8152 |
| hf_Albert | 8 | 1.0 | 0.949 | 0.2846 | 1.062 | 0.7685 | 1.2076 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7609 | 0.9526 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3304 | 1.0652 | 0.7507 | 0.8214 |
| soft_actor_critic | 256 | 0.9998 | 0.9638 | 0.4356 | 0.9637 | 0.7501 | 0.9991 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8335 |
| hf_Bert | 4 | 1.0 | 0.9011 | 0.3525 | 1.0004 | 0.7061 | 1.0275 |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6902 | 0.7049 |
| LearningToPaint | 96 | 0.9442 | 0.6896 | 0.3385 | 0.6268 | 0.6881 | 0.913 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9553 |
| hf_DistilBert | 8 | 1.0 | 0.9042 | 0.3212 | 1.0228 | 0.6595 | 0.9466 |
| hf_Reformer | 4 | 0.9999 | 0.9996 | 0.5934 | 0.9996 | 0.577 | 1.0026 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5647 | 0.9991 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9676 | 0.4867 | 0.6781 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4213 | 0.4334 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.2564 | 0.2576 |
| hf_GPT2_large | 4 | 1.0 | 0.8833 | nan | nan | nan | 1.1831 |
| tacotron2 | 64 | 0.9903 | 1.0926 | nan | 1.114 | nan | 1.1617 |
| hf_T5 | 8 | 1.0 | 0.9415 | nan | 0.9432 | nan | 1.1436 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | nan | nan |
| hf_Longformer | 2 | 0.9999 | 0.9962 | 0.2947 | nan | nan | nan |
| hf_BigBird | 2 | 0.907 | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------+------+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+----------+-----------+----------------+-----------------+----------+------------------------+
| dlrm | 1024 | 145.7984 | 207.6899 | nan | 262.7884 | 216.5571 | 162.2796 |
| timm_vision_transformer_large | 8 | 196.9506 | 198.0936 | nan | nan | 188.6645 | 190.8661 |
| timm_nfnet | 128 | 206.5335 | 206.2 | nan | 181.6441 | 172.4207 | 177.1161 |
| Background_Matting | 4 | 186.4344 | 968.7476 | nan | nan | 171.7295 | 173.3873 |
| mobilenet_v2_quantized_qat | 96 | 147.5411 | 150.7503 | nan | 101.2533 | 136.9301 | 137.2624 |
| hf_T5_large | 2 | 187.2027 | 217.8882 | nan | nan | 119.8538 | 124.4225 |
| Super_SloMo | 6 | 117.7875 | 483.2367 | nan | 472.6073 | 101.8983 | 103.1831 |
| resnet50_quantized_qat | 32 | 97.3767 | 96.7707 | nan | 81.5065 | 97.7174 | 96.7575 |
| yolov3 | 16 | 102.3846 | 102.8534 | 129.3235 | 88.9406 | 94.8138 | 95.5003 |
| vgg16 | 64 | 106.4614 | 106.3831 | 123.8938 | 106.476 | 90.5831 | 90.962 |
| timm_regnet | 32 | 101.9266 | 102.003 | 125.2341 | 89.5559 | 86.8808 | 90.1499 |
| demucs | 4 | 77.806 | 77.7525 | 77.7939 | 77.7412 | 77.6049 | 77.8608 |
| resnet152 | 32 | 90.7643 | 86.1468 | 113.2426 | 102.338 | 73.6893 | 76.1503 |
| hf_Reformer | 4 | 83.2929 | 83.0494 | 84.1279 | 119.8835 | 73.5427 | 73.02 |
| attention_is_all_you_need_pytorch | 256 | 71.9634 | 74.4006 | 95.5739 | 75.3832 | 68.5621 | 69.4475 |
| mobilenet_v2 | 96 | 71.3667 | 71.4805 | 97.6961 | 53.4669 | 50.0126 | 50.6425 |
| pytorch_unet | 1 | 58.575 | 207.1999 | nan | nan | 49.1775 | 49.7137 |
| hf_Bart | 4 | 54.3479 | 56.5784 | 74.7698 | 63.2855 | 46.0392 | 46.2304 |
| hf_Albert | 8 | 75.1069 | 75.4387 | 100.2776 | 48.4396 | 45.6595 | 45.6648 |
| fastNLP_Bert | 6 | 60.0437 | 60.9458 | 79.4481 | 51.866 | 42.2286 | 43.2084 |
| speech_transformer | 32 | 49.5508 | 55.9815 | 35.4483 | 66.526 | 39.4781 | 35.6518 |
| timm_vovnet | 32 | 42.4582 | 42.582 | 54.0164 | 42.6042 | 38.4424 | 37.8749 |
| hf_GPT2 | 4 | 49.8996 | 51.5679 | 68.2068 | 124.3539 | 33.8149 | 33.5581 |
| hf_DistilBert | 8 | 38.8612 | 41.441 | 56.6701 | 73.8128 | 33.1363 | 32.9321 |
| hf_Bert | 4 | 38.1369 | 39.2135 | 53.5452 | 46.998 | 32.5045 | 33.208 |
| timm_efficientdet | 1 | 140.6959 | 158.0133 | 80.8003 | 198.4466 | 32.4673 | 91.925 |
| timm_efficientnet | 32 | 44.7572 | 52.553 | 61.119 | 52.4548 | 32.2628 | 36.246 |
| resnet50 | 32 | 38.8118 | 38.9811 | 50.8716 | 41.8121 | 32.188 | 33.0711 |
| shufflenet_v2_x1_0 | 128 | 37.5878 | 35.0988 | 45.7495 | 42.3971 | 23.7666 | 26.233 |
| BERT_pytorch | 16 | 50.6429 | 52.6528 | 42.1733 | 49.6076 | 23.1342 | 26.0395 |
| timm_resnest | 32 | 31.6816 | 31.5539 | 39.3459 | 27.1415 | 20.8394 | 21.778 |
| mnasnet1_0 | 32 | 28.7476 | 26.2038 | 33.279 | 31.5734 | 19.4974 | 24.2338 |
| pytorch_stargan | 16 | 24.2107 | 22.4361 | 25.9246 | nan | 19.1053 | 19.7165 |
| mobilenet_v3_large | 32 | 31.8242 | 28.4183 | 30.5035 | 38.4358 | 16.1067 | 24.2519 |
| resnext50_32x4d | 8 | 26.8473 | 24.2911 | 22.2291 | 34.191 | 13.4657 | 22.5428 |
| densenet121 | 4 | 70.8017 | 64.3562 | 28.3011 | 84.9445 | 13.1112 | 53.9096 |
| LearningToPaint | 96 | 15.6254 | 14.7326 | 18.2346 | 16.061 | 12.4748 | 13.1154 |
| alexnet | 128 | 12.417 | 12.467 | 15.4724 | 12.3896 | 10.7258 | 10.7298 |
| nvidia_deeprecommender | 256 | 8.5308 | 8.8536 | 14.5727 | 8.7498 | 10.1975 | 9.6454 |
| tts_angular | 64 | 9.5613 | 9.749 | 9.5795 | 10.228 | 10.1032 | 9.4664 |
| timm_vision_transformer | 8 | 23.7649 | 25.4412 | 15.6789 | 37.1878 | 9.5022 | 17.7649 |
| pytorch_CycleGAN_and_pix2pix | 1 | 16.8588 | 16.4571 | 12.6785 | 19.8456 | 8.7821 | 10.9824 |
| squeezenet1_1 | 32 | 12.8425 | 12.3271 | 12.0473 | 14.653 | 6.9754 | 10.1943 |
| resnet18 | 16 | 12.1567 | 11.5682 | 10.3559 | 15.201 | 6.6193 | 10.9969 |
| pytorch_struct | 200 | 3.8414 | 5.0122 | 4.925 | 5.7453 | 2.127 | 3.3945 |
| dcgan | 32 | 2.779 | 2.5721 | 2.1251 | 4.3096 | 1.3605 | 2.6224 |
| drq | 1 | 2.9641 | 3.4479 | 1.8306 | 5.1386 | 1.2443 | 3.6159 |
| soft_actor_critic | 256 | 1.0311 | 1.3514 | 1.0004 | 1.5725 | 0.7721 | 1.164 |
| lennard_jones | 1000 | 1.1367 | 1.2847 | 1.1 | 1.6474 | 0.6538 | 1.2395 |
| tacotron2 | 64 | 2915.841 | 3756.0577 | nan | 3637.6034 | nan | 3202.7772 |
| hf_GPT2_large | 4 | 241.0688 | 246.0031 | nan | nan | nan | 163.7209 |
| hf_T5 | 8 | 183.1427 | 191.7509 | nan | 156.3331 | nan | 120.0698 |
| hf_Longformer | 2 | 149.3061 | 160.2704 | 175.4235 | nan | nan | nan |
| functorch_dp_cifar10 | 64 | 11.246 | 11.1528 | 5.5032 | nan | nan | nan |
| hf_BigBird | 2 | 194.6826 | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+----------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| OPTForCausalLM | 2 | 0.9998 | 0.9317 | 0.0 | 0.8286 | 1.8011 | 1.8144 |
| GPT2ForSequenceClassification | 4 | 0.9997 | 0.978 | 0.0 | 0.7016 | 1.7763 | 1.7623 |
| XLNetLMHeadModel | 8 | 0.999 | 0.9636 | 0.0 | 0.0 | 1.6943 | 1.6968 |
| MT5ForConditionalGeneration | 16 | 1.0228 | 0.9327 | 0.9788 | 1.0644 | 1.56 | 1.5159 |
| GoogleFnet | 16 | 0.9994 | 0.9984 | 0.0 | 1.542 | 1.4471 | 1.559 |
| DistillGPT2 | 16 | 0.9997 | 0.9527 | 0.0 | 0.9183 | 1.4377 | 1.4837 |
| MobileBertForQuestionAnswering | 128 | 1.0257 | 0.951 | 0.0 | 0.0 | 1.4277 | 1.1123 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9751 | 0.0 | 1.1789 | 1.4275 | 1.4089 |
| T5ForConditionalGeneration | 4 | 0.9988 | 0.9388 | 0.7267 | 1.1017 | 1.4162 | 1.4091 |
| ElectraForCausalLM | 32 | 1.0003 | 0.9319 | 0.0 | 1.0178 | 1.4138 | 1.4502 |
| T5Small | 4 | 1.0011 | 0.931 | 0.7266 | 1.1024 | 1.4126 | 1.4174 |
| LayoutLMForSequenceClassification | 16 | 0.9996 | 0.9882 | 0.7376 | 1.1166 | 1.3109 | 1.2908 |
| RobertaForQuestionAnswering | 16 | 1.0 | 0.9779 | 0.7331 | 1.1152 | 1.2865 | 1.273 |
| BertForQuestionAnswering | 16 | 1.0001 | 0.9888 | 0.7326 | 1.1027 | 1.2839 | 1.2695 |
| RobertaForCausalLM | 16 | 1.0 | 0.9707 | 0.0 | 1.053 | 1.2698 | 1.2734 |
| AlbertForQuestionAnswering | 4 | 1.0009 | 1.0013 | 0.0 | 1.2336 | 1.2599 | 1.2597 |
| MobileBertForMaskedLM | 64 | 1.0211 | 0.9376 | 0.804 | 0.0 | 1.2425 | 1.1885 |
| MegatronBertForQuestionAnswering | 8 | 1.0 | 0.9922 | 0.0 | 1.0964 | 1.2191 | 1.2045 |
| MegatronBertForCausalLM | 4 | 1.0001 | 0.9844 | 0.7258 | 1.0611 | 1.2102 | 1.1966 |
| LayoutLMForMaskedLM | 16 | 1.0001 | 0.9695 | 0.0 | 1.0615 | 1.1913 | 1.1976 |
| BertForMaskedLM | 16 | 1.0001 | 0.9694 | 0.0 | 1.0508 | 1.1751 | 1.1817 |
| YituTechConvBert | 16 | 0.9998 | 0.9679 | 0.0 | 1.0032 | 1.1733 | 1.1714 |
| CamemBert | 16 | 1.0001 | 0.9694 | 0.0 | 1.0603 | 1.1688 | 1.1757 |
| PLBartForConditionalGeneration | 4 | 1.0 | 0.962 | 0.0 | 0.9566 | 1.1597 | 1.1629 |
| DistilBertForQuestionAnswering | 256 | 1.0002 | 0.9996 | 0.0 | 0.788 | 1.1573 | 1.1531 |
| XGLMForCausalLM | 8 | 1.0107 | 0.9443 | 0.738 | 0.3105 | 1.1478 | 1.1528 |
| PLBartForCausalLM | 8 | 0.9998 | 0.9449 | 0.0 | 0.956 | 1.1334 | 1.186 |
| MBartForConditionalGeneration | 2 | 1.0012 | 0.9875 | 0.0 | 1.0193 | 1.0962 | 1.0869 |
| BartForConditionalGeneration | 2 | 1.0011 | 0.9882 | 0.0 | 0.4467 | 1.093 | 1.0854 |
| MBartForCausalLM | 4 | 1.0005 | 0.9659 | 0.7543 | 0.9995 | 1.0851 | 1.0931 |
| BartForCausalLM | 4 | 1.0002 | 0.9659 | 0.7518 | 1.0002 | 1.0813 | 1.0813 |
| M2M100ForConditionalGeneration | 16 | 1.0468 | 0.958 | 0.7888 | 0.8691 | 1.08 | 1.054 |
| DebertaForMaskedLM | 4 | 0.8765 | 0.79 | 0.7191 | 0.634 | 1.0708 | 1.0236 |
| DebertaForQuestionAnswering | 8 | 0.9957 | 0.965 | 0.684 | 0.8348 | 1.0503 | 1.2176 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0007 | 0.9404 | 0.0 | 0.9523 | 1.0376 | 1.045 |
| PegasusForConditionalGeneration | 32 | 0.9988 | 0.9795 | 0.0 | 0.9782 | 1.0155 | 1.0122 |
| DistilBertForMaskedLM | 128 | 0.9992 | 0.9478 | 0.0 | 0.783 | 1.0116 | 1.0312 |
| DebertaV2ForQuestionAnswering | 2 | 0.8719 | 0.7667 | 0.0 | 0.6096 | 1.0002 | 0.9042 |
| DebertaV2ForMaskedLM | 1 | 0.8555 | 0.7124 | 0.0 | 0.0 | 0.9894 | 0.8249 |
| Speech2Text2ForCausalLM | 256 | 0.9976 | 0.9305 | 0.6525 | 0.9385 | 0.9892 | 1.0265 |
| PegasusForCausalLM | 32 | 0.9991 | 0.954 | 0.7334 | 0.9489 | 0.9706 | 0.9812 |
| BlenderbotSmallForCausalLM | 64 | 1.0004 | 0.9044 | 0.6828 | 0.9067 | 0.9515 | 0.9886 |
| TrOCRForCausalLM | 32 | 0.9996 | 0.957 | 0.0 | 0.9658 | 0.0 | 1.0262 |
| AlbertForMaskedLM | 4 | 1.0005 | 0.9995 | 0.0 | 1.2282 | 0.0 | 1.2511 |
| BlenderbotForCausalLM | 4 | 1.0037 | 0.984 | 0.0 | 0.9554 | 0.0 | 1.0129 |
| AllenaiLongformerBase | 4 | 0.9933 | 0.9468 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+
| BlenderbotForCausalLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| DebertaV2ForMaskedLM | 1 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| GPT2ForSequenceClassification | 1 | pass | pass | 0.0000 | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 |
pass | pass | pass | pass | pass | pass | | OPTForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass | | XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass | | XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass | | AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass | | MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass | | T5Small | 1 | pass | pass | pass | pass | pass | pass | | Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | CamemBert | 1 | pass | pass | pass | pass | pass | pass | | DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass | | LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass | | MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass | | 
PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass | | DebertaV2ForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | pass | | PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run | | MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | fail_to_run | fail_to_run | | AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | +-----------------------------------------+----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | XLNetLMHeadModel | 8 | 4.8856 | 18.5302 | nan | nan | 159.6013 | 164.8345 | | DebertaV2ForQuestionAnswering | 2 | 8.2681 | 16.9039 | nan | 94.1793 | 135.5197 | 49.4771 | | DebertaV2ForMaskedLM | 1 | 8.1726 | 16.9264 | nan | nan | 134.523 | 50.052 | | DebertaForMaskedLM | 4 | 5.0362 | 10.7311 | 34.8327 | 66.0862 | 89.8796 | 35.8129 | | DebertaForQuestionAnswering | 8 | 5.1186 | 11.0778 | 34.2422 | 69.3051 | 85.6723 | 37.0201 | | XGLMForCausalLM | 8 | 2.7952 | 11.2372 | 21.5534 | 162.4501 | 63.2053 | 60.7421 | | MobileBertForQuestionAnswering | 128 | 8.5719 | 24.8939 | nan | nan | 55.8955 | 52.7366 | | MobileBertForMaskedLM | 64 | 8.375 | 25.0166 | 42.2503 | nan | 53.7537 | 51.4946 | | M2M100ForConditionalGeneration | 16 | 3.4186 | 14.2347 | 19.6509 | 167.4793 | 52.5155 | 52.0956 | | MT5ForConditionalGeneration | 16 | 3.8623 | 11.8506 | 18.8403 | 107.0763 | 51.3101 | 
50.821 | | BartForConditionalGeneration | 2 | 3.4388 | 13.9654 | nan | 197.1978 | 45.571 | 43.4804 | | PegasusForConditionalGeneration | 32 | 3.1852 | 13.7402 | nan | 178.9468 | 44.4387 | 41.2961 | | MBartForConditionalGeneration | 2 | 3.4802 | 14.0021 | nan | 218.523 | 43.6665 | 42.253 | | YituTechConvBert | 16 | 2.455 | 8.8692 | nan | 100.5731 | 39.4978 | 36.8643 | | MegatronBertForCausalLM | 4 | 3.4803 | 11.931 | 17.6372 | 164.5888 | 33.9261 | 32.8176 | | MegatronBertForQuestionAnswering | 8 | 3.3468 | 11.5112 | nan | 157.6009 | 33.7586 | 32.538 | | BlenderbotSmallForConditionalGeneration | 64 | 2.2015 | 9.1425 | nan | 128.8174 | 31.6728 | 30.588 | | T5Small | 4 | 2.572 | 8.3189 | 12.3271 | 69.7378 | 31.0073 | 30.179 | | T5ForConditionalGeneration | 4 | 2.561 | 8.2991 | 12.2978 | 69.5094 | 30.7799 | 30.3999 | | LayoutLMForSequenceClassification | 16 | 1.9403 | 6.1809 | 9.2286 | 76.7753 | 28.0847 | 26.8806 | | GoogleFnet | 16 | 0.9924 | 2.9738 | nan | 42.8162 | 27.3214 | 19.6332 | | ElectraForCausalLM | 32 | 1.6452 | 5.7996 | nan | 75.504 | 26.6917 | 24.4539 | | PLBartForConditionalGeneration | 4 | 1.7364 | 7.2378 | nan | 101.2202 | 26.6243 | 25.7557 | | PegasusForCausalLM | 32 | 1.3343 | 5.201 | 8.3991 | 77.8071 | 22.0007 | 20.7603 | | LayoutLMForMaskedLM | 16 | 1.992 | 6.33 | nan | 78.9287 | 21.7911 | 20.5716 | | MBartForCausalLM | 4 | 1.3861 | 5.3043 | 7.8029 | 79.525 | 21.6829 | 21.183 | | ElectraForQuestionAnswering | 64 | 1.6233 | 5.7809 | nan | 78.281 | 20.9463 | 19.6997 | | BertForMaskedLM | 16 | 1.6375 | 5.9511 | nan | 77.4398 | 20.8012 | 20.8429 | | BertForQuestionAnswering | 16 | 1.6686 | 5.833 | 8.5261 | 78.1837 | 20.5596 | 19.2739 | | RobertaForCausalLM | 16 | 1.6595 | 5.8969 | nan | 80.1139 | 20.3979 | 19.4716 | | CamemBert | 16 | 1.5821 | 5.978 | nan | 82.5347 | 20.2924 | 18.8516 | | BartForCausalLM | 4 | 1.3478 | 5.3002 | 8.1261 | 70.8968 | 20.1944 | 20.0496 | | RobertaForQuestionAnswering | 16 | 1.6258 | 5.9443 | 8.7085 | 72.4637 | 19.1812 | 
18.569 | | OPTForCausalLM | 2 | 1.4984 | 5.6821 | nan | 73.6979 | 17.9371 | 17.2825 | | GPT2ForSequenceClassification | 4 | 1.6186 | 5.6863 | nan | 63.9009 | 17.4772 | 16.4919 | | AlbertForQuestionAnswering | 4 | 1.4301 | 5.4564 | nan | 103.7739 | 15.9297 | 15.6755 | | BlenderbotSmallForCausalLM | 64 | 0.8895 | 3.7531 | 5.4163 | 52.1065 | 14.6554 | 14.0155 | | Speech2Text2ForCausalLM | 256 | 0.8067 | 2.8565 | 4.4776 | 36.3033 | 13.6391 | 12.6567 | | DistillGPT2 | 16 | 0.8903 | 2.9522 | nan | 33.6215 | 13.5636 | 13.2986 | | PLBartForCausalLM | 8 | 0.7611 | 2.8047 | nan | 45.3473 | 13.5207 | 12.511 | | DistilBertForMaskedLM | 128 | 0.708 | 2.9024 | nan | 45.5734 | 11.3984 | 11.1229 | | DistilBertForQuestionAnswering | 256 | 0.7875 | 2.9502 | nan | 43.6371 | 10.9069 | 10.618 | | BlenderbotForCausalLM | 4 | 2.5195 | 10.4745 | nan | 153.2708 | nan | 38.6045 | | TrOCRForCausalLM | 32 | 1.2831 | 5.1844 | nan | 71.2263 | nan | 19.699 | | AlbertForMaskedLM | 4 | 1.4137 | 5.3723 | nan | 106.3506 | nan | 15.9873 | | AllenaiLongformerBase | 4 | 6.4066 | 14.037 | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 1.0 | 0.9092 | nan | 1.1724 | 1.0595 | 1.1588 | | XLNetLMHeadModel | 8 | 1.0 | 0.9323 | nan | nan | 0.9946 | 0.9946 | | GoogleFnet | 16 | 0.9224 | 0.9224 | nan | 1.4614 | 0.9608 | 1.2768 | | PLBartForConditionalGeneration | 4 | 0.9999 | 0.9344 | nan | 1.274 | 0.9316 | 1.2234 | | OPTForCausalLM | 2 | 
1.0001 | 0.9258 | nan | 1.0746 | 0.9068 | 1.1143 | | YituTechConvBert | 16 | 0.9966 | 0.9341 | nan | 0.9891 | 0.894 | 0.9822 | | DistillGPT2 | 16 | 1.0 | 0.8855 | nan | 1.055 | 0.8939 | 1.0108 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | 0.8646 | 1.4307 | | PegasusForConditionalGeneration | 32 | 0.9981 | 0.9529 | nan | 1.1152 | 0.8637 | 1.0262 | | M2M100ForConditionalGeneration | 16 | 0.9896 | 0.9328 | 0.3904 | 1.0162 | 0.8606 | 1.0096 | | PLBartForCausalLM | 8 | 1.0 | 0.8896 | nan | 1.0988 | 0.8367 | 1.0581 | | XGLMForCausalLM | 8 | 0.9848 | 0.9267 | 0.3971 | 0.9742 | 0.8157 | 0.9642 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8129 | 1.1049 | | T5Small | 4 | 1.0 | 0.9597 | 0.3543 | 0.9821 | 0.8129 | 1.1049 | | ElectraForCausalLM | 32 | 0.9983 | 0.883 | nan | 0.844 | 0.7929 | 0.9036 | | MBartForConditionalGeneration | 2 | 1.0 | 0.8931 | nan | 0.9681 | 0.7896 | 0.9837 | | PegasusForCausalLM | 32 | 0.9593 | 0.8885 | 0.3909 | 0.9964 | 0.7774 | 0.9692 | | MT5ForConditionalGeneration | 16 | 1.0014 | 0.8793 | 0.4388 | 0.9365 | 0.7748 | 0.9324 | | BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.7734 | 0.958 | | MegatronBertForQuestionAnswering | 8 | 1.0 | 0.9223 | nan | 1.0616 | 0.7709 | 1.0379 | | MegatronBertForCausalLM | 4 | 1.0 | 0.9018 | 0.3475 | 0.9999 | 0.7673 | 1.0153 | | MBartForCausalLM | 4 | 1.0 | 0.9122 | 0.3642 | 1.0011 | 0.7326 | 0.9478 | | RobertaForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0274 | | BertForQuestionAnswering | 16 | 1.0 | 0.9348 | 0.3313 | 1.1121 | 0.7273 | 1.0273 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.7189 | 1.0294 | | BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.7149 | 0.9466 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.7147 | 0.8647 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 1.1607 | 0.7054 | 1.0297 | | BlenderbotSmallForConditionalGeneration | 64 | 
1.0 | 0.8975 | nan | 1.0067 | 0.6977 | 0.946 | | LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | 0.695 | 0.9772 | | BertForMaskedLM | 16 | 1.0 | 0.9408 | nan | 0.9928 | 0.6945 | 0.9772 | | CamemBert | 16 | 1.0 | 0.9388 | nan | 0.987 | 0.6942 | 0.9746 | | RobertaForCausalLM | 16 | 1.0 | 0.9405 | nan | 0.9926 | 0.6942 | 0.9771 | | Speech2Text2ForCausalLM | 256 | 0.9545 | 0.8398 | 0.3515 | 0.9068 | 0.675 | 0.9168 | | DistilBertForQuestionAnswering | 256 | 1.0 | 0.9602 | nan | 1.1897 | 0.6589 | 0.9118 | | DistilBertForMaskedLM | 128 | 1.0 | 0.8847 | nan | 0.8827 | 0.6509 | 0.9194 | | DebertaV2ForMaskedLM | 1 | 1.0 | 0.9651 | nan | nan | 0.5682 | 0.9491 | | MobileBertForMaskedLM | 64 | 1.0 | 0.906 | 0.3175 | nan | 0.4951 | 0.6649 | | DebertaV2ForQuestionAnswering | 2 | 0.9842 | 0.9842 | nan | 0.9842 | 0.4735 | 0.984 | | MobileBertForQuestionAnswering | 128 | 1.0 | 0.9909 | nan | nan | 0.4145 | 0.535 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | 0.9719 | 0.3862 | 1.0347 | | DebertaForQuestionAnswering | 8 | 0.9637 | 1.042 | 0.3072 | 1.1342 | 0.2902 | 1.1339 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | nan | 1.3563 | | TrOCRForCausalLM | 32 | 1.0 | 0.8787 | nan | 0.9998 | nan | 0.9239 | | BlenderbotForCausalLM | 4 | 1.0001 | 0.8057 | nan | 0.8218 | nan | 0.8509 | | AllenaiLongformerBase | 4 | 0.999 | 0.9336 | nan | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | AlbertForQuestionAnswering | 4 | 381.6389 | 381.5768 | 
nan | 309.5473 | 304.016 | 304.4722 | | XLNetLMHeadModel | 8 | 377.4474 | 386.651 | nan | nan | 221.9556 | 221.5451 | | PegasusForConditionalGeneration | 32 | 177.067 | 179.887 | nan | 180.1232 | 173.7163 | 173.9517 | | MegatronBertForQuestionAnswering | 8 | 172.0998 | 173.5687 | nan | 157.511 | 141.6532 | 142.9868 | | BartForConditionalGeneration | 2 | 150.0858 | 152.0996 | nan | 336.2397 | 137.4719 | 138.4262 | | MBartForConditionalGeneration | 2 | 149.9556 | 151.9886 | nan | 146.7774 | 136.947 | 138.0473 | | YituTechConvBert | 16 | 155.4149 | 160.4101 | nan | 154.8507 | 132.6964 | 132.4286 | | MobileBertForQuestionAnswering | 128 | 139.5984 | 149.8426 | nan | nan | 125.8986 | 129.0089 | | DistilBertForQuestionAnswering | 256 | 144.4217 | 144.4494 | nan | 183.1548 | 125.1141 | 125.6088 | | MobileBertForMaskedLM | 64 | 146.2439 | 161.1292 | 188.4204 | nan | 120.8138 | 127.1054 | | DistilBertForMaskedLM | 128 | 122.0352 | 128.7136 | nan | 155.838 | 120.7518 | 118.368 | | CamemBert | 16 | 135.4751 | 139.9499 | nan | 127.8592 | 116.1891 | 115.3491 | | BlenderbotSmallForConditionalGeneration | 64 | 118.9823 | 126.6696 | nan | 125.0165 | 115.339 | 114.4027 | | LayoutLMForMaskedLM | 16 | 137.0272 | 141.0251 | nan | 128.9937 | 115.1043 | 114.43 | | BertForMaskedLM | 16 | 134.0957 | 138.4667 | nan | 127.821 | 114.4328 | 114.2152 | | DebertaV2ForQuestionAnswering | 2 | 121.2369 | 137.759 | nan | 172.9436 | 114.32 | 116.6213 | | BartForCausalLM | 4 | 123.436 | 127.85 | 162.5034 | 123.3015 | 114.0983 | 113.0763 | | MBartForCausalLM | 4 | 123.6282 | 127.5867 | 163.4636 | 123.4196 | 113.7533 | 112.5581 | | RobertaForCausalLM | 16 | 142.4906 | 146.6789 | nan | 135.4823 | 112.3951 | 111.9095 | | M2M100ForConditionalGeneration | 16 | 111.1891 | 122.2312 | 148.5964 | 134.0952 | 108.4947 | 112.0259 | | PLBartForConditionalGeneration | 4 | 121.3544 | 126.461 | nan | 126.5492 | 104.9908 | 104.4883 | | PLBartForCausalLM | 8 | 118.3603 | 122.0588 | nan | 120.7153 | 102.1076 | 99.6411 | 
| OPTForCausalLM | 2 | 170.1438 | 181.5889 | nan | 202.0833 | 93.9133 | 93.2046 | | DebertaV2ForMaskedLM | 1 | 102.4214 | 135.5547 | nan | nan | 89.9287 | 118.9807 | | PegasusForCausalLM | 32 | 85.4652 | 89.483 | 117.039 | 90.1886 | 88.3182 | 87.1189 | | ElectraForQuestionAnswering | 64 | 124.7979 | 128.2246 | nan | 106.0463 | 87.517 | 88.5626 | | LayoutLMForSequenceClassification | 16 | 113.1671 | 114.4853 | 153.6689 | 101.4269 | 86.5067 | 87.5717 | | RobertaForQuestionAnswering | 16 | 110.9254 | 113.6573 | 151.5027 | 99.4895 | 86.4746 | 87.202 | | BertForQuestionAnswering | 16 | 110.5852 | 111.7571 | 150.9405 | 100.2519 | 86.2743 | 87.0654 | | MegatronBertForCausalLM | 4 | 101.77 | 103.5955 | 140.9664 | 95.8013 | 84.3644 | 85.22 | | DistillGPT2 | 16 | 120.5819 | 126.6815 | nan | 131.4497 | 83.9182 | 81.3049 | | DebertaForQuestionAnswering | 8 | 82.3846 | 84.9567 | 120.0207 | 98.0728 | 78.2228 | 67.2035 | | ElectraForCausalLM | 32 | 105.6167 | 113.5211 | nan | 103.9784 | 74.9685 | 72.9815 | | T5ForConditionalGeneration | 4 | 104.1914 | 111.2592 | 143.7951 | 94.7738 | 73.8813 | 73.6031 | | T5Small | 4 | 104.3149 | 111.6766 | 143.8001 | 94.4594 | 73.8523 | 73.7287 | | GoogleFnet | 16 | 101.63 | 101.7431 | nan | 65.9405 | 70.2817 | 65.1865 | | XGLMForCausalLM | 8 | 79.1836 | 84.417 | 108.645 | 257.5168 | 70.0258 | 69.6314 | | BlenderbotSmallForCausalLM | 64 | 64.6798 | 72.159 | 94.8383 | 71.1219 | 67.699 | 65.4643 | | Speech2Text2ForCausalLM | 256 | 63.9902 | 69.5891 | 98.0329 | 68.1115 | 64.7555 | 62.2572 | | MT5ForConditionalGeneration | 16 | 88.4037 | 95.9477 | 96.9813 | 89.1061 | 58.0631 | 59.962 | | GPT2ForSequenceClassification | 4 | 102.2258 | 104.5188 | nan | 145.6993 | 57.5968 | 58.0014 | | DebertaForMaskedLM | 4 | 68.1679 | 75.5032 | 84.286 | 93.3816 | 55.9358 | 58.2796 | | AlbertForMaskedLM | 4 | 384.2948 | 385.0101 | nan | 313.0812 | nan | 307.9317 | | TrOCRForCausalLM | 32 | 167.3431 | 174.4871 | nan | 172.8569 | nan | 163.0042 | | BlenderbotForCausalLM 
| 4 | 92.6427 | 94.5667 | nan | 97.5011 | nan | 92.2562 | | AllenaiLongformerBase | 4 | 249.9404 | 262.5336 | nan | nan | nan | nan | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
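
For anyone who wants to summarize these tables locally, here is a minimal Python sketch that parses a bordered table (one row per line) and computes a per-backend geometric mean of the speedup column. The helper names `parse_table` and `geomean_speedup` are hypothetical, the geometric mean is an assumed choice of summary rather than something this dashboard states it uses, and treating `0.0` speedups as failed runs is also an assumption (those entries appear to line up with `nan` in the latency tables). The sample rows are excerpted from the huggingface speedup table above.

```python
import math


def parse_table(text):
    """Parse a '+---+' bordered table into a list of row dicts keyed by header."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.strip().splitlines()
        if line.strip().startswith("|")  # skip the '+---+' border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(records, backend):
    """Geometric mean of positive speedups; 0.0 is assumed to mean a failed run."""
    vals = [float(r[backend]) for r in records if float(r[backend]) > 0.0]
    if not vals:
        return float("nan")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


# Excerpt of three rows and three backend columns from the table above.
SAMPLE = """
+------------------+----+--------+----------------+----------+
| name | bs | eager | aot_cudagraphs | inductor |
+------------------+----+--------+----------------+----------+
| OPTForCausalLM | 2 | 0.9998 | 0.0 | 1.8011 |
| T5Small | 4 | 1.0011 | 0.7266 | 1.4126 |
| TrOCRForCausalLM | 32 | 0.9996 | 0.0 | 0.0 |
+------------------+----+--------+----------------+----------+
"""

records = parse_table(SAMPLE)
for backend in ("eager", "aot_cudagraphs", "inductor"):
    print(backend, round(geomean_speedup(records, backend), 4))
```

Skipping failed entries keeps a single `0.0` from collapsing the geometric mean to zero, but it also hides coverage gaps, so a pass count per backend is worth reporting alongside it.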

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9989 | 0.9733 | 0.8243 | 1.2828 | 1.8679 | 1.8249 |
| lcnet_050 | 128 | 0.9565 | 0.9441 | 0.7668 | 1.3431 | 1.6399 | 1.6132 |
| regnety_002 | 128 | 0.9769 | 1.004 | 0.8634 | 0.9586 | 1.5054 | 1.3487 |
| hrnet_w18 | 128 | 0.9999 | 0.9977 | 0.0 | 1.2728 | 1.4119 | 1.3768 |
| dla102 | 128 | 1.0 | 1.0006 | 0.0 | 1.2831 | 1.3825 | 1.3671 |
| volo_d1_224 | 64 | 0.9997 | 0.9957 | 0.8006 | 0.0 | 1.3789 | 1.3641 |
| res2net50_14w_8s | 128 | 0.9998 | 0.9992 | 0.0 | 1.2517 | 1.3547 | 1.3231 |
| coat_lite_mini | 128 | 0.9999 | 0.9988 | 0.8458 | 1.0939 | 1.3506 | 1.3585 |
| xcit_large_24_p8_224 | 5 | 1.0033 | 0.9978 | 0.7966 | 0.0 | 1.3472 | 1.3092 |
| mobilenetv3_large_100 | 128 | 0.9647 | 0.9631 | 0.7638 | 1.2801 | 1.3353 | 1.3475 |
| mobilenetv2_100 | 128 | 0.9661 | 0.9641 | 0.7057 | 1.2793 | 1.3293 | 1.3552 |
| adv_inception_v3 | 128 | 0.9999 | 0.9986 | 0.0 | 1.1286 | 1.3259 | 1.3079 |
| inception_v3 | 128 | 1.0 | 0.9992 | 0.0 | 1.1287 | 1.3257 | 1.3069 |
| gluon_inception_v3 | 128 | 0.9999 | 0.9986 | 0.0 | 1.1286 | 1.3255 | 1.3033 |
| crossvit_9_240 | 128 | 0.9994 | 0.9984 | 0.7604 | 1.0401 | 1.3233 | 1.2994 |
| resnest101e | 64 | 0.9999 | 1.002 | 0.0 | 1.1679 | 1.3144 | 1.2683 |
| res2next50 | 128 | 1.0 | 1.0004 | 0.0 | 1.1808 | 1.3098 | 1.2735 |
| fbnetv3_b | 128 | 0.9654 | 0.961 | 0.7618 | 1.2405 | 1.2821 | 1.2877 |
| botnet26t_256 | 128 | 0.9848 | 0.985 | 0.7897 | 0.0 | 1.2753 | 1.272 |
| gmixer_24_224 | 128 | 1.0 | 0.8353 | 0.0 | 1.0833 | 1.267 | 1.2498 |
| selecsls42b | 128 | 0.9997 | 0.9973 | 0.8149 | 1.2141 | 1.2661 | 1.2524 |
| eca_botnext26ts_256 | 128 | 0.9865 | 0.7726 | 0.0 | 0.0 | 1.2632 | 1.2534 |
| sebotnet33ts_256 | 64 | 0.9766 | 0.8074 | 0.0 | 0.0 | 1.2616 | 1.282 |
| tf_efficientnet_b0 | 128 | 0.9774 | 0.7835 | 0.0 | 1.1642 | 1.2593 | 1.2683 |
| mnasnet_100 | 128 | 0.9668 | 0.9611 | 0.7869 | 1.2448 | 1.2591 | 1.2796 |
| eca_halonext26ts | 128 | 0.9873 | 0.7789 | 0.0 | 0.0 | 1.2541 | 1.2384 |
| fbnetc_100 | 128 | 0.9669 | 0.9619 | 0.7884 | 1.232 | 1.2495 | 1.2638 |
| ese_vovnet19b_dw | 128 | 0.9794 | 0.9774 | 0.7448 | 1.1487 | 1.2426 | 1.2454 |
| jx_nest_base | 32 | 0.9996 | 0.9929 | 0.7371 | 0.0 | 1.2357 | 1.2265 |
| cspdarknet53 | 64 | 0.9581 | 0.9508 | 0.7362 | 1.1736 | 1.2338 | 1.246 |
| spnasnet_100 | 128 | 0.9618 | 0.9591 | 0.7717 | 1.2221 | 1.2267 | 1.2532 |
| res2net101_26w_4s | 64 | 0.9997 | 0.9957 | 0.7718 | 1.0998 | 1.226 | 1.201 |
| cait_m36_384 | 4 | 0.9997 | 0.999 | 0.0 | 0.0 | 1.2132 | 1.1926 |
| rexnet_100 | 128 | 0.973 | 0.8156 | 0.0 | 1.1614 | 1.2123 | 1.2175 |
| convit_base | 64 | 0.9997 | 0.9987 | 0.0 | 0.0 | 1.2097 | 1.2099 |
| gmlp_s16_224 | 128 | 0.9998 | 0.9988 | 0.0 | 1.0905 | 1.2091 | 1.2004 |
| pnasnet5large | 16 | 0.9996 | 0.9976 | 0.0 | 1.0896 | 1.2086 | 1.1905 |
| tinynet_a | 128 | 0.9658 | 0.7756 | 0.6204 | 1.1478 | 1.1903 | 1.199 |
| dm_nfnet_f0 | 128 | 0.9995 | 0.9995 | 0.0 | 1.1367 | 1.1881 | 1.16 |
| tf_mixnet_l | 128 | 0.9856 | 0.8897 | 0.0 | 1.0942 | 1.1878 | 1.1864 |
| pit_b_224 | 64 | 1.0 | 0.9991 | 0.0 | 1.0315 | 1.1862 | 1.1748 |
| dpn107 | 32 | 0.9587 | 0.9482 | 0.7793 | 1.0251 | 1.1855 | 1.2005 |
| mobilevit_s | 64 | 0.9792 | 0.7621 | 0.0 | 0.0 | 1.1753 | 1.1688 |
| twins_pcpvt_base | 64 | 0.9997 | 0.999 | 0.749 | 0.0 | 1.1739 | 1.1469 |
| mixnet_l | 128 | 0.9845 | 0.8843 | 0.0 | 1.0988 | 1.1706 | 1.1736 |
| poolformer_m36 | 64 | 0.9998 | 0.9995 | 0.0 | 0.0 | 1.1688 | 1.1506 |
| repvgg_a2 | 128 | 0.9638 | 0.9615 | 0.828 | 1.1366 | 1.1687 | 1.1689 |
| nfnet_l0 | 128 | 1.0 | 0.7878 | 0.0 | 1.1041 | 1.1556 | 1.12 |
| swin_base_patch4_window7_224 | 64 | 1.0001 | 0.9795 | 0.0 | 0.0 | 1.1265 | 1.1141 |
| beit_base_patch16_224 | 64 | 0.9998 | 0.9815 | 0.0 | 0.0 | 1.1121 | 1.0993 |
| swsl_resnext101_32x16d | 32 | 0.9999 | 0.9993 | 0.0 | 1.1086 | 1.1104 | 1.0704 |
| deit_base_distilled_patch16_224 | 64 | 0.9996 | 0.999 | 0.7669 | 0.9807 | 1.0968 | 1.0836 |
| vit_base_patch16_224 | 64 | 0.9999 | 0.9988 | 0.7675 | 0.9512 | 1.0862 | 1.0737 |
| gluon_xception65 | 32 | 0.9997 | 0.9968 | 0.0 | 1.08 | 1.0858 | 1.0734 |
| convmixer_768_32 | 32 | 0.9998 | 0.9998 | 0.0 | 0.0 | 1.0762 | 1.0735 |
| gernet_l | 128 | 0.9741 | 0.9728 | 0.825 | 1.0978 | 1.0757 | 1.0699 |
| mixer_b16_224 | 128 | 0.9993 | 0.998 | 0.0 | 0.8935 | 1.073 | 1.0682 |
| visformer_small | 128 | 0.9997 | 1.0021 | 0.7986 | 0.0 | 1.0431 | 1.0091 |
| convnext_base | 64 | 0.9997 | 0.9986 | 0.0 | 0.0 | 1.0407 | 1.0385 |
| resmlp_12_224 | 128 | 0.9999 | 1.0007 | 0.6941 | 1.2124 | 0.9786 | 0.95 |
| tnt_s_patch16_224 | 128 | 0.9998 | 0.9982 | 0.0 | 0.0 | 0.0 | 1.5086 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+---------------------------------+----+-------+-----------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+-----------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | 
pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | 0.0000 | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 
2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+-----------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | 
inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 6.4756 | 26.702 | nan | 570.8057 | 116.6022 | 104.7812 |
| xcit_large_24_p8_224 | 5 | 2.9556 | 15.0965 | 28.895 | nan | 80.0082 | 78.1161 |
| twins_pcpvt_base | 64 | 2.2282 | 11.5241 | 19.5705 | nan | 79.1168 | 78.2481 |
| swin_base_patch4_window7_224 | 64 | 3.0565 | 11.362 | nan | nan | 76.6958 | 76.0976 |
| mobilevit_s | 64 | 1.958 | 6.8243 | nan | nan | 76.0377 | 77.2519 |
| pnasnet5large | 16 | 4.9118 | 19.4046 | nan | 311.9781 | 73.13 | 70.2227 |
| cait_m36_384 | 4 | 3.1702 | 15.9962 | nan | nan | 64.2538 | 60.8414 |
| dm_nfnet_f0 | 128 | 2.2625 | 6.8374 | nan | 142.0521 | 62.2417 | 60.4557 |
| resnest101e | 64 | 3.3619 | 14.0628 | nan | 254.6111 | 58.2706 | 56.7154 |
| coat_lite_mini | 128 | 1.1426 | 4.6882 | 6.8927 | 92.9574 | 57.6438 | 57.094 |
| res2net101_26w_4s | 64 | 3.2186 | 14.9742 | 24.8239 | 232.6651 | 54.2774 | 53.2034 |
| jx_nest_base | 32 | 1.9463 | 8.652 | 13.4587 | nan | 53.2587 | 52.1241 |
| eca_halonext26ts | 128 | 1.6205 | 5.028 | nan | nan | 50.9626 | 50.2019 |
| res2net50_14w_8s | 128 | 2.6905 | 13.4565 | nan | 244.6253 | 49.2735 | 47.7889 |
| poolformer_m36 | 64 | 1.8583 | 6.9693 | nan | nan | 48.8138 | 46.1251 |
| nfnet_l0 | 128 | 2.0411 | 6.8822 | nan | 126.1996 | 48.1478 | 46.9126 |
| convnext_base | 64 | 1.4192 | 5.6396 | nan | nan | 45.9069 | 44.6469 |
| dpn107 | 32 | 4.3582 | 13.5357 | 32.7307 | 158.9202 | 42.0982 | 40.7838 |
| sebotnet33ts_256 | 64 | 1.8274 | 5.7397 | nan | nan | 41.7765 | 40.2244 |
| gmlp_s16_224 | 128 | 1.0571 | 5.891 | nan | 131.7034 | 40.9954 | 38.8211 |
| crossvit_9_240 | 128 | 1.5148 | 7.1062 | 10.8681 | 148.1577 | 37.4106 | 36.8766 |
| volo_d1_224 | 64 | 1.3887 | 6.8045 | 10.6998 | nan | 36.4721 | 35.101 |
| fbnetv3_b | 128 | 3.2668 | 10.2558 | 25.5884 | 209.5548 | 36.305 | 35.3063 |
| gluon_xception65 | 32 | 1.9317 | 9.8304 | nan | 151.3835 | 35.8462 | 34.1556 |
| eca_botnext26ts_256 | 128 | 1.5437 | 4.9444 | nan | nan | 34.8142 | 34.4775 |
| adv_inception_v3 | 128 | 1.6969 | 7.8193 | nan | 140.4195 | 33.2887 | 31.589 |
| gluon_inception_v3 | 128 | 1.6116 | 7.4816 | nan | 141.5712 | 33.1582 | 32.2971 |
| inception_v3 | 128 | 1.6286 | 7.5167 | nan | 148.2791 | 33.025 | 31.7048 |
| ghostnet_100 | 128 | 2.8964 | 8.9049 | 13.2465 | 158.0201 | 32.9463 | 31.2167 |
| gmixer_24_224 | 128 | 1.1783 | 6.4719 | nan | 121.9812 | 31.6746 | 29.8052 |
| mixnet_l | 128 | 3.2368 | 9.6668 | nan | 152.3955 | 31.4656 | 29.4135 |
| tf_mixnet_l | 128 | 3.5944 | 9.9241 | nan | 152.6773 | 31.2211 | 29.8027 |
| dla102 | 128 | 1.8314 | 8.6229 | nan | 167.7114 | 31.0995 | 29.599 |
| botnet26t_256 | 128 | 1.484 | 4.1052 | 8.4761 | nan | 30.9319 | 30.2954 |
| swsl_resnext101_32x16d | 32 | 1.7965 | 8.1434 | nan | 125.2937 | 30.497 | 29.3371 |
| convit_base | 64 | 1.0561 | 5.0025 | nan | nan | 28.9359 | 27.2984 |
| res2next50 | 128 | 1.5981 | 7.3342 | nan | 154.011 | 28.1216 | 27.2235 |
| rexnet_100 | 128 | 2.063 | 6.7491 | nan | 142.8165 | 27.0464 | 26.5678 |
| tinynet_a | 128 | 2.1581 | 7.316 | 16.8904 | 141.4605 | 26.2701 | 25.0101 |
| cspdarknet53 | 64 | 2.3764 | 6.7987 | 16.604 | 120.8281 | 23.3994 | 22.188 |
| spnasnet_100 | 128 | 2.1011 | 6.0118 | 15.1699 | 108.1621 | 23.2212 | 21.1145 |
| convmixer_768_32 | 32 | 1.1868 | 5.4494 | nan | nan | 22.9314 | 21.4181 |
| tf_efficientnet_b0 | 128 | 1.9102 | 6.2836 | nan | 127.7187 | 22.8865 | 22.1728 |
| mixer_b16_224 | 128 | 0.6092 | 2.7467 | nan | 69.9051 | 22.1631 | 21.657 |
| fbnetc_100 | 128 | 2.1547 | 6.3502 | 15.8057 | 114.5417 | 22.1127 | 21.3895 |
| visformer_small | 128 | 1.0006 | 3.7761 | 5.6035 | nan | 22.0157 | 21.3972 |
| resmlp_12_224 | 128 | 0.6415 | 2.5142 | 4.1729 | 33.5147 | 21.0957 | 20.9601 |
| pit_b_224 | 64 | 1.0378 | 4.2955 | nan | 96.1925 | 21.0688 | 19.9988 |
| beit_base_patch16_224 | 64 | 1.3742 | 4.9599 | nan | nan | 20.6552 | 19.6783 |
| mobilenetv3_large_100 | 128 | 1.6358 | 5.1074 | 11.7594 | 117.4029 | 20.5149 | 19.7168 |
| deit_base_distilled_patch16_224 | 64 | 0.9344 | 3.8873 | 6.2432 | 73.1047 | 19.8005 | 19.4375 |
| mnasnet_100 | 128 | 1.6963 | 5.081 | 11.8104 | 90.5821 | 19.5754 | 17.5309 |
| mobilenetv2_100 | 128 | 1.7722 | 5.0977 | 12.5256 | 97.8267 | 19.4847 | 18.5104 |
| vit_base_patch16_224 | 64 | 0.9324 | 3.8899 | 5.9607 | 75.4057 | 19.458 | 18.85 |
| repvgg_a2 | 128 | 2.1085 | 5.6808 | 14.8521 | 159.5345 | 18.9088 | 18.17 |
| gernet_l | 128 | 2.1571 | 5.656 | 13.8864 | 87.8723 | 18.199 | 17.3765 |
| regnety_002 | 128 | 1.6783 | 5.0509 | 11.2638 | 92.4051 | 18.1054 | 17.1211 |
| selecsls42b | 128 | 0.8371 | 3.4511 | 5.2309 | 75.7339 | 15.8576 | 15.3927 |
| lcnet_050 | 128 | 1.1107 | 3.2303 | 6.7754 | 67.1498 | 13.9044 | 12.8923 |
| ese_vovnet19b_dw | 128 | 1.0312 | 2.7948 | 6.1005 | 55.9446 | 12.7815 | 12.1307 |
| tnt_s_patch16_224 | 128 | 1.7539 | 9.1117 | nan | nan | nan | 33.7155 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 1.6177 | 1.5612 | 1.6333 |
| tinynet_a | 128 | 0.9942 | 0.7796 | 0.2616 | 0.9898 | 1.351 | 1.5843 |
| rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2619 | 1.4738 |
| tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.2059 | 1.3819 |
| mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 |
| pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1917 | 1.1772 | 1.3424 |
| mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1752 | 1.2828 |
| eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | nan | 1.1378 | 1.2737 |
| eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1376 | 1.2529 |
| nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.7757 | 1.1264 | 1.3578 |
| cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1133 | 1.1802 |
| poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0528 | 1.0689 |
| beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 1.0038 | 1.0607 |
| resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1036 |
| vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 1.2304 | 0.997 | 1.0835 |
| fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.9927 | 1.051 |
| deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 1.2337 | 0.9925 | 1.0805 |
| twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | nan | 0.9882 | 1.0887 |
| ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 0.9853 | 1.1265 |
| mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 1.4726 | 0.985 | 1.0539 |
| convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 |
| volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9837 | 1.0658 |
| gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 1.0153 | 0.9766 | 0.9828 |
| tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9765 | 1.1445 |
| xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | 0.3296 | nan | 0.9653 | 1.0595 |
| dla102 | 128 | 0.9831 | 0.917 | nan | 0.953 | 0.9633 | 1.0419 |
| ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 0.952 | 1.0925 |
| cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9468 | 1.1098 |
| dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.7593 | 0.9435 | 1.0967 |
| gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.942 | 0.988 |
| mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9408 | 1.0412 |
| spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9382 | 0.993 |
| hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8647 | 0.938 | 1.0123 |
| jx_nest_base | 32 | 1.0002 | 0.8966 | 0.2863 | nan | 0.9348 | 1.0603 |
| mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9325 | 0.9919 |
| res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9285 | 1.0154 |
| lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.9152 | 0.9655 |
| inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0634 |
| adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9138 | 1.0636 |
| gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0636 |
| res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9078 | 1.0156 |
| swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9069 | 1.0516 |
| mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9065 | 1.0615 |
| dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.894 | 0.9058 | 0.9838 |
| fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9049 | 0.9968 |
| visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.994 |
| selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.899 | 1.0046 |
| swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8932 | 0.9946 |
| res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8821 | 1.0206 |
| regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8617 | 1.0396 |
| botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 |
| pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8525 | 1.0752 |
| convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.8485 | 1.0335 |
| sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | nan | 0.8189 | 0.9416 |
| resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | 1.3763 | 0.8169 | 0.8253 |
| coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.9856 | 0.8154 | 1.0235 |
| gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9926 |
| repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6571 | 0.7684 | 0.9902 |
| convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.7449 | 0.9008 |
| crossvit_9_240 | 128 | 0.9884 | 0.8657 | 0.282 | 1.1222 | 0.6745 | 0.9137 |
| tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8633 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)

~~~
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| convmixer_768_32 | 32 | 364.7685 | 364.8565 | nan | nan | 339.4212 | 339.5381 |
| hrnet_w18 | 128 | 416.9164 | 417.4999 | nan | 326.7599 | 294.6548 | 302.5669 |
| convnext_base | 64 | 263.4482 | 264.0711 | nan | nan | 253.4351 | 253.2399 |
| pnasnet5large | 16 | 289.2864 | 289.7427 | nan | 265.1101 | 239.4172 | 243.0997 |
| tf_mixnet_l | 128 | 256.7006 | 284.549 | nan | 231.118 | 213.0943 | 213.3198 |
| swin_base_patch4_window7_224 | 64 | 237.0367 | 242.1293 | nan | nan | 210.5258 | 212.6132 |
| mixnet_l | 128 | 247.4092 | 275.3706 | nan | 221.5611 | 208.1897 | 207.6151 |
| swsl_resnext101_32x16d | 32 | 219.2517 | 219.8235 | nan | 198.13 | 198.2839 | 204.9308 |
| dla102 | 128 | 269.5024 | 269.2413 | nan | 209.9942 | 194.8145 | 196.9871 |
| cait_m36_384 | 4 | 216.9907 | 216.7021 | nan | nan | 178.9326 | 181.8339 |
| resnest101e | 64 | 229.9588 | 229.1705 | nan | 196.8235 | 175.3676 | 181.3803 |
| dm_nfnet_f0 | 128 | 206.3026 | 205.687 | nan | 181.0582 | 173.055 | 177.4365 |
| gluon_inception_v3 | 128 | 226.5355 | 227.1061 | nan | 200.7527 | 170.794 | 173.7007 |
| inception_v3 | 128 | 226.4406 | 226.5753 | nan | 200.6574 | 170.7906 | 173.209 |
| adv_inception_v3 | 128 | 226.3375 | 226.8494 | nan | 200.4902 | 170.7829 | 173.0918 |
| res2net50_14w_8s | 128 | 229.3406 | 229.4974 | nan | 183.3019 | 169.2226 | 173.2039 |
| gluon_xception65 | 32 | 182.9118 | 183.085 | nan | 168.8361 | 168.1194 | 170.0764 |
| convit_base | 64 | 196.4916 | 196.5551 | nan | nan | 162.5045 | 162.2467 |
| res2next50 | 128 | 206.5705 | 206.5063 | nan | 175.3999 | 157.8124 | 162.5408 |
| dpn107 | 32 | 191.1037 | 192.8055 | 234.7882 | 178.5375 | 154.2675 | 152.3627 |
| nfnet_l0 | 128 | 175.9404 | 223.8138 | nan | 159.3266 | 152.2736 | 156.8451 |
| gernet_l | 128 | 165.1883 | 165.4525 | 193.5618 | 146.3476 | 149.5938 | 150.2617 |
| poolformer_m36 | 64 | 174.6063 | 174.7309 | nan | nan | 149.4226 | 151.7588 |
| mixer_b16_224 | 128 | 158.8231 | 158.7491 | nan | 177.1866 | 148.2321 | 148.8556 |
| coat_lite_mini | 128 | 191.7547 | 191.7765 | 226.6466 | 175.253 | 141.9589 | 140.989 |
| pit_b_224 | 64 | 158.5438 | 158.5401 | nan | 153.5621 | 133.6889 | 134.8539 |
| eca_halonext26ts | 128 | 169.2091 | 214.5469 | nan | nan | 133.0879 | 135.0166 |
| eca_botnext26ts_256 | 128 | 163.2372 | 208.7425 | nan | nan | 127.5217 | 128.6211 |
| gmlp_s16_224 | 128 | 152.0876 | 152.2203 | nan | 139.4356 | 125.91 | 126.5958 |
| res2net101_26w_4s | 64 | 152.1679 | 152.0514 | 196.6326 | 137.601 | 123.7417 | 127.6215 |
| visformer_small | 128 | 128.4044 | 127.9026 | 160.7974 | nan | 123.1748 | 127.1464 |
| fbnetv3_b | 128 | 162.4607 | 163.1637 | 205.8877 | 126.3347 | 122.3261 | 121.824 |
| botnet26t_256 | 128 | 152.3326 | 152.2524 | 189.9517 | nan | 117.6174 | 117.9262 |
| twins_pcpvt_base | 64 | 137.1407 | 137.3877 | 182.8849 | nan | 116.8401 | 119.5368 |
| gmixer_24_224 | 128 | 146.3916 | 175.2358 | nan | 135.0674 | 115.7041 | 116.9999 |
| beit_base_patch16_224 | 64 | 128.5192 | 130.8376 | nan | nan | 115.543 | 116.8268 |
| volo_d1_224 | 64 | 153.5624 | 154.0392 | 192.0321 | nan | 111.4682 | 112.3729 |
| vit_base_patch16_224 | 64 | 119.1809 | 119.2246 | 155.3422 | 125.1089 | 110.3061 | 111.288 |
| deit_base_distilled_patch16_224 | 64 | 119.877 | 119.9558 | 156.3764 | 122.1719 | 109.9522 | 111.1186 |
| repvgg_a2 | 128 | 127.2862 | 127.4791 | 146.3677 | 107.8353 | 104.949 | 104.759 |
| tf_efficientnet_b0 | 128 | 133.8315 | 166.9492 | nan | 112.2655 | 103.9038 | 103.1134 |
| xcit_large_24_p8_224 | 5 | 134.7408 | 136.3769 | 174.3425 | nan | 102.0291 | 104.8238 |
| cspdarknet53 | 64 | 130.3179 | 131.2615 | 169.5526 | 106.4497 | 101.2016 | 100.094 |
| jx_nest_base | 32 | 121.5441 | 122.0372 | 164.4998 | nan | 98.2074 | 98.7616 |
| mobilevit_s | 64 | 117.0262 | 150.345 | nan | nan | 97.5298 | 98.0344 |
| rexnet_100 | 128 | 119.1607 | 142.0498 | nan | 99.7554 | 95.6362 | 95.1322 |
| fbnetc_100 | 128 | 123.1929 | 123.8786 | 151.3854 | 96.7891 | 95.4265 | 94.3299 |
| tinynet_a | 128 | 109.8996 | 137.0024 | 171.3408 | 92.5508 | 89.2134 | 88.5995 |
| sebotnet33ts_256 | 64 | 114.3651 | 138.387 | nan | nan | 88.5371 | 87.1093 |
| spnasnet_100 | 128 | 105.7082 | 106.0901 | 131.9371 | 83.2707 | 83.0008 | 81.1741 |
| ese_vovnet19b_dw | 128 | 99.5701 | 99.7387 | 130.8873 | 84.8795 | 78.4736 | 78.2217 |
| mnasnet_100 | 128 | 98.5817 | 99.2107 | 121.2448 | 76.5818 | 75.8076 | 74.3997 |
| crossvit_9_240 | 128 | 98.4895 | 98.515 | 129.569 | 94.6375 | 74.4851 | 75.8095 |
| resmlp_12_224 | 128 | 71.1621 | 71.1429 | 102.6623 | 58.6743 | 72.9277 | 74.9741 |
| mobilenetv2_100 | 128 | 97.5753 | 97.7786 | 133.6191 | 73.7276 | 70.9768 | 69.5273 |
| selecsls42b | 128 | 89.6548 | 89.8646 | 109.9877 | 73.8062 | 70.7392 | 71.5968 |
| mobilenetv3_large_100 | 128 | 85.5493 | 85.6857 | 108.2155 | 64.4879 | 61.9179 | 61.2035 |
| ghostnet_100 | 128 | 114.5882 | 117.6969 | 139.1523 | 89.2151 | 61.427 | 62.7548 |
| regnety_002 | 128 | 53.142 | 51.8497 | 60.7109 | 54.5029 | 35.0884 | 41.5974 |
| lcnet_050 | 128 | 38.2867 | 38.7731 | 47.7544 | 27.25 | 22.3091 | 22.6699 |
| tnt_s_patch16_224 | 128 | 469.7881 | 470.7283 | nan | nan | nan | 311.4072 |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
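
The absolute latencies above can be turned into a per-backend speedup over eager. The sketch below assumes speedup is simply eager latency divided by backend latency, and that a `nan` cell means no measurement is available for that backend. The helper names (`speedup_over_eager`, `speedups`) are hypothetical and not part of the benchmark scripts. The three rows are copied from the Absolute latency (ms) table.

```python
import math

# Latencies in milliseconds, copied from three rows of the
# "Absolute latency (ms)" table. Only a subset of columns is reproduced.
LATENCY_MS = {
    "hrnet_w18": {
        "eager": 416.9164, "inductor": 294.6548, "inductor_no_cudagraphs": 302.5669,
    },
    "resmlp_12_224": {
        "eager": 71.1621, "inductor": 72.9277, "inductor_no_cudagraphs": 74.9741,
    },
    "tnt_s_patch16_224": {
        "eager": 469.7881, "inductor": float("nan"), "inductor_no_cudagraphs": 311.4072,
    },
}


def speedup_over_eager(row, backend):
    """Return eager latency / backend latency, or None if the cell is nan.

    Assumption: this ratio is how "speedup" is defined here.
    """
    latency = row[backend]
    if math.isnan(latency):
        return None
    return row["eager"] / latency


def speedups(table, backend):
    """Map each model name to its speedup for one backend, skipping nan cells."""
    result = {}
    for name, row in table.items():
        ratio = speedup_over_eager(row, backend)
        if ratio is not None:
            result[name] = round(ratio, 3)
    return result


if __name__ == "__main__":
    for backend in ("inductor", "inductor_no_cudagraphs"):
        print(backend, speedups(LATENCY_MS, backend))
```

A ratio above 1 means the backend is faster than eager for that model. Under this definition resmlp_12_224 comes out slightly below 1 for inductor, which matches its higher latency in the table.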

Performance graphs

bench_logs/timm_models_float32.png : ![](https://i.imgur.com/udASePJ.png)

bench_logs/torchbench_float32.png : ![](https://i.imgur.com/mHTHzQx.png)

bench_logs/huggingface_float32.png : ![](https://i.imgur.com/Rw4Iyh3.png)

pytorch / pytorch

TorchDynamo Performance DashBoard #93794

Compilation Profile

Compilation Latency

Peak Memory

Number of graphs

Performance Dashboard for float32 precision

Executive Summary

torchbench suite with float32 precision

huggingface suite with float32 precision

timm_models suite with float32 precision

Performance graphs

Performance Dashboard for amp precision

Executive Summary

torchbench suite with amp precision

huggingface suite with amp precision

timm_models suite with amp precision

Performance graphs
