williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs, and each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint compression ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

When measuring performance, compilation latency, and memory footprint reduction, we exclude the models that fail the accuracy checks.
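
As a rough illustration of how a suite-level speedup can be aggregated from per-model results, the sketch below computes a geometric mean over the models that produced a valid speedup. The helper name `geomean_speedup` and the timings are hypothetical, and treating a speedup of 0.0 or NaN as "failed, exclude it" is an assumption based on how failed models appear in the per-model tables; this is not the dashboard's actual code.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups, skipping failed models.

    Assumption: a speedup of 0.0 (or NaN) marks a model that failed and
    is excluded before aggregating.
    """
    valid = [s for s in speedups if not math.isnan(s) and s > 0.0]
    if not valid:
        return float("nan")
    # exp(mean(log(x))) avoids overflow from multiplying many values.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Hypothetical timings in milliseconds: speedup = eager time / backend time.
eager_ms = [10.0, 10.0, 10.0]
backend_ms = [5.0, 20.0, 10.0]
speedups = [e / b for e, b in zip(eager_ms, backend_ms)]  # 2.0, 0.5, 1.0

# The trailing 0.0 stands in for a model that failed and is ignored.
print(f"{geomean_speedup(speedups + [0.0]):.2f}x")
```

A geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown cancel out, which an arithmetic mean would not reflect.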

Pass rate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 54/55 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 95%, 52/55 | 100%, 43/43 | 98%, 60/61  |
|     aot_cudagraphs     | 73%, 40/55 | 47%, 20/43  | 39%, 24/61  |
|      aot_nvfuser       | 58%, 32/55 |  2%, 1/43   | 89%, 54/61  |
|        inductor        | 87%, 48/55 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 91%, 50/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
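
Each cell in the table above pairs a rounded percentage with the raw passed/total counts. A minimal sketch of that formatting, using a hypothetical helper name `format_pass_rate` (not taken from the benchmark scripts):

```python
def format_pass_rate(passed, total):
    """Format a pass rate as '<rounded percent>, <passed>/<total>'."""
    if total <= 0:
        raise ValueError("total must be positive")
    # The ".0%" format multiplies by 100 and rounds to a whole percent.
    return f"{passed / total:.0%}, {passed}/{total}"


print(format_pass_rate(54, 55))
print(format_pass_rate(20, 43))
```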

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.09x    |    1.02x    |    1.00x    |
|      aot_nvfuser       |   1.13x    |    1.12x    |    1.11x    |
|        inductor        |   1.48x    |    1.28x    |    1.25x    |
| inductor_no_cudagraphs |   1.22x    |    1.21x    |    1.24x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.08    |    2.22     |    1.88     |
|       aot_eager        |    6.92    |    9.05     |    8.70     |
|     aot_cudagraphs     |    8.23    |    18.64    |    15.25    |
|      aot_nvfuser       |   20.32    |    9.60     |    50.01    |
|        inductor        |   62.17    |    52.98    |    73.89    |
| inductor_no_cudagraphs |   64.61    |    49.17    |    72.74    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.32x    |
|      aot_nvfuser       |   0.83x    |    1.08x    |    0.84x    |
|        inductor        |   0.82x    |    0.72x    |    0.97x    |
| inductor_no_cudagraphs |   0.94x    |    0.96x    |    1.02x    |
+------------------------+------------+-------------+-------------+
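
One plausible reading of the ratio above, stated here as an assumption rather than taken from the comment, is eager peak memory divided by the backend's peak memory. Under that reading, a value above 1.0x means the backend used less memory than eager, and a value below 1.0x means it used more. The function name and byte counts below are made up for illustration.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Assumed definition: eager peak memory / backend peak memory."""
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return eager_peak_bytes / backend_peak_bytes


GIB = 1024 ** 3

# Hypothetical run: eager peaks at 8 GiB, the compiled backend at 10 GiB.
ratio = compression_ratio(8 * GIB, 10 * GIB)
print(f"{ratio:.2f}x")  # below 1.0x: the backend used more memory than eager
```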

Warnings

Performance speedup warnings
~~~
+-------------+------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench | lennard_jones | 1.818 | 0.9452 |
| torchbench | dlrm | 1.0006 | 0.0 |
| torchbench | nvidia_deeprecommender | 0.904 | 0.9643 |
| torchbench | hf_GPT2_large | 0.0 | 1.3706 |
| torchbench | hf_T5 | 0.0 | 1.5515 |
| torchbench | tacotron2 | 0.0 | 0.9362 |
| torchbench | hf_Longformer | 0.0 | 0.0 |
| torchbench | moco | 0.0 | 0.0 |
| huggingface | AllenaiLongformerBase | 0.0 | 0.0 |
| timm_models | resmlp_12_224 | 0.9499 | 0.9719 |
| timm_models | tnt_s_patch16_224 | 0.0 | 1.5436 |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-----------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------+----------+------------------------+
| torchbench | timm_efficientdet | 484.0577 | 488.767 |
| torchbench | yolov3 | 419.4861 | 419.8955 |
| torchbench | hf_T5_large | 205.3317 | 202.2279 |
| torchbench | timm_vision_transformer | 153.43 | 160.5928 |
| torchbench | speech_transformer | 152.3735 | 147.9389 |
| torchbench | timm_resnest | 150.1654 | 145.0659 |
| torchbench | attention_is_all_you_need_pytorch | 137.7387 | 139.7203 |
| torchbench | timm_vision_transformer_large | 126.2802 | 123.9619 |
| torchbench | dlrm | 3.4517 | nan |
| torchbench | hf_GPT2_large | nan | 143.1625 |
| torchbench | tacotron2 | nan | 106.378 |
| torchbench | hf_T5 | nan | 44.804 |
| torchbench | hf_Longformer | nan | nan |
| torchbench | moco | nan | nan |
| huggingface | XGLMForCausalLM | 203.4086 | 201.0863 |
| huggingface | DebertaForMaskedLM | 163.7151 | 106.9608 |
| huggingface | DebertaForQuestionAnswering | 152.0741 | 118.2059 |
| huggingface | M2M100ForConditionalGeneration | 128.0751 | 124.2115 |
| huggingface | AllenaiLongformerBase | nan | nan |
| timm_models | twins_pcpvt_base | 431.1592 | 426.4103 |
| timm_models | coat_lite_mini | 362.4216 | 372.6703 |
| timm_models | mobilevit_s | 233.8428 | 237.9062 |
| timm_models | eca_halonext26ts | 204.8437 | 207.0974 |
| timm_models | sebotnet33ts_256 | 185.8238 | 191.2608 |
| timm_models | eca_botnext26ts_256 | 179.8768 | 176.7545 |
| timm_models | swin_base_patch4_window7_224 | 177.0112 | 174.7488 |
| timm_models | xcit_large_24_p8_224 | 172.3324 | 164.8544 |
| timm_models | jx_nest_base | 155.4547 | 156.5451 |
| timm_models | convnext_base | 133.0295 | 129.8216 |
| timm_models | cait_m36_384 | 132.7509 | 130.12 |
| timm_models | tnt_s_patch16_224 | nan | 50.0197 |
+-------------+-----------------------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench | timm_resnest | 0.8982 | 0.9697 |
| torchbench | speech_transformer | 0.896 | 0.8996 |
| torchbench | pytorch_CycleGAN_and_pix2pix | 0.8848 | 0.9654 |
| torchbench | hf_Albert | 0.8836 | 1.2215 |
| torchbench | mobilenet_v3_large | 0.8829 | 0.8964 |
| torchbench | hf_T5_large | 0.8737 | 0.922 |
| torchbench | timm_vision_transformer_large | 0.8616 | 1.0285 |
| torchbench | pytorch_unet | 0.859 | 0.8608 |
| torchbench | resnet50 | 0.8564 | 0.8913 |
| torchbench | densenet121 | 0.8562 | 0.9307 |
| torchbench | mnasnet1_0 | 0.8531 | 0.8659 |
| torchbench | hf_Bart | 0.8503 | 1.1284 |
| torchbench | fastNLP_Bert | 0.8354 | 1.0952 |
| torchbench | resnext50_32x4d | 0.8303 | 0.8352 |
| torchbench | BERT_pytorch | 0.825 | 1.0689 |
| torchbench | hf_BigBird | 0.8211 | 1.0393 |
| torchbench | dcgan | 0.767 | 0.7903 |
| torchbench | drq | 0.7632 | 0.8778 |
| torchbench | soft_actor_critic | 0.75 | 0.9991 |
| torchbench | timm_vision_transformer | 0.7478 | 0.8187 |
| torchbench | alexnet | 0.743 | 0.8332 |
| torchbench | timm_vovnet | 0.7286 | 0.7339 |
| torchbench | LearningToPaint | 0.7133 | 0.7462 |
| torchbench | hf_Bert | 0.7048 | 0.985 |
| torchbench | dlrm | 0.7035 | nan |
| torchbench | resnet18 | 0.6902 | 0.7049 |
| torchbench | hf_DistilBert | 0.6596 | 0.9466 |
| torchbench | vgg16 | 0.6471 | 0.6497 |
| torchbench | lennard_jones | 0.5646 | 0.9989 |
| torchbench | nvidia_deeprecommender | 0.5598 | 0.5598 |
| torchbench | attention_is_all_you_need_pytorch | 0.4682 | 0.6183 |
| torchbench | pytorch_struct | 0.4222 | 0.429 |
| torchbench | functorch_dp_cifar10 | 0.4056 | 0.4212 |
| torchbench | hf_Reformer | 0.299 | 0.9882 |
| torchbench | hf_T5 | nan | 1.1507 |
| torchbench | tacotron2 | nan | 1.1496 |
| torchbench | hf_GPT2_large | nan | 1.1258 |
| torchbench | hf_Longformer | nan | nan |
| torchbench | moco | nan | nan |
| huggingface | AlbertForQuestionAnswering | 0.8646 | 1.4039 |
| huggingface | T5Small | 0.8564 | 1.0758 |
| huggingface | PegasusForConditionalGeneration | 0.8436 | 1.0204 |
| huggingface | AlbertForMaskedLM | 0.842 | 1.3737 |
| huggingface | BigBird | 0.8224 | 1.0108 |
| huggingface | T5ForConditionalGeneration | 0.8215 | 1.1049 |
| huggingface | DistillGPT2 | 0.8173 | 0.9383 |
| huggingface | XGLMForCausalLM | 0.8157 | 0.9642 |
| huggingface | YituTechConvBert | 0.808 | 0.8738 |
| huggingface | BartForConditionalGeneration | 0.7817 | 0.9515 |
| huggingface | PegasusForCausalLM | 0.7774 | 0.9692 |
| huggingface | M2M100ForConditionalGeneration | 0.7712 | 1.016 |
| huggingface | GoogleFnet | 0.7698 | 0.9373 |
| huggingface | MT5ForConditionalGeneration | 0.7623 | 0.9396 |
| huggingface | MegatronBertForQuestionAnswering | 0.7528 | 0.9646 |
| huggingface | CamemBert | 0.7492 | 0.9186 |
| huggingface | PLBartForConditionalGeneration | 0.7397 | 0.9638 |
| huggingface | PLBartForCausalLM | 0.7381 | 0.9055 |
| huggingface | MBartForConditionalGeneration | 0.7209 | 0.9059 |
| huggingface | LayoutLMForSequenceClassification | 0.7189 | 1.0246 |
| huggingface | MegatronBertForCausalLM | 0.7161 | 0.9248 |
| huggingface | BartForCausalLM | 0.7149 | 0.9466 |
| huggingface | BlenderbotSmallForCausalLM | 0.7147 | 0.8647 |
| huggingface | ElectraForQuestionAnswering | 0.7054 | 1.0298 |
| huggingface | DistilBertForQuestionAnswering | 0.6981 | 0.9303 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977 | 0.946 |
| huggingface | LayoutLMForMaskedLM | 0.695 | 0.9772 |
| huggingface | MBartForCausalLM | 0.6836 | 0.8978 |
| huggingface | TrOCRForCausalLM | 0.6827 | 0.8876 |
| huggingface | Speech2Text2ForCausalLM | 0.6775 | 0.8801 |
| huggingface | OPTForCausalLM | 0.6761 | 0.8847 |
| huggingface | ElectraForCausalLM | 0.6731 | 0.905 |
| huggingface | DistilBertForMaskedLM | 0.6531 | 0.9124 |
| huggingface | BertForMaskedLM | 0.6385 | 0.8993 |
| huggingface | RobertaForCausalLM | 0.6375 | 0.8975 |
| huggingface | RobertaForQuestionAnswering | 0.6329 | 0.8939 |
| huggingface | BertForQuestionAnswering | 0.6329 | 0.8939 |
| huggingface | MobileBertForMaskedLM | 0.5256 | 0.7111 |
| huggingface | MobileBertForQuestionAnswering | 0.4536 | 0.5968 |
| huggingface | DebertaForMaskedLM | 0.4267 | 1.0347 |
| huggingface | DebertaForQuestionAnswering | 0.3264 | 1.1588 |
| huggingface | AllenaiLongformerBase | nan | nan |
| timm_models | selecsls42b | 0.899 | 0.9192 |
| timm_models | adv_inception_v3 | 0.8983 | 0.9073 |
| timm_models | gluon_inception_v3 | 0.8983 | 0.9073 |
| timm_models | inception_v3 | 0.8983 | 0.9073 |
| timm_models | mnasnet_100 | 0.8961 | 0.9077 |
| timm_models | swsl_resnext101_32x16d | 0.8931 | 0.9249 |
| timm_models | lcnet_050 | 0.8921 | 0.923 |
| timm_models | cspdarknet53 | 0.8835 | 0.8875 |
| timm_models | res2net50_14w_8s | 0.881 | 0.9327 |
| timm_models | regnety_002 | 0.8617 | 0.8993 |
| timm_models | botnet26t_256 | 0.8605 | 0.8702 |
| timm_models | pit_b_224 | 0.8417 | 1.0633 |
| timm_models | fbnetc_100 | 0.8416 | 0.8498 |
| timm_models | sebotnet33ts_256 | 0.841 | 0.9711 |
| timm_models | coat_lite_mini | 0.8404 | 1.0528 |
| timm_models | resmlp_12_224 | 0.8169 | 0.8253 |
| timm_models | gernet_l | 0.7928 | 0.8234 |
| timm_models | repvgg_a2 | 0.7684 | 0.8011 |
| timm_models | convit_base | 0.7463 | 0.9008 |
| timm_models | crossvit_9_240 | 0.6496 | 0.8704 |
| timm_models | tnt_s_patch16_224 | nan | 0.8623 |
+-------------+-----------------------------------------+----------+------------------------+
~~~

torchbench suite with float32 precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| densenet121 | 4 | 1.0028 | 0.9993 | 2.3219 | 1.443 | 5.4438 | 1.3058 |
| timm_efficientdet | 1 | 0.9824 | 0.8845 | 0.0 | 0.0 | 4.2758 | 1.526 |
| functorch_dp_cifar10 | 64 | 1.0024 | 0.9777 | 2.1532 | 1.1969 | 3.6923 | 1.2407 |
| timm_vision_transformer | 8 | 1.0068 | 0.9447 | 1.5339 | 1.3578 | 2.5716 | 1.4121 |
| drq | 1 | 1.0315 | 0.8503 | 1.3708 | 1.0638 | 2.4195 | 1.0737 |
| resnext50_32x4d | 8 | 1.0007 | 1.079 | 1.2092 | 1.3669 | 2.0959 | 1.2162 |
| mobilenet_v3_large | 32 | 1.0078 | 1.1087 | 1.0365 | 1.3781 | 1.9864 | 1.3795 |
| BERT_pytorch | 16 | 1.0104 | 0.8854 | 0.0 | 0.0 | 1.9168 | 1.9012 |
| resnet18 | 16 | 1.006 | 1.1021 | 1.168 | 1.3958 | 1.8428 | 1.2045 |
| pytorch_struct | 200 | 0.9977 | 0.7381 | 0.8734 | 0.8906 | 1.827 | 1.1633 |
| lennard_jones | 1000 | 0.976 | 0.8293 | 1.0524 | 1.0142 | 1.818 | 0.9452 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9968 | 0.9377 | 1.2471 | 1.1785 | 1.7636 | 1.3013 |
| squeezenet1_1 | 32 | 0.9979 | 0.9923 | 1.0527 | 1.1557 | 1.7406 | 1.2709 |
| hf_Albert | 8 | 1.0015 | 0.9976 | 0.752 | 0.0 | 1.6466 | 1.6414 |
| dcgan | 32 | 0.9829 | 1.0102 | 1.2585 | 1.1788 | 1.6306 | 1.0725 |
| hf_T5_large | 2 | 1.0248 | 0.9068 | 0.0 | 0.0 | 1.5833 | 1.5731 |
| speech_transformer | 32 | 1.0038 | 0.9068 | 0.0 | 0.0 | 1.5684 | 1.544 |
| shufflenet_v2_x1_0 | 128 | 1.0005 | 1.0532 | 0.8062 | 1.1931 | 1.53 | 1.3689 |
| timm_resnest | 32 | 0.9996 | 1.0027 | 0.8044 | 1.1815 | 1.5191 | 1.4517 |
| timm_nfnet | 128 | 0.9993 | 0.9999 | 0.0 | 1.2122 | 1.4726 | 1.4222 |
| mnasnet1_0 | 32 | 0.9993 | 1.0945 | 0.8568 | 1.2932 | 1.4577 | 1.2734 |
| mobilenet_v2_quantized_qat | 96 | 1.0016 | 0.978 | 0.0 | 0.0 | 1.4527 | 1.4479 |
| mobilenet_v2 | 96 | 0.9998 | 1.0003 | 0.7313 | 1.0443 | 1.4287 | 1.4088 |
| hf_GPT2 | 4 | 1.0046 | 0.9827 | 0.738 | 0.0 | 1.4239 | 1.4306 |
| soft_actor_critic | 256 | 0.9921 | 0.7715 | 1.1241 | 0.9985 | 1.4185 | 0.9565 |
| resnet50_quantized_qat | 32 | 1.0019 | 0.9619 | 0.0 | 0.0 | 1.401 | 1.3947 |
| fastNLP_Bert | 6 | 0.9997 | 0.9761 | 0.7528 | 0.0 | 1.3686 | 1.3445 |
| timm_efficientnet | 32 | 0.9551 | 0.8076 | 0.7031 | 1.0629 | 1.3353 | 1.2011 |
| LearningToPaint | 96 | 1.0048 | 1.0586 | 0.8687 | 1.2057 | 1.2627 | 1.2074 |
| pytorch_unet | 1 | 1.0001 | 0.9982 | 0.8464 | 1.0765 | 1.2042 | 1.1861 |
| resnet50 | 32 | 0.9994 | 0.9937 | 0.7608 | 1.1612 | 1.204 | 1.1695 |
| Super_SloMo | 6 | 1.0003 | 0.9974 | 0.8669 | 0.0 | 1.18 | 1.1645 |
| hf_Bart | 4 | 1.0127 | 0.9757 | 0.0 | 0.0 | 1.1721 | 1.1653 |
| vgg16 | 64 | 1.0 | 0.999 | 0.859 | 0.9973 | 1.1707 | 1.1652 |
| alexnet | 128 | 0.9991 | 0.998 | 0.8031 | 1.0004 | 1.163 | 1.1651 |
| hf_Bert | 4 | 1.0214 | 0.944 | 0.7306 | 0.0 | 1.1575 | 1.1396 |
| hf_DistilBert | 8 | 0.9999 | 0.9569 | 0.6872 | 0.0 | 1.1481 | 1.1546 |
| timm_regnet | 32 | 0.9653 | 0.9617 | 0.7795 | 1.096 | 1.1283 | 1.0941 |
| pytorch_stargan | 16 | 0.9997 | 0.983 | 0.866 | 0.9896 | 1.1189 | 1.0913 |
| Background_Matting | 4 | 1.0006 | 1.0218 | 0.866 | 1.0816 | 1.1153 | 1.1069 |
| hf_Reformer | 4 | 0.9961 | 0.0 | 0.9267 | 0.0 | 1.1095 | 1.1343 |
| hf_BigBird | 2 | 0.9915 | 0.939 | 0.9612 | 0.0 | 1.0921 | 1.0042 |
| yolov3 | 16 | 1.0 | 0.9954 | 0.7893 | 1.1839 | 1.0795 | 1.0647 |
| attention_is_all_you_need_pytorch | 256 | 0.9999 | 0.9726 | 0.0 | 0.0 | 1.047 | 1.033 |
| timm_vision_transformer_large | 8 | 0.9982 | 0.9912 | 0.0 | 0.9805 | 1.044 | 1.0331 |
| tts_angular | 64 | 0.9937 | 0.964 | 0.9933 | 1.0231 | 1.0136 | 1.0218 |
| timm_vovnet | 32 | 0.9102 | 0.9045 | 0.7132 | 0.9774 | 1.0069 | 1.0176 |
| dlrm | 2048 | 1.0064 | 1.0734 | 0.0 | 0.0 | 1.0006 | 0.0 |
| demucs | 4 | 0.9997 | 0.9998 | 0.999 | 0.9999 | 1.0 | 1.0007 |
| nvidia_deeprecommender | 256 | 0.9994 | 0.9628 | 0.585 | 0.942 | 0.904 | 0.9643 |
| hf_GPT2_large | 4 | 1.0004 | 0.9805 | 0.0 | 0.0 | 0.0 | 1.3706 |
| hf_T5 | 8 | 1.0002 | 0.9932 | 0.0 | 0.0 | 0.0 | 1.5515 |
| tacotron2 | 64 | 0.981 | 0.8581 | 0.0 | 0.0 | 0.0 | 0.9362 |
| hf_Longformer | 2 | 0.9701 | 0.9013 | 0.8196 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass |
| yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | fail_to_run | fail_to_run | pass |
| hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | 0.0000 |
| resnet50_quantized_qat | 2 | pass | pass | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy |
| mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
| timm_efficientdet | 1 | 19.5344 | 38.4011 | nan | nan | 484.0577 | 488.767 |
| yolov3 | 16 | 2.7711 | 8.6894 | 11.9084 | 43.4046 | 419.4861 | 419.8955 |
| hf_T5_large | 2 | 13.2998 | 41.15 | nan | nan | 205.3317 | 202.2279 |
| timm_vision_transformer | 8 | 0.7808 | 4.1474 | 5.8215 | 9.3655 | 153.43 | 160.5928 |
| speech_transformer | 32 | 1.5424 | 8.2938 | nan | nan | 152.3735 | 147.9389 |
| timm_resnest | 32 | 0.5383 | 2.6812 | 3.7424 | 35.1306 | 150.1654 | 145.0659 |
| attention_is_all_you_need_pytorch | 256 | 1.0734 | 7.1292 | nan | nan | 137.7387 | 139.7203 |
| timm_vision_transformer_large | 8 | 2.223 | 13.8751 | nan | 24.351 | 126.2802 | 123.9619 |
| pytorch_stargan | 16 | 0.3789 | 2.3643 | 3.1326 | 3.9188 | 107.0355 | 104.0851 |
| pytorch_struct | 200 | 0.2366 | 0.7827 | 1.3456 | 4.0715 | 99.505 | 98.1575 |
| BERT_pytorch | 16 | 1.4194 | 7.614 | nan | nan | 92.0393 | 92.0811 |
| fastNLP_Bert | 6 | 1.4306 | 6.6169 | 10.0451 | nan | 65.652 | 63.418 |
| hf_GPT2 | 4 | 1.2488 | 6.1179 | 8.8738 | nan | 63.5447 | 63.521 |
| hf_Bart | 4 | 1.3924 | 8.089 | nan | nan | 49.9676 | 49.9717 |
| densenet121 | 4 | 1.9897 | 13.3477 | 20.1678 | 88.3763 | 45.0957 | 43.7205 |
| mobilenet_v3_large | 32 | 0.8275 | 4.8204 | 6.7604 | 53.5764 | 44.9158 | 46.9735 |
| hf_Albert | 8 | 1.0066 | 5.8746 | 8.5532 | nan | 41.987 | 41.132 |
| hf_BigBird | 2 | 7.3861 | 13.5387 | 29.953 | nan | 41.2734 | 26.6352 |
| resnet50_quantized_qat | 32 | 1.061 | 9.0448 | nan | nan | 39.8902 | 40.3176 |
| hf_Bert | 4 | 1.312 | 6.2693 | 8.8293 | nan | 39.8395 | 38.7377 |
| timm_regnet | 32 | 2.173 | 8.4238 | 20.7651 | 47.6157 | 37.2439 | 35.16 |
| hf_Reformer | 4 | 2.3483 | nan | 9.1124 | nan | 36.065 | 30.7238 |
| timm_efficientnet | 32 | 1.6787 | 6.665 | 16.1146 | 52.4346 | 34.2419 | 34.4653 |
| mnasnet1_0 | 32 | 0.7461 | 4.4921 | 6.4014 | 30.714 | 31.0909 | 30.7546 |
| resnet50 | 32 | 0.7937 | 4.9477 | 6.925 | 32.2699 | 31.0875 | 29.832 |
| hf_DistilBert | 8 | 0.4278 | 3.0834 | 6.0696 | nan | 30.4362 | 29.5285 |
| resnext50_32x4d | 8 | 0.8239 | 4.9203 | 6.8365 | 28.5464 | 30.2931 | 30.0266 |
| timm_vovnet | 32 | 1.4222 | 4.5909 | 10.441 | 23.5649 | 30.0127 | 29.7463 |
| timm_nfnet | 128 | 1.8844 | 7.7171 | nan | 29.8502 | 29.8712 | 28.8763 |
| mobilenet_v2_quantized_qat | 96 | 1.1759 | 8.8754 | nan | nan | 27.0997 | 27.2946 |
| functorch_dp_cifar10 | 64 | 0.3232 | 1.9699 | 2.8309 | 5.5366 | 26.1947 | 24.9937 |
| resnet18 | 16 | 0.3858 | 1.8912 | 2.6752 | 17.5591 | 23.2902 | 20.4971 |
| shufflenet_v2_x1_0 | 128 | 0.8656 | 5.4261 | 7.6883 | 26.8524 | 18.5748 | 17.9867 |
| Super_SloMo | 6 | 0.9695 | 5.0542 | 6.7627 | nan | 17.3419 | 16.4668 |
| Background_Matting | 4 | 0.6979 | 4.5367 | 6.7144 | 29.2894 | 16.7635 | 16.0163 |
| mobilenet_v2 | 96 | 0.7343 | 4.4782 | 6.6781 | 37.1045 | 16.669 | 16.3002 |
| pytorch_unet | 1 | 0.4223 | 2.1063 | 2.9975 | 19.6418 | 8.2272 | 7.7305 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3535 | 2.202 | 3.0539 | 3.8439 | 8.1719 | 8.0926 |
| LearningToPaint | 96 | 0.4124 | 1.9651 | 2.8324 | 23.8303 | 7.2019 | 6.8944 |
| squeezenet1_1 | 32 | 0.2563 | 0.9557 | 1.3863 | 4.5328 | 4.0598 | 3.8616 |
| nvidia_deeprecommender | 256 | 0.1895 | 0.4298 | 0.6854 | 2.4393 | 4.0142 | 3.7143 |
| drq | 1 | 0.1402 | 0.4424 | 0.8198 | 3.4662 | 3.7694 | 3.1945 |
| vgg16 | 64 | 0.1869 | 0.6441 | 1.0464 | 2.4609 | 3.6811 | 3.2422 |
| dlrm | 2048 | 0.4444 | 0.8198 | nan | nan | 3.4517 | nan |
| soft_actor_critic | 256 | 0.2031 | 0.3372 | 0.4948 | 1.5206 | 3.0611 | 2.6231 |
| alexnet | 128 | 0.1421 | 0.4161 | 0.6606 | 2.3558 | 2.9654 | 2.6911 |
| dcgan | 32 | 0.1641 | 0.4494 | 0.6683 | 3.7309 | 2.678 | 2.4053 |
| lennard_jones | 1000 | 0.1381 | 0.289 | 0.4429 | 1.0648 | 1.9631 | 1.736 |
| tts_angular | 64 | 0.2061 | 0.2786 | 0.3976 | 1.0162 | 1.8605 | 1.6749 |
| demucs | 4 | 0.2929 | 0.2934 | 0.2977 | 0.2969 | 0.2011 | 0.1967 |
| hf_GPT2_large | 4 | 4.9818 | 19.3363 | nan | nan | nan | 143.1625 |
| tacotron2 | 64 | 16.7009 | 28.6252 | nan | nan | nan | 106.378 |
| hf_T5 | 8 | 2.1787 | 9.4406 | nan | nan | nan | 44.804 |
| hf_Longformer | 2 | 5.7342 | 13.862 | 78.3703 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | nan | 1.4314 | 1.4314 |
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | nan | 1.4036 | 1.4036 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2637 | 0.7837 | 1.3107 | 1.3377 |
| Super_SloMo | 6 | 1.0024 | 0.9527 | 0.363 | nan | 1.1858 | 1.1912 |
| timm_efficientdet | 1 | 1.0111 | 0.823 | nan | nan | 1.1165 | 1.1428 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.7638 | 1.1005 | 1.1105 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3374 | 0.9742 | 1.0823 | 1.1267 |
| timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.9478 | 1.0219 | 1.0495 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.35 | 0.8662 | 0.9791 | 1.0072 |
| hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3701 | nan | 0.9703 | 1.1094 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9284 | 0.9323 |
| Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | 0.9749 | 0.9212 | 0.9238 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8814 | 0.9151 | 0.919 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | 1.0085 | 0.9023 | 0.9928 |
| timm_resnest | 32 | 0.9935 | 0.8793 | 0.3235 | 0.8021 | 0.8982 | 0.9697 |
| speech_transformer | 32 | 0.9982 | 0.9159 | nan | nan | 0.896 | 0.8996 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9173 | 0.3919 | 0.9169 | 0.8848 | 0.9654 |
| hf_Albert | 8 | 0.9333 | 0.9333 | 0.2846 | nan | 0.8836 | 1.2215 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8681 | 0.8829 | 0.8964 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8737 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | 0.801 | 0.8616 | 1.0285 |
| pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8496 | 0.859 | 0.8608 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.797 | 0.8564 | 0.8913 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3435 | 0.8551 | 0.8562 | 0.9307 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.3331 | 0.8263 | 0.8531 | 0.8659 |
| hf_Bart | 4 | 0.9617 | 0.8598 | nan | nan | 0.8503 | 1.1284 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3385 | nan | 0.8354 | 1.0952 |
| resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3596 | 0.8203 | 0.8303 | 0.8352 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 | 1.0689 |
| hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4301 | nan | 0.8211 | 1.0393 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.7903 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8772 | 0.7632 | 0.8778 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 | 0.9991 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3305 | 0.8104 | 0.7478 | 0.8187 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7455 | 0.743 | 0.8332 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3201 | 0.7741 | 0.7286 | 0.7339 |
| LearningToPaint | 96 | 0.9442 | 0.6896 | 0.3385 | 0.6503 | 0.7133 | 0.7462 |
| hf_Bert | 4 | 0.9683 | 0.9011 | 0.3525 | nan | 0.7048 | 0.985 |
| dlrm | 2048 | 0.7302 | 0.7305 | nan | nan | 0.7035 | nan |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3593 | 0.6971 | 0.6902 | 0.7049 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3212 | nan | 0.6596 | 0.9466 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6639 | 0.6471 | 0.6497 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 1.0947 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | nan | 0.4682 | 0.6183 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.429 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4456 | 0.8227 | 0.4056 | 0.4212 |
| hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.299 | 0.9882 |
| hf_T5 | 8 | 0.9527 | 0.9415 | nan | nan | nan | 1.1507 |
| tacotron2 | 64 | 0.9906 | 1.093 | nan | nan | nan | 1.1496 |
| hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 |
| hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2945 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|                  name                   | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|            YituTechConvBert             |  1  | 1.0285 |  0.9414   |      0.0       |     0.0     |  3.7345  |         1.5254         |
|                CamemBert                |  1  | 1.0493 |  0.9732   |     1.3251     |     0.0     |  2.3889  |         1.5405         |
|       MT5ForConditionalGeneration       |  8  | 1.0272 |  0.9263   |      0.0       |     0.0     |  2.2531  |         1.9848         |
|               DistillGPT2               |  1  | 1.0322 |  0.9458   |     1.0657     |     0.0     |  2.099   |         1.9009         |
|          MobileBertForMaskedLM          | 32  | 1.023  |  0.9232   |      0.0       |     0.0     |  1.9829  |         1.574          |
|               GoogleFnet                |  1  | 0.9985 |  0.8173   |     0.9815     |   1.1247    |  1.9188  |         1.1214         |
|      GPT2ForSequenceClassification      |  4  | 1.0002 |  0.9779   |      0.0       |     0.0     |  1.6662  |         1.6568         |
|       T5ForConditionalGeneration        |  4  | 1.0029 |  0.9667   |      0.0       |     0.0     |  1.4388  |         1.4275         |
|     M2M100ForConditionalGeneration      |  8  | 1.0412 |  0.8942   |     1.0013     |     0.0     |  1.4178  |         1.4085         |
|     MobileBertForQuestionAnswering      | 64  | 1.024  |  0.9187   |      0.0       |     0.0     |  1.4036  |         1.2789         |
|           ElectraForCausalLM            | 32  | 1.0004 |  0.9312   |      0.0       |     0.0     |  1.3702  |         1.4028         |
|       ElectraForQuestionAnswering       | 64  | 1.0005 |  0.9844   |      0.0       |     0.0     |  1.3541  |         1.3368         |
|       AlbertForQuestionAnswering        |  4  | 1.0002 |  1.0018   |      0.0       |     0.0     |  1.2567  |         1.2522         |
|            AlbertForMaskedLM            |  4  | 0.9993 |  0.9996   |      0.0       |     0.0     |   1.25   |         1.2519         |
|    LayoutLMForSequenceClassification    | 16  | 1.0001 |  0.9892   |     0.7379     |     0.0     |  1.2473  |         1.2318         |
|                 T5Small                 |  1  | 1.0191 |  0.9543   |      0.0       |     0.0     |  1.2442  |         1.2308         |
|     PLBartForConditionalGeneration      | 16  | 1.0124 |  0.9613   |      0.0       |     0.0     |  1.1874  |         1.188          |
|             OPTForCausalLM              | 32  | 1.0037 |   0.932   |      0.0       |     0.0     |  1.1825  |         1.1983         |
|             XGLMForCausalLM             |  8  | 1.0128 |  0.9394   |      0.0       |     0.0     |  1.1706  |         1.1753         |
|           LayoutLMForMaskedLM           | 16  | 1.0002 |   0.971   |      0.0       |     0.0     |  1.1633  |         1.1716         |
|     DistilBertForQuestionAnswering      | 64  | 0.9997 |   0.985   |     0.7131     |     0.0     |  1.1444  |         1.1262         |
|           RobertaForCausalLM            | 64  | 1.0004 |  0.9637   |     0.7465     |     0.0     |  1.1133  |         1.1212         |
|         Speech2Text2ForCausalLM         | 128 | 0.9989 |  0.9259   |     0.6593     |     0.0     |   1.11   |         1.1484         |
|                 BigBird                 |  1  | 0.9894 |   0.937   |     0.991      |     0.0     |  1.1023  |         1.0034         |
|             BartForCausalLM             |  4  | 1.0007 |  0.9668   |      0.0       |     0.0     |  1.0962  |         1.1067         |
|      BartForConditionalGeneration       |  2  | 1.0009 |  0.9887   |      0.0       |     0.0     |  1.0962  |         1.0896         |
|    MegatronBertForQuestionAnswering     | 16  | 1.038  |  1.0104   |     0.7572     |     0.0     |  1.0947  |         1.0716         |
|      MBartForConditionalGeneration      | 16  | 1.0102 |  0.9766   |      0.0       |     0.0     |  1.0887  |         1.0775         |
|           DebertaForMaskedLM            |  4  | 0.9321 |  0.8111   |     0.7317     |     0.0     |  1.0885  |         1.0732         |
|         MegatronBertForCausalLM         | 16  | 1.0332 |  1.0027   |     0.7578     |     0.0     |  1.087   |         1.0785         |
|     PegasusForConditionalGeneration     | 16  | 1.0101 |  0.9819   |     0.7569     |     0.0     |  1.0857  |         1.0825         |
|        BertForQuestionAnswering         | 128 | 0.9997 |  0.9882   |      0.0       |     0.0     |  1.0722  |         1.0661         |
|       RobertaForQuestionAnswering       | 128 | 1.0002 |  0.9942   |      0.0       |     0.0     |  1.0696  |         1.0709         |
| BlenderbotSmallForConditionalGeneration | 64  | 1.0005 |  0.9265   |      0.0       |     0.0     |  1.0628  |         1.0696         |
|       DebertaForQuestionAnswering       |  8  | 0.9976 |  0.9917   |     0.6821     |     0.0     |  1.0623  |         1.2025         |
|          DistilBertForMaskedLM          | 64  |  1.0   |  0.9519   |     0.7122     |     0.0     |  1.0362  |         1.0546         |
|             BertForMaskedLM             | 64  | 1.0003 |  0.9524   |     0.7302     |     0.0     |  1.0338  |         1.0381         |
|            PLBartForCausalLM            | 32  | 1.0055 |  0.9348   |     0.7321     |     0.0     |  1.0224  |         1.0494         |
|       BlenderbotSmallForCausalLM        | 64  | 1.0022 |  0.9105   |     0.6827     |     0.0     |  1.0131  |         1.0345         |
|            TrOCRForCausalLM             | 32  | 1.0017 |  0.9556   |      0.0       |     0.0     |  0.9981  |         1.0096         |
|            MBartForCausalLM             | 32  | 1.0013 |  0.9555   |      0.0       |     0.0     |  0.9967  |         1.0069         |
|           PegasusForCausalLM            | 32  | 0.9998 |   0.953   |     0.7325     |     0.0     |  0.9888  |         1.0008         |
|          AllenaiLongformerBase          |  1  | 0.953  |  0.7915   |     0.7884     |     0.0     |   0.0    |          0.0           |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
|                  name                   | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser |  inductor   | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
|               GoogleFnet                | 1  | pass  |   pass    |      pass      |    pass     |    pass     |          pass          |
|       MT5ForConditionalGeneration       | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|         Speech2Text2ForCausalLM         | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|            AlbertForMaskedLM            | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|       AlbertForQuestionAnswering        | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|             BartForCausalLM             | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|      BartForConditionalGeneration       | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
| BlenderbotSmallForConditionalGeneration | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|      GPT2ForSequenceClassification      | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|            MBartForCausalLM             | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|          MobileBertForMaskedLM          | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|           RobertaForCausalLM            | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|     MobileBertForQuestionAnswering      | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|             OPTForCausalLM              | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|       T5ForConditionalGeneration        | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|                 T5Small                 | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|            TrOCRForCausalLM             | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|             XGLMForCausalLM             | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|            XLNetLMHeadModel             | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|            YituTechConvBert             | 1  | pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |          pass          |
|             BertForMaskedLM             | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|       RobertaForQuestionAnswering       | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|     PegasusForConditionalGeneration     | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|               DistillGPT2               | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|        BertForQuestionAnswering         | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|                 BigBird                 | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|       BlenderbotSmallForCausalLM        | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|                CamemBert                | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|           DebertaForMaskedLM            | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|       DebertaForQuestionAnswering       | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|          DistilBertForMaskedLM          | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|           PegasusForCausalLM            | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|     DistilBertForQuestionAnswering      | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|           ElectraForCausalLM            | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|       ElectraForQuestionAnswering       | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|           LayoutLMForMaskedLM           | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|    LayoutLMForSequenceClassification    | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|     M2M100ForConditionalGeneration      | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|         MegatronBertForCausalLM         | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|    MegatronBertForQuestionAnswering     | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|            PLBartForCausalLM            | 1  | pass  |   pass    |      pass      | fail_to_run |    pass     |          pass          |
|          AllenaiLongformerBase          | 1  | pass  |   pass    |      pass      | fail_to_run | fail_to_run |      fail_to_run       |
|      MBartForConditionalGeneration      | 1  | pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |      fail_to_run       |
|     PLBartForConditionalGeneration      | 1  | pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |      fail_to_run       |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|                  name                   | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|             XGLMForCausalLM             |  8  | 2.2364 |  12.2125  |      nan       |     nan     | 203.4086 |        201.0863        |
|           DebertaForMaskedLM            |  4  | 4.684  |  11.0814  |    44.7781     |     nan     | 163.7151 |        106.9608        |
|       DebertaForQuestionAnswering       |  8  | 4.5483 |  11.6349  |     43.993     |     nan     | 152.0741 |        118.2059        |
|     M2M100ForConditionalGeneration      |  8  | 2.7543 |  15.4794  |     23.643     |     nan     | 128.0751 |        124.2115        |
|            YituTechConvBert             |  1  | 2.0946 |  9.5284   |      nan       |     nan     | 115.4649 |        119.3641        |
|       MT5ForConditionalGeneration       |  8  | 3.4744 |  13.6659  |      nan       |     nan     | 90.4534  |        91.1223         |
|          MobileBertForMaskedLM          | 32  | 7.7855 |  27.1609  |      nan       |     nan     | 88.9601  |        85.7795         |
|     MobileBertForQuestionAnswering      | 64  | 7.9327 |  27.5186  |      nan       |     nan     | 74.7874  |         71.876         |
|         MegatronBertForCausalLM         | 16  | 3.0219 |  12.5327  |    19.6699     |     nan     | 61.5191  |        59.8845         |
|    MegatronBertForQuestionAnswering     | 16  | 3.0691 |  13.2977  |    19.1034     |     nan     | 60.2609  |        58.2808         |
|    LayoutLMForSequenceClassification    | 16  | 1.6734 |  6.6917   |    10.1343     |     nan     | 59.7267  |         60.187         |
|       T5ForConditionalGeneration        |  4  | 2.1399 |  8.8895   |      nan       |     nan     | 58.3394  |        57.0848         |
|     PegasusForConditionalGeneration     | 16  | 2.6227 |  14.7158  |    24.2283     |     nan     | 58.1897  |        54.3056         |
|      BartForConditionalGeneration       |  2  | 2.8248 |  15.0065  |      nan       |     nan     | 57.0652  |        54.7753         |
|                 T5Small                 |  1  | 2.1902 |  8.9903   |      nan       |     nan     | 55.4364  |        53.2137         |
|      MBartForConditionalGeneration      | 16  | 2.7868 |  15.512   |      nan       |     nan     | 54.3119  |        53.1455         |
|     PLBartForConditionalGeneration      | 16  | 1.3887 |   8.298   |      nan       |     nan     | 47.5246  |        46.3964         |
| BlenderbotSmallForConditionalGeneration | 64  | 1.7139 |  10.0168  |      nan       |     nan     | 43.6075  |        41.5748         |
|                 BigBird                 |  1  | 7.296  |  13.5333  |    29.6711     |     nan     | 40.7238  |        26.8699         |
|           ElectraForCausalLM            | 32  | 1.2891 |  6.2441   |      nan       |     nan     | 40.6712  |         39.969         |
|               DistillGPT2               |  1  | 0.6422 |  3.1221   |     4.4918     |     nan     | 33.8479  |        32.6814         |
|           LayoutLMForMaskedLM           | 16  | 1.6131 |  6.6316   |      nan       |     nan     | 32.8126  |        32.5964         |
|             BertForMaskedLM             | 64  | 1.2973 |  6.3901   |     9.4361     |     nan     |  32.777  |        31.6779         |
|       ElectraForQuestionAnswering       | 64  | 1.3222 |  6.4111   |      nan       |     nan     | 32.5117  |        31.4854         |
|      GPT2ForSequenceClassification      |  4  | 1.2751 |  6.1953   |      nan       |     nan     | 32.0765  |        31.1399         |
|           RobertaForCausalLM            | 64  | 1.3104 |  6.1902   |     9.2915     |     nan     | 28.0396  |        27.4422         |
|        BertForQuestionAnswering         | 128 | 1.3166 |  6.2802   |      nan       |     nan     | 27.7294  |        27.1936         |
|           PegasusForCausalLM            | 32  | 1.0161 |   5.707   |     8.775      |     nan     | 27.1087  |        25.1376         |
|            MBartForCausalLM             | 32  | 0.9522 |  5.5767   |      nan       |     nan     | 25.4243  |        24.6154         |
|       RobertaForQuestionAnswering       | 128 | 1.3205 |   6.387   |      nan       |     nan     | 24.5494  |        23.8515         |
|            TrOCRForCausalLM             | 32  | 0.9241 |  5.5701   |      nan       |     nan     | 24.4333  |        24.1797         |
|             BartForCausalLM             |  4  | 1.0079 |  5.6176   |      nan       |     nan     | 24.3593  |        23.6588         |
|            AlbertForMaskedLM            |  4  | 1.1157 |  5.8703   |      nan       |     nan     | 23.8611  |        23.0601         |
|               GoogleFnet                |  1  | 0.7904 |  3.3495   |    10.4595     |   9.6049    | 23.8114  |        16.1369         |
|       BlenderbotSmallForCausalLM        | 64  | 0.6439 |  3.7467   |     5.6889     |     nan     |  23.625  |        22.6972         |
|          DistilBertForMaskedLM          | 64  | 0.4729 |  2.9552   |     5.8879     |     nan     | 23.0127  |         22.634         |
|       AlbertForQuestionAnswering        |  4  | 1.1461 |  5.9483   |      nan       |     nan     | 22.7287  |        21.5179         |
|             OPTForCausalLM              | 32  | 1.0353 |   5.881   |      nan       |     nan     | 21.8562  |        20.7457         |
|     DistilBertForQuestionAnswering      | 64  | 0.4816 |  3.0171   |     5.9235     |     nan     | 21.8186  |        22.1039         |
|                CamemBert                |  1  |  1.38  |  6.1479   |     8.5874     |     nan     | 21.7413  |        21.2151         |
|         Speech2Text2ForCausalLM         | 128 | 0.577  |  2.9045   |     4.6098     |     nan     | 19.6271  |         18.24          |
|            PLBartForCausalLM            | 32  | 0.4938 |  2.9552   |     4.3734     |     nan     | 18.8954  |        18.2071         |
|          AllenaiLongformerBase          |  1  | 5.9078 |  14.4262  |    80.0409     |     nan     |   nan    |          nan           |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|                  name                   | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|      GPT2ForSequenceClassification      |  4  | 0.9343 |  0.9093   |      nan       |     nan     |  1.0596  |         1.1223         |
|       AlbertForQuestionAnswering        |  4  |  1.0   |  0.9425   |      nan       |     nan     |  0.8646  |         1.4039         |
|                 T5Small                 |  1  |  1.0   |  0.9155   |      nan       |     nan     |  0.8564  |         1.0758         |
|     PegasusForConditionalGeneration     | 16  | 0.9985 |  0.9629   |     0.3704     |     nan     |  0.8436  |         1.0204         |
|            AlbertForMaskedLM            |  4  |  1.0   |  0.9255   |      nan       |     nan     |  0.842   |         1.3737         |
|                 BigBird                 |  1  | 0.999  |  0.9542   |     0.4215     |     nan     |  0.8224  |         1.0108         |
|       T5ForConditionalGeneration        |  4  |  1.0   |  0.9597   |      nan       |     nan     |  0.8215  |         1.1049         |
|               DistillGPT2               |  1  | 0.9984 |  0.8218   |     0.3795     |     nan     |  0.8173  |         0.9383         |
|             XGLMForCausalLM             |  8  | 0.9848 |  0.9137   |      nan       |     nan     |  0.8157  |         0.9642         |
|            YituTechConvBert             |  1  | 0.9858 |  0.8198   |      nan       |     nan     |  0.808   |         0.8738         |
|      BartForConditionalGeneration       |  2  |  1.0   |   0.893   |      nan       |     nan     |  0.7817  |         0.9515         |
|           PegasusForCausalLM            | 32  | 0.9593 |  0.9232   |     0.3909     |     nan     |  0.7774  |         0.9692         |
|     M2M100ForConditionalGeneration      |  8  | 1.007  |  0.9507   |     0.3799     |     nan     |  0.7712  |         1.016          |
|               GoogleFnet                |  1  | 0.9983 |  0.9453   |     0.3715     |   1.0813    |  0.7698  |         0.9373         |
|       MT5ForConditionalGeneration       |  8  | 1.0034 |  0.8861   |      nan       |     nan     |  0.7623  |         0.9396         |
|    MegatronBertForQuestionAnswering     | 16  |  1.0   |  0.8671   |     0.3483     |     nan     |  0.7528  |         0.9646         |
|                CamemBert                |  1  | 0.998  |  0.8252   |     0.3614     |     nan     |  0.7492  |         0.9186         |
|     PLBartForConditionalGeneration      | 16  |  1.0   |  0.8743   |      nan       |     nan     |  0.7397  |         0.9638         |
|            PLBartForCausalLM            | 32  | 0.9999 |   0.861   |     0.3948     |     nan     |  0.7381  |         0.9055         |
|      MBartForConditionalGeneration      | 16  |  1.0   |  0.8583   |      nan       |     nan     |  0.7209  |         0.9059         |
|    LayoutLMForSequenceClassification    | 16  |  1.0   |  0.9348   |     0.3324     |     nan     |  0.7189  |         1.0246         |
|         MegatronBertForCausalLM         | 16  | 0.9995 |  0.8826   |     0.352      |     nan     |  0.7161  |         0.9248         |
|             BartForCausalLM             |  4  |  1.0   |  0.9121   |      nan       |     nan     |  0.7149  |         0.9466         |
|       BlenderbotSmallForCausalLM        | 64  |  1.0   |  0.8401   |     0.3879     |     nan     |  0.7147  |         0.8647         |
|       ElectraForQuestionAnswering       | 64  |  1.0   |  0.9524   |      nan       |     nan     |  0.7054  |         1.0298         |
|     DistilBertForQuestionAnswering      | 64  |  1.0   |  0.9373   |     0.3178     |     nan     |  0.6981  |         0.9303         |
| BlenderbotSmallForConditionalGeneration | 64  |  1.0   |  0.8975   |      nan       |     nan     |  0.6977  |         0.946          |
|           LayoutLMForMaskedLM           | 16  |  1.0   |  0.9409   |      nan       |     nan     |  0.695   |         0.9772         |
|            MBartForCausalLM             | 32  | 0.9999 |   0.89    |      nan       |     nan     |  0.6836  |         0.8978         |
|            TrOCRForCausalLM             | 32  | 0.9999 |  0.8898   |      nan       |     nan     |  0.6827  |         0.8876         |
|         Speech2Text2ForCausalLM         | 128 | 0.9552 |  0.8765   |     0.3524     |     nan     |  0.6775  |         0.8801         |
|             OPTForCausalLM              | 32  | 0.9982 |  0.8655   |      nan       |     nan     |  0.6761  |         0.8847         |
|           ElectraForCausalLM            | 32  | 0.9994 |   0.883   |      nan       |     nan     |  0.6731  |         0.905          |
|          DistilBertForMaskedLM          | 64  |  1.0   |  0.8899   |     0.3665     |     nan     |  0.6531  |         0.9124         |
|             BertForMaskedLM             | 64  |  1.0   |  0.9219   |     0.3646     |     nan     |  0.6385  |         0.8993         |
|           RobertaForCausalLM            | 64  | 0.9986 |  0.9206   |     0.3641     |     nan     |  0.6375  |         0.8975         |
|       RobertaForQuestionAnswering       | 128 |  1.0   |   0.968   |      nan       |     nan     |  0.6329  |         0.8939         |
|        BertForQuestionAnswering         | 128 |  1.0   |   0.968   |      nan       |     nan     |  0.6329  |         0.8939         |
|          MobileBertForMaskedLM          | 32  | 0.9998 |  0.9103   |      nan       |     nan     |  0.5256  |         0.7111         |
|     MobileBertForQuestionAnswering      | 64  |  1.0   |   0.984   |      nan       |     nan     |  0.4536  |         0.5968         |
|           DebertaForMaskedLM            |  4  |  1.0   |  0.9851   |     0.3553     |     nan     |  0.4267  |         1.0347         |
|       DebertaForQuestionAnswering       |  8  | 0.9816 |   1.063   |     0.3072     |     nan     |  0.3264  |         1.1588         |
|          AllenaiLongformerBase          |  1  | 0.9981 |  0.9515   |     0.3209     |     nan     |   nan    |          nan           |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
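
The per-model speedups in this suite feed the geometric mean speedup reported in the summary. The aggregation script is not shown in this comment, so the following is only a minimal sketch of how such a mean could be computed. It assumes that a speedup of `0.0` marks a model with no valid measurement for that backend and that such entries are dropped before averaging; `geomean_speedup` is a hypothetical helper name, not part of the benchmark harness.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups, skipping failed runs.

    Assumption (not confirmed by the dashboard): a value of 0.0 or NaN
    means the model produced no valid measurement and is excluded.
    """
    valid = [s for s in speedups if s > 0.0 and not math.isnan(s)]
    if not valid:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow
    # from multiplying many ratios together.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# A few inductor speedups copied from the huggingface table
# (YituTechConvBert, CamemBert, MT5ForConditionalGeneration),
# plus the 0.0 entry for AllenaiLongformerBase, which is skipped.
sample = [3.7345, 2.3889, 2.2531, 0.0]
print(f"{geomean_speedup(sample):.2f}x")
```

A geometric mean is the usual choice for averaging ratios, because a 2x speedup and a 0.5x slowdown then cancel out to 1x.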

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|          ghostnet_100           | 128 | 0.9992 |  0.9956   |     0.8421     |   1.2485    |  1.8144  |         1.7733         |
|            lcnet_050            | 128 | 0.9568 |  0.9489   |     0.7675     |   1.4962    |  1.6425  |         1.6316         |
|         coat_lite_mini          | 128 |  1.0   |    1.0    |     0.8447     |   1.0566    |  1.6056  |         1.5895         |
|           regnety_002           | 128 | 0.9778 |  0.9844   |     0.8615     |   1.3561    |  1.4813  |         1.3447         |
|           dm_nfnet_f0           | 128 |  1.0   |  1.0003   |      0.0       |   1.2124    |  1.4725  |         1.422          |
|      xcit_large_24_p8_224       |  5  | 1.003  |  1.0032   |      0.0       |     0.0     |  1.4529  |         1.4094         |
|            hrnet_w18            | 128 | 0.9999 |  0.9985   |      0.0       |   1.3201    |  1.418   |         1.3775         |
|           volo_d1_224           | 64  | 0.9999 |  0.9959   |      0.0       |   1.1295    |  1.3859  |         1.3634         |
|             dla102              | 128 | 1.0002 |  1.0008   |      0.0       |   1.2853    |  1.3821  |         1.3693         |
|            nfnet_l0             | 128 | 0.9997 |  0.7891   |      0.0       |   1.0518    |  1.3733  |         1.3288         |
|        res2net50_14w_8s         | 128 | 0.9999 |    1.0    |      0.0       |   1.2307    |  1.3564  |         1.3208         |
|         mobilenetv2_100         | 128 | 0.9662 |  0.9648   |     0.7065     |   1.0145    |  1.3373  |         1.3526         |
|      mobilenetv3_large_100      | 128 | 0.9664 |  0.9632   |     0.7654     |   1.1624    |  1.3356  |         1.3413         |
|         crossvit_9_240          | 128 | 0.9999 |  0.9988   |      0.0       |   1.0243    |  1.3305  |         1.3051         |
|        adv_inception_v3         | 128 |  1.0   |   0.999   |      0.0       |   1.1253    |  1.328   |         1.3083         |
|       gluon_inception_v3        | 128 |  1.0   |  0.9988   |      0.0       |   1.1224    |  1.3249  |         1.3075         |
|          inception_v3           | 128 |  1.0   |   0.999   |      0.0       |   1.1257    |  1.3244  |         1.3076         |
|           res2next50            | 128 |  1.0   |  1.0009   |      0.0       |    1.166    |  1.3121  |         1.2748         |
|           resnest101e           | 64  | 1.0001 |  1.0035   |      0.0       |   1.1963    |  1.3115  |         1.2714         |
|          gmixer_24_224          | 128 | 0.9999 |  0.8348   |      0.0       |    0.98     |  1.2974  |         1.2696         |
|            fbnetv3_b            | 128 | 0.9642 |  0.9614   |     0.7623     |   1.1326    |  1.283   |         1.2951         |
|          botnet26t_256          | 128 | 0.9851 |  0.9857   |     0.7892     |   1.2271    |  1.2742  |         1.2801         |
|          jx_nest_base           | 32  | 0.9998 |  0.9926   |      0.0       |    1.217    |  1.2725  |         1.2481         |
|        sebotnet33ts_256         | 64  | 0.9753 |  0.8072   |      0.0       |   1.0528    |  1.2706  |         1.2762         |
|       eca_botnext26ts_256       | 128 | 0.9867 |  0.7721   |      0.0       |   1.0301    |  1.2706  |         1.2477         |
|           selecsls42b           | 128 | 0.9998 |  0.9991   |     0.8157     |   1.2083    |  1.2671  |         1.2514         |
|       tf_efficientnet_b0        | 128 | 0.9776 |  0.7843   |      0.0       |   0.9848    |  1.2613  |         1.2686         |
|           mnasnet_100           | 128 | 0.9663 |  0.9639   |     0.7855     |   1.1575    |  1.2598  |         1.2787         |
|        eca_halonext26ts         | 128 | 0.9877 |  0.7787   |      0.0       |   1.0289    |  1.2502  |         1.2494         |
|           fbnetc_100            | 128 | 0.967  |  0.9622   |     0.7908     |   1.1879    |  1.2497  |         1.2635         |
|        ese_vovnet19b_dw         | 128 | 0.9795 |  0.9777   |     0.7445     |   1.1452    |  1.2404  |         1.2461         |
|          spnasnet_100           | 128 | 0.9605 |  0.9573   |     0.7734     |   1.1366    |  1.2375  |         1.2543         |
|          cspdarknet53           | 64  | 0.9581 |  0.9526   |     0.7322     |   1.1835    |  1.2287  |         1.2391         |
|        res2net101_26w_4s        | 64  | 0.9997 |  0.9972   |     0.7705     |   1.1739    |  1.2283  |         1.1885         |
|           convit_base           | 64  | 0.9998 |  0.9992   |      0.0       |    1.195    |  1.2216  |         1.2164         |
|            pit_b_224            | 64  | 1.0001 |  0.9996   |      0.0       |    1.055    |  1.221   |         1.211          |
|          gmlp_s16_224           | 128 |  1.0   |  0.9994   |      0.0       |   0.9989    |  1.2164  |         1.2053         |
|           rexnet_100            | 128 | 0.9723 |  0.8169   |      0.0       |   0.9835    |  1.2142  |         1.2193         |
|          pnasnet5large          | 16  | 0.9998 |  0.9985   |      0.0       |   1.0838    |  1.2112  |         1.1932         |
|            tinynet_a            | 128 | 0.9659 |  0.7757   |     0.6205     |   0.9713    |  1.1925  |         1.1949         |
|          cait_m36_384           |  4  | 0.9998 |    0.0    |      0.0       |     0.0     |  1.1826  |         1.158          |
|           tf_mixnet_l           | 128 | 0.9853 |  0.8897   |      0.0       |   1.0177    |  1.173   |         1.1697         |
|             dpn107              | 32  | 0.958  |  0.9367   |     0.7817     |   1.0288    |  1.1726  |         1.202          |
|           mobilevit_s           | 64  | 0.9792 |   0.762   |      0.0       |   0.9468    |  1.1702  |         1.1666         |
|            repvgg_a2            | 128 | 0.9641 |  0.9623   |     0.8288     |   1.1224    |  1.1692  |         1.1652         |
|         poolformer_m36          | 64  | 0.9998 |  0.9993   |      0.0       |     0.0     |  1.1661  |         1.1475         |
|            mixnet_l             | 128 | 0.9849 |  0.8858   |      0.0       |   1.0185    |  1.1534  |         1.1505         |
|        twins_pcpvt_base         | 64  | 1.0001 |  0.9974   |      0.75      |   1.0624    |  1.148   |         1.1172         |
|  swin_base_patch4_window7_224   | 64  | 0.9999 |  0.9785   |      0.0       |   0.9932    |  1.1469  |         1.1322         |
|          convnext_base          | 64  | 0.9999 |  0.9988   |      0.0       |   1.0441    |  1.1157  |         1.1262         |
|      beit_base_patch16_224      | 64  | 0.9998 |  0.9801   |      0.0       |   0.9504    |  1.1141  |         1.1053         |
|     swsl_resnext101_32x16d      | 32  | 1.0001 |  0.9988   |      0.0       |   1.1071    |  1.1068  |         1.0712         |
| deit_base_distilled_patch16_224 | 64  |  1.0   |  0.9995   |     0.7673     |   1.0156    |  1.0955  |         1.0834         |
|        gluon_xception65         | 32  | 0.9998 |  0.9975   |      0.0       |   1.0403    |  1.0871  |         1.0759         |
|      vit_base_patch16_224       | 64  | 1.0002 |   0.999   |     0.7662     |   0.9763    |  1.0855  |         1.0734         |
|          mixer_b16_224          | 128 | 1.0006 |  1.0001   |      0.0       |   0.9771    |  1.0808  |         1.0736         |
|        convmixer_768_32         | 32  | 0.9999 |  1.0002   |      0.0       |   1.0615    |  1.0783  |         1.0744         |
|            gernet_l             | 128 | 0.9744 |  0.9723   |     0.8239     |   1.0992    |  1.075   |         1.0704         |
|         visformer_small         | 128 | 1.0001 |  1.0022   |     0.797      |   1.0217    |  1.0495  |         1.0162         |
|          resmlp_12_224          | 128 | 0.9999 |   1.001   |     0.6956     |     0.0     |  0.9499  |         0.9719         |
|        tnt_s_patch16_224        | 128 |  1.0   |  0.9992   |      0.0       |   1.6263    |   0.0    |         1.5436         |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
|              name               | bs | eager |  aot_eager  | aot_cudagraphs |  aot_nvfuser  |   inductor    | inductor_no_cudagraphs |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
|        adv_inception_v3         | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|          botnet26t_256          | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|        sebotnet33ts_256         | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|           selecsls42b           | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|          spnasnet_100           | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|     swsl_resnext101_32x16d      | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|       tf_efficientnet_b0        | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|           tf_mixnet_l           | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|            tinynet_a            | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|        twins_pcpvt_base         | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|         visformer_small         | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|      vit_base_patch16_224       | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|      beit_base_patch16_224      | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|          convnext_base          | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|         crossvit_9_240          | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|           dm_nfnet_f0           | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|          gmixer_24_224          | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|          gmlp_s16_224           | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|          jx_nest_base           | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|  swin_base_patch4_window7_224   | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|        tnt_s_patch16_224        | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|           volo_d1_224           | 2  | pass  |    pass     |  fail_to_run   |     pass      |     pass      |          pass          |
|          resmlp_12_224          | 2  | pass  |    pass     |      pass      |  fail_to_run  |     pass      |          pass          |
|           convit_base           | 2  | pass  |    pass     |  fail_to_run   |  fail_to_run  |     pass      |          pass          |
|      xcit_large_24_p8_224       | 2  | pass  |    pass     |  fail_to_run   |  fail_to_run  |     pass      |          pass          |
|          cait_m36_384           | 2  | pass  | fail_to_run |  fail_to_run   |  fail_to_run  |     pass      |          pass          |
|        gluon_xception65         | 2  | pass  |    pass     |      pass      | fail_accuracy |     pass      |          pass          |
|         poolformer_m36          | 2  | pass  |    pass     |      pass      | fail_accuracy |     pass      |          pass          |
| deit_base_distilled_patch16_224 | 2  | pass  |    pass     |      pass      |     pass      |     pass      |     fail_accuracy      |
|           rexnet_100            | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|           res2next50            | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|        res2net50_14w_8s         | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|        res2net101_26w_4s        | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|         coat_lite_mini          | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|        convmixer_768_32         | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|          cspdarknet53           | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|             dla102              | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|             dpn107              | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|       eca_botnext26ts_256       | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|        eca_halonext26ts         | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|        ese_vovnet19b_dw         | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|           fbnetc_100            | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|            gernet_l             | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|          ghostnet_100           | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|       gluon_inception_v3        | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|            hrnet_w18            | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|          inception_v3           | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|            lcnet_050            | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|          mixer_b16_224          | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|            mixnet_l             | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|           mnasnet_100           | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|         mobilenetv2_100         | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|      mobilenetv3_large_100      | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|           mobilevit_s           | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|            nfnet_l0             | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|            pit_b_224            | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|          pnasnet5large          | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|           regnety_002           | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|            repvgg_a2            | 2  | pass  |    pass     |      pass      |     pass      |     pass      |          pass          |
|            fbnetv3_b            | 2  | pass  |    pass     |      pass      |     pass      | fail_accuracy |     fail_accuracy      |
|           resnest101e           | 2  | pass  |    pass     |      pass      | fail_accuracy | fail_accuracy |     fail_accuracy      |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|        twins_pcpvt_base         | 64  | 2.064  |  13.0072  |    21.5012     |   42.855    | 431.1592 |        426.4103        |
|         coat_lite_mini          | 128 | 1.0194 |  5.4653   |     7.961      |   14.7686   | 362.4216 |        372.6703        |
|           mobilevit_s           | 64  | 1.5683 |  7.1641   |      nan       |   42.4621   | 233.8428 |        237.9062        |
|        eca_halonext26ts         | 128 | 1.4144 |  5.4751   |      nan       |   55.2357   | 204.8437 |        207.0974        |
|        sebotnet33ts_256         | 64  | 1.7651 |  6.6709   |      nan       |   51.039    | 185.8238 |        191.2608        |
|       eca_botnext26ts_256       | 128 | 1.3797 |  5.2911   |      nan       |   52.9221   | 179.8768 |        176.7545        |
|  swin_base_patch4_window7_224   | 64  | 2.5123 |  12.7354  |      nan       |   58.0591   | 177.0112 |        174.7488        |
|      xcit_large_24_p8_224       |  5  | 2.603  |  17.1709  |      nan       |     nan     | 172.3324 |        164.8544        |
|          jx_nest_base           | 32  | 1.6708 |  9.2321   |      nan       |   57.8786   | 155.4547 |        156.5451        |
|          convnext_base          | 64  | 1.2341 |  5.9929   |      nan       |   20.8438   | 133.0295 |        129.8216        |
|          cait_m36_384           |  4  | 2.6486 |    nan    |      nan       |     nan     | 132.7509 |         130.12         |
|            hrnet_w18            | 128 | 5.6217 |  31.9848  |      nan       |  251.7181   | 106.8258 |        100.7524        |
|          botnet26t_256          | 128 | 1.3057 |  4.4635   |    10.0598     |   40.2751   | 106.2411 |        103.5341        |
|         crossvit_9_240          | 128 | 1.3396 |  7.9862   |      nan       |   27.0701   | 97.9064  |        96.8689         |
|           resnest101e           | 64  | 2.998  |  16.9945  |      nan       |   78.2291   | 93.9541  |        89.7619         |
|          pnasnet5large          | 16  | 4.1626 |  22.9703  |      nan       |  123.7628   | 87.4338  |        84.1545         |
|           volo_d1_224           | 64  | 1.1595 |  7.6273   |      nan       |   28.0879   | 85.2424  |        83.6849         |
|          gmlp_s16_224           | 128 | 0.9511 |  6.2939   |      nan       |   13.365    | 71.7498  |        69.4367         |
|         visformer_small         | 128 | 0.9009 |   4.189   |     6.2793     |   24.3038   | 71.1462  |        69.6831         |
|            pit_b_224            | 64  | 0.9339 |  4.8631   |      nan       |   12.5251   | 66.2774  |        65.1378         |
|        res2net101_26w_4s        | 64  | 2.9852 |  17.3432  |    28.4155     |   80.897    | 55.6027  |        52.0513         |
|          gmixer_24_224          | 128 | 1.0133 |  7.3092   |      nan       |   16.5474   | 51.9895  |        50.5586         |
|           convit_base           | 64  | 0.9843 |  5.9421   |      nan       |   18.0525   | 50.9922  |         49.952         |
|        res2net50_14w_8s         | 128 | 2.5693 |  15.6494  |      nan       |   98.8662   | 50.8157  |        49.7271         |
|        gluon_xception65         | 32  | 1.6885 |  11.1965  |      nan       |   41.7582   | 49.2318  |        45.5937         |
|         poolformer_m36          | 64  | 1.8121 |  9.7062   |      nan       |     nan     | 47.0371  |        44.6651         |
|          resmlp_12_224          | 128 | 0.6088 |   2.794   |     5.5064     |     nan     | 42.3381  |        38.0426         |
|     swsl_resnext101_32x16d      | 32  | 1.6289 |  10.0288  |      nan       |   39.6141   | 41.9677  |        41.3616         |
|             dpn107              | 32  | 3.7727 |  14.7274  |    45.6394     |   76.1359   | 40.3245  |        37.6555         |
|          mixer_b16_224          | 128 | 0.6548 |  3.2155   |      nan       |   10.7856   | 37.0102  |        35.4768         |
| deit_base_distilled_patch16_224 | 64  | 0.8289 |   4.303   |     6.6094     |   10.4203   | 36.0592  |        34.6956         |
|        convmixer_768_32         | 32  | 1.0862 |  6.4498   |      nan       |   13.7196   | 35.8067  |        33.0945         |
|            fbnetv3_b            | 128 | 3.0734 |  11.1026  |    29.9803     |   76.0043   | 35.7771  |        33.8855         |
|      vit_base_patch16_224       | 64  | 0.8583 |  4.1826   |     6.5315     |   9.6845    | 35.7583  |        35.0589         |
|       gluon_inception_v3        | 128 | 1.4815 |  8.9849   |      nan       |   66.9443   | 35.0345  |        32.4497         |
|          inception_v3           | 128 | 1.4787 |  9.0238   |      nan       |   67.1459   | 34.8548  |        32.5473         |
|        adv_inception_v3         | 128 | 1.4876 |  8.9769   |      nan       |   66.9311   | 34.3905  |        32.5332         |
|           tf_mixnet_l           | 128 | 5.7484 |  13.3541  |      nan       |   68.7911   | 33.8729  |        32.1963         |
|          ghostnet_100           | 128 | 2.6432 |  9.6507   |    13.7666     |   58.927    |  32.695  |        30.8681         |
|      beit_base_patch16_224      | 64  | 1.0871 |  5.6134   |      nan       |   13.7621   | 32.6318  |        30.8008         |
|            mixnet_l             | 128 | 5.3204 |  12.7271  |      nan       |   67.9763   | 32.5983  |         31.893         |
|           dm_nfnet_f0           | 128 | 2.0094 |  7.6042   |      nan       |   29.9754   | 32.3805  |        29.3454         |
|             dla102              | 128 | 1.6603 |  10.0975  |      nan       |   63.1714   | 32.1124  |        30.2312         |
|           res2next50            | 128 | 1.4989 |  8.7791   |      nan       |   66.7002   | 29.6202  |        27.9053         |
|           rexnet_100            | 128 | 1.8062 |  7.4568   |      nan       |  102.1027   | 26.5523  |        25.3591         |
|            tinynet_a            | 128 | 1.9614 |  8.2078   |    20.2872     |   61.7507   | 25.7941  |        24.6542         |
|          cspdarknet53           | 64  | 2.2264 |  7.7188   |    20.8213     |   48.0307   | 23.2515  |        22.0433         |
|            nfnet_l0             | 128 | 1.7245 |  7.5828   |      nan       |   27.3095   | 23.1165  |        21.8966         |
|       tf_efficientnet_b0        | 128 | 1.7202 |  6.9673   |      nan       |   61.9316   | 22.7574  |        21.5149         |
|           fbnetc_100            | 128 | 1.9567 |  6.9499   |     18.078     |   45.3002   | 21.9517  |        20.7368         |
|          spnasnet_100           | 128 | 1.9161 |   6.665   |    17.4815     |   43.4797   | 21.4795  |        20.4556         |
|      mobilenetv3_large_100      | 128 | 1.5899 |  5.5688   |    13.4352     |   64.4429   | 19.9372  |        19.5642         |
|           mnasnet_100           | 128 | 1.6356 |  5.5127   |    14.0767     |   37.4665   | 18.8558  |        18.0133         |
|         mobilenetv2_100         | 128 | 1.6442 |  5.4933   |    13.7945     |   37.5793   | 18.5669  |        17.7858         |
|            gernet_l             | 128 | 1.8816 |  6.4469   |    16.2236     |   35.9904   | 18.4345  |        17.2115         |
|            repvgg_a2            | 128 | 1.8567 |  6.1905   |    15.7371     |   43.751    | 17.9569  |        16.9557         |
|           regnety_002           | 128 | 1.4855 |  5.8417   |    13.8786     |   46.2472   | 17.8219  |        17.3541         |
|           selecsls42b           | 128 | 0.7717 |  4.0352   |     5.8995     |   39.8612   | 16.4046  |        15.3492         |
|            lcnet_050            | 128 | 0.9705 |  3.4278   |     7.1291     |   31.167    | 13.6937  |         12.51          |
|        ese_vovnet19b_dw         | 128 | 0.9768 |   3.251   |     6.9304     |   30.8107   | 12.7375  |        11.8284         |
|        tnt_s_patch16_224        | 128 | 1.4723 |  10.2065  |      nan       |   22.8828   |   nan    |        50.0197         |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
|          gmixer_24_224          | 128 | 0.9951 |  0.9716   |      nan       |   0.9859    |  1.5612  |         1.6333         |
|            tinynet_a            | 128 | 0.9942 |  0.7796   |     0.2617     |   0.7823    |  1.351   |         1.3692         |
|            nfnet_l0             | 128 | 0.993  |  0.8272   |      nan       |   0.8084    |  1.2908  |         1.3392         |
|           rexnet_100            | 128 | 0.9935 |  0.7843   |      nan       |   0.8682    |  1.2619  |         1.2765         |
|       tf_efficientnet_b0        | 128 | 0.9935 |  0.7688   |      nan       |   0.8401    |  1.1889  |         1.199          |
|          pnasnet5large          | 16  | 1.069  |   1.011   |      nan       |   1.2062    |  1.1876  |         1.3282         |
|           mobilevit_s           | 64  | 0.9959 |  0.7668   |      nan       |   0.7405    |  1.1793  |         1.2286         |
|       eca_botnext26ts_256       | 128 | 0.9938 |  0.7675   |      nan       |   0.7612    |  1.1378  |         1.2076         |
|        eca_halonext26ts         | 128 | 0.9937 |  0.7687   |      nan       |   0.7643    |  1.1375  |         1.2068         |
|          cait_m36_384           |  4  | 0.9994 |    nan    |      nan       |     nan     |  1.1185  |         1.1745         |
|         mobilenetv2_100         | 128 | 0.9925 |  0.7621   |     0.3063     |   0.7635    |  1.1003  |         1.1104         |
|         poolformer_m36          | 64  | 0.998  |  0.9512   |      nan       |     nan     |  1.0527  |         1.069          |
|           dm_nfnet_f0           | 128 | 0.9358 |  0.8936   |      nan       |   0.9479    |  1.0218  |         1.0495         |
|      beit_base_patch16_224      | 64  | 0.9966 |  0.9545   |      nan       |   0.8606    |  1.0038  |         1.0607         |
|           resnest101e           | 64  | 0.9971 |  0.9519   |      nan       |    0.95     |  0.9994  |         1.0025         |
|      vit_base_patch16_224       | 64  | 0.9963 |  0.9434   |     0.3153     |   0.8229    |  0.997   |         1.0835         |
| deit_base_distilled_patch16_224 | 64  | 0.9964 |  0.9442   |     0.3138     |   0.8242    |  0.9925  |         1.0805         |
|        twins_pcpvt_base         | 64  | 0.9976 |  0.9195   |     0.3131     |   0.8403    |  0.9888  |         1.0866         |
|          ghostnet_100           | 128 | 0.9865 |  0.8768   |     0.3273     |   0.9345    |  0.9853  |         1.0102         |
|          mixer_b16_224          | 128 | 0.9952 |  0.9661   |      nan       |   0.8571    |  0.985   |         1.0538         |
|        convmixer_768_32         | 32  | 0.9986 |  0.9854   |      nan       |   0.9793    |  0.9836  |         0.9853         |
|           volo_d1_224           | 64  | 0.996  |  0.9213   |      nan       |   0.7472    |  0.9799  |         0.9971         |
|          gmlp_s16_224           | 128 | 0.9959 |  0.9783   |      nan       |   0.9704    |  0.9766  |         0.9827         |
|           tf_mixnet_l           | 128 | 0.9953 |   0.857   |      nan       |   0.8574    |  0.9711  |         1.0812         |
|            fbnetv3_b            | 128 | 0.9932 |  0.7828   |     0.3095     |    0.784    |  0.9696  |         0.977          |
|      xcit_large_24_p8_224       |  5  | 0.9981 |  0.9194   |      nan       |     nan     |  0.9611  |         1.0549         |
|          convnext_base          | 64  | 0.9975 |  0.9169   |      nan       |   0.7604    |  0.9576  |         0.9855         |
|             dla102              | 128 | 0.9831 |   0.917   |      nan       |   0.9529    |  0.9496  |         0.9538         |
|            hrnet_w18            | 128 | 0.9954 |  0.9252   |      nan       |   0.8649    |  0.9376  |         0.9419         |
|        gluon_xception65         | 32  | 0.9975 |  0.9365   |      nan       |   0.8982    |  0.9351  |         0.9376         |
|        res2net101_26w_4s        | 64  | 0.9968 |  0.9278   |     0.3243     |   0.8932    |  0.9269  |         0.9548         |
|          jx_nest_base           | 32  | 1.0002 |  0.8966   |      nan       |   0.7112    |  0.9187  |         1.0509         |
|        ese_vovnet19b_dw         | 128 | 0.9923 |  0.8877   |     0.3261     |   0.9302    |  0.9095  |         0.9161         |
|  swin_base_patch4_window7_224   | 64  | 0.9976 |  0.9288   |      nan       |    0.83     |  0.9068  |         1.0518         |
|             dpn107              | 32  | 0.9985 |  0.9271   |     0.3392     |   0.8941    |  0.9058  |         0.956          |
|           res2next50            | 128 | 0.9951 |  0.9153   |      nan       |   0.8618    |  0.9051  |         0.9312         |
|          spnasnet_100           | 128 | 0.989  |  0.9109   |     0.3309     |   0.8412    |  0.9047  |         0.9157         |
|            mixnet_l             | 128 | 0.9951 |   0.845   |      nan       |   0.7911    |  0.9014  |         1.0067         |
|      mobilenetv3_large_100      | 128 | 0.9876 |  0.8589   |     0.3244     |   0.8745    |  0.9007  |         0.9126         |
|         visformer_small         | 128 | 0.9943 |  0.9381   |     0.3293     |   0.9475    |  0.9006  |         0.951          |
|           selecsls42b           | 128 | 0.9883 |  0.8896   |     0.337      |   0.8954    |  0.899   |         0.9192         |
|        adv_inception_v3         | 128 | 0.9901 |  0.8617   |      nan       |   0.8724    |  0.8983  |         0.9073         |
|       gluon_inception_v3        | 128 | 0.9901 |  0.8617   |      nan       |   0.8724    |  0.8983  |         0.9073         |
|          inception_v3           | 128 | 0.9901 |  0.8617   |      nan       |   0.8724    |  0.8983  |         0.9073         |
|           mnasnet_100           | 128 | 0.9877 |  0.9019   |     0.3306     |   0.8279    |  0.8961  |         0.9077         |
|     swsl_resnext101_32x16d      | 32  | 0.9991 |  0.8972   |      nan       |   0.8675    |  0.8931  |         0.9249         |
|            lcnet_050            | 128 | 0.9672 |  0.7521   |     0.3171     |   0.7524    |  0.8921  |         0.923          |
|          cspdarknet53           | 64  | 0.9954 |  0.8528   |     0.316      |   0.8762    |  0.8835  |         0.8875         |
|        res2net50_14w_8s         | 128 | 0.9952 |  0.9049   |      nan       |   0.8611    |  0.881   |         0.9327         |
|           regnety_002           | 128 | 0.9717 |  0.8104   |     0.3283     |   0.7599    |  0.8617  |         0.8993         |
|          botnet26t_256          | 128 | 0.9915 |  0.8434   |     0.3165     |    0.745    |  0.8605  |         0.8702         |
|            pit_b_224            | 64  | 0.9968 |  0.7947   |      nan       |   0.6417    |  0.8417  |         1.0633         |
|           fbnetc_100            | 128 | 0.9891 |  0.8518   |     0.3236     |   0.7446    |  0.8416  |         0.8498         |
|        sebotnet33ts_256         | 64  | 0.9952 |  0.7084   |      nan       |   0.6831    |  0.841   |         0.9711         |
|         coat_lite_mini          | 128 | 1.0049 |  0.8777   |     0.3262     |   0.7873    |  0.8404  |         1.0528         |
|          resmlp_12_224          | 128 | 0.9893 |   0.943   |     0.2472     |     nan     |  0.8169  |         0.8253         |
|            gernet_l             | 128 | 0.9884 |  0.7892   |      0.32      |   0.7938    |  0.7928  |         0.8234         |
|            repvgg_a2            | 128 | 0.9867 |  0.8054   |     0.3277     |   0.6573    |  0.7684  |         0.8011         |
|           convit_base           | 64  |
0.9977 | 0.8838 | nan | 0.9506 | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8657 | nan | 0.7297 | 0.6496 | 0.8704 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | 0.8539 | nan | 0.8623 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

../test-dynamo-runner-logs/huggingface_float32.png : ![](https://i.imgur.com/jHusgG1.png)

../test-dynamo-runner-logs/timm_models_float32.png : ![](https://i.imgur.com/dNgQFLH.png)

../test-dynamo-runner-logs/torchbench_float32.png : ![](https://i.imgur.com/9FXvP0n.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio. Caveats: 1) Batch sizes have been reduced to work around OOM errors; work is in progress to reduce the peak memory footprint. 2) Experiments do not cover dynamic shapes. 3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 54/55 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 95%, 52/55 | 100%, 43/43 | 98%, 60/61  |
|     aot_cudagraphs     | 73%, 40/55 | 47%, 20/43  | 39%, 24/61  |
|      aot_nvfuser       | 58%, 32/55 |  2%, 1/43   | 89%, 54/61  |
|        inductor        | 87%, 48/55 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 91%, 50/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
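
Each passrate cell pairs a rounded percentage with the raw count. A minimal sketch of that formatting, assuming a model counts as passing whenever its accuracy status starts with `pass` (so `pass_due_to_skip` would be included); `passrate` is a hypothetical helper, not taken from the benchmark runner:

```python
def passrate(statuses):
    """Format a list of accuracy statuses as 'NN%, passed/total'."""
    total = len(statuses)
    # Assumption: any status beginning with "pass" counts as a pass.
    passed = sum(1 for s in statuses if s.startswith("pass"))
    return f"{passed / total:.0%}, {passed}/{total}"


# 54 passing models and one failure give the same shape as the
# eager/torchbench cell in the table above.
print(passrate(["pass"] * 54 + ["fail_to_run"]))
```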

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.09x    |    1.02x    |    1.00x    |
|      aot_nvfuser       |   1.13x    |    1.12x    |    1.11x    |
|        inductor        |   1.48x    |    1.28x    |    1.25x    |
| inductor_no_cudagraphs |   1.22x    |    1.21x    |    1.24x    |
+------------------------+------------+-------------+-------------+
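
The summary states that models failing accuracy checks are removed before speedups are aggregated. A minimal sketch of a geometric mean computed that way, assuming failed models appear as `0.0` or `nan` as they do in the per-model tables; `geomean_speedup` is a hypothetical name, not the runner's function:

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups, skipping failed models."""
    # Assumption: failures are recorded as 0.0 or NaN and must be dropped,
    # since log(0) is undefined and NaN would poison the mean.
    valid = [s for s in speedups if not math.isnan(s) and s > 0]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Two valid models and one failure: the failure is ignored.
print(f"{geomean_speedup([2.0, 8.0, 0.0]):.2f}x")
```

The geometric mean is the usual choice for ratios like these because a 2x speedup and a 0.5x slowdown cancel out to 1x, which an arithmetic mean would not do.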

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.08    |    2.22     |    1.88     |
|       aot_eager        |    6.92    |    9.05     |    8.70     |
|     aot_cudagraphs     |    8.23    |    18.64    |    15.25    |
|      aot_nvfuser       |   20.32    |    9.60     |    50.01    |
|        inductor        |   62.17    |    52.98    |    73.89    |
| inductor_no_cudagraphs |   64.61    |    49.17    |    72.74    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.32x    |
|      aot_nvfuser       |   0.83x    |    1.08x    |    0.84x    |
|        inductor        |   0.82x    |    0.72x    |    0.97x    |
| inductor_no_cudagraphs |   0.94x    |    0.96x    |    1.02x    |
+------------------------+------------+-------------+-------------+

Warnings

Performance speedup warnings

~~~
+-------------+------------------------+----------+------------------------+
| suite       | name                   | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench  | lennard_jones          | 1.818    | 0.9452                 |
| torchbench  | dlrm                   | 1.0006   | 0.0                    |
| torchbench  | nvidia_deeprecommender | 0.904    | 0.9643                 |
| torchbench  | hf_GPT2_large          | 0.0      | 1.3706                 |
| torchbench  | hf_T5                  | 0.0      | 1.5515                 |
| torchbench  | tacotron2              | 0.0      | 0.9362                 |
| torchbench  | hf_Longformer          | 0.0      | 0.0                    |
| torchbench  | moco                   | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase  | 0.0      | 0.0                    |
| timm_models | resmlp_12_224          | 0.9499   | 0.9719                 |
| timm_models | tnt_s_patch16_224      | 0.0      | 1.5436                 |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+-------------+-----------------------------------+----------+------------------------+
| suite       | name                              | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------+----------+------------------------+
| torchbench  | timm_efficientdet                 | 484.0577 | 488.767                |
| torchbench  | yolov3                            | 419.4861 | 419.8955               |
| torchbench  | hf_T5_large                       | 205.3317 | 202.2279               |
| torchbench  | timm_vision_transformer           | 153.43   | 160.5928               |
| torchbench  | speech_transformer                | 152.3735 | 147.9389               |
| torchbench  | timm_resnest                      | 150.1654 | 145.0659               |
| torchbench  | attention_is_all_you_need_pytorch | 137.7387 | 139.7203               |
| torchbench  | timm_vision_transformer_large     | 126.2802 | 123.9619               |
| torchbench  | dlrm                              | 3.4517   | nan                    |
| torchbench  | hf_GPT2_large                     | nan      | 143.1625               |
| torchbench  | tacotron2                         | nan      | 106.378                |
| torchbench  | hf_T5                             | nan      | 44.804                 |
| torchbench  | hf_Longformer                     | nan      | nan                    |
| torchbench  | moco                              | nan      | nan                    |
| huggingface | XGLMForCausalLM                   | 203.4086 | 201.0863               |
| huggingface | DebertaForMaskedLM                | 163.7151 | 106.9608               |
| huggingface | DebertaForQuestionAnswering       | 152.0741 | 118.2059               |
| huggingface | M2M100ForConditionalGeneration    | 128.0751 | 124.2115               |
| huggingface | AllenaiLongformerBase             | nan      | nan                    |
| timm_models | twins_pcpvt_base                  | 431.1592 | 426.4103               |
| timm_models | coat_lite_mini                    | 362.4216 | 372.6703               |
| timm_models | mobilevit_s                       | 233.8428 | 237.9062               |
| timm_models | eca_halonext26ts                  | 204.8437 | 207.0974               |
| timm_models | sebotnet33ts_256                  | 185.8238 | 191.2608               |
| timm_models | eca_botnext26ts_256               | 179.8768 | 176.7545               |
| timm_models | swin_base_patch4_window7_224      | 177.0112 | 174.7488               |
| timm_models | xcit_large_24_p8_224              | 172.3324 | 164.8544               |
| timm_models | jx_nest_base                      | 155.4547 | 156.5451               |
| timm_models | convnext_base                     | 133.0295 | 129.8216               |
| timm_models | cait_m36_384                      | 132.7509 | 130.12                 |
| timm_models | tnt_s_patch16_224                 | nan      | 50.0197                |
+-------------+-----------------------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 0.9697                 |
| torchbench  | speech_transformer                      | 0.896    | 0.8996                 |
| torchbench  | pytorch_CycleGAN_and_pix2pix            | 0.8848   | 0.9654                 |
| torchbench  | hf_Albert                               | 0.8836   | 1.2215                 |
| torchbench  | mobilenet_v3_large                      | 0.8829   | 0.8964                 |
| torchbench  | hf_T5_large                             | 0.8737   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8616   | 1.0285                 |
| torchbench  | pytorch_unet                            | 0.859    | 0.8608                 |
| torchbench  | resnet50                                | 0.8564   | 0.8913                 |
| torchbench  | densenet121                             | 0.8562   | 0.9307                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | hf_Bart                                 | 0.8503   | 1.1284                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.0952                 |
| torchbench  | resnext50_32x4d                         | 0.8303   | 0.8352                 |
| torchbench  | BERT_pytorch                            | 0.825    | 1.0689                 |
| torchbench  | hf_BigBird                              | 0.8211   | 1.0393                 |
| torchbench  | dcgan                                   | 0.767    | 0.7903                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | timm_vision_transformer                 | 0.7478   | 0.8187                 |
| torchbench  | alexnet                                 | 0.743    | 0.8332                 |
| torchbench  | timm_vovnet                             | 0.7286   | 0.7339                 |
| torchbench  | LearningToPaint                         | 0.7133   | 0.7462                 |
| torchbench  | hf_Bert                                 | 0.7048   | 0.985                  |
| torchbench  | dlrm                                    | 0.7035   | nan                    |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | hf_DistilBert                           | 0.6596   | 0.9466                 |
| torchbench  | vgg16                                   | 0.6471   | 0.6497                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4682   | 0.6183                 |
| torchbench  | pytorch_struct                          | 0.4222   | 0.429                  |
| torchbench  | functorch_dp_cifar10                    | 0.4056   | 0.4212                 |
| torchbench  | hf_Reformer                             | 0.299    | 0.9882                 |
| torchbench  | hf_T5                                   | nan      | 1.1507                 |
| torchbench  | tacotron2                               | nan      | 1.1496                 |
| torchbench  | hf_GPT2_large                           | nan      | 1.1258                 |
| torchbench  | hf_Longformer                           | nan      | nan                    |
| torchbench  | moco                                    | nan      | nan                    |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4039                 |
| huggingface | T5Small                                 | 0.8564   | 1.0758                 |
| huggingface | PegasusForConditionalGeneration         | 0.8436   | 1.0204                 |
| huggingface | AlbertForMaskedLM                       | 0.842    | 1.3737                 |
| huggingface | BigBird                                 | 0.8224   | 1.0108                 |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.1049                 |
| huggingface | DistillGPT2                             | 0.8173   | 0.9383                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | YituTechConvBert                        | 0.808    | 0.8738                 |
| huggingface | BartForConditionalGeneration            | 0.7817   | 0.9515                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.9692                 |
| huggingface | M2M100ForConditionalGeneration          | 0.7712   | 1.016                  |
| huggingface | GoogleFnet                              | 0.7698   | 0.9373                 |
| huggingface | MT5ForConditionalGeneration             | 0.7623   | 0.9396                 |
| huggingface | MegatronBertForQuestionAnswering        | 0.7528   | 0.9646                 |
| huggingface | CamemBert                               | 0.7492   | 0.9186                 |
| huggingface | PLBartForConditionalGeneration          | 0.7397   | 0.9638                 |
| huggingface | PLBartForCausalLM                       | 0.7381   | 0.9055                 |
| huggingface | MBartForConditionalGeneration           | 0.7209   | 0.9059                 |
| huggingface | LayoutLMForSequenceClassification       | 0.7189   | 1.0246                 |
| huggingface | MegatronBertForCausalLM                 | 0.7161   | 0.9248                 |
| huggingface | BartForCausalLM                         | 0.7149   | 0.9466                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.7147   | 0.8647                 |
| huggingface | ElectraForQuestionAnswering             | 0.7054   | 1.0298                 |
| huggingface | DistilBertForQuestionAnswering          | 0.6981   | 0.9303                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977   | 0.946                  |
| huggingface | LayoutLMForMaskedLM                     | 0.695    | 0.9772                 |
| huggingface | MBartForCausalLM                        | 0.6836   | 0.8978                 |
| huggingface | TrOCRForCausalLM                        | 0.6827   | 0.8876                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.6775   | 0.8801                 |
| huggingface | OPTForCausalLM                          | 0.6761   | 0.8847                 |
| huggingface | ElectraForCausalLM                      | 0.6731   | 0.905                  |
| huggingface | DistilBertForMaskedLM                   | 0.6531   | 0.9124                 |
| huggingface | BertForMaskedLM                         | 0.6385   | 0.8993                 |
| huggingface | RobertaForCausalLM                      | 0.6375   | 0.8975                 |
| huggingface | RobertaForQuestionAnswering             | 0.6329   | 0.8939                 |
| huggingface | BertForQuestionAnswering                | 0.6329   | 0.8939                 |
| huggingface | MobileBertForMaskedLM                   | 0.5256   | 0.7111                 |
| huggingface | MobileBertForQuestionAnswering          | 0.4536   | 0.5968                 |
| huggingface | DebertaForMaskedLM                      | 0.4267   | 1.0347                 |
| huggingface | DebertaForQuestionAnswering             | 0.3264   | 1.1588                 |
| huggingface | AllenaiLongformerBase                   | nan      | nan                    |
| timm_models | selecsls42b                             | 0.899    | 0.9192                 |
| timm_models | adv_inception_v3                        | 0.8983   | 0.9073                 |
| timm_models | gluon_inception_v3                      | 0.8983   | 0.9073                 |
| timm_models | inception_v3                            | 0.8983   | 0.9073                 |
| timm_models | mnasnet_100                             | 0.8961   | 0.9077                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8931   | 0.9249                 |
| timm_models | lcnet_050                               | 0.8921   | 0.923                  |
| timm_models | cspdarknet53                            | 0.8835   | 0.8875                 |
| timm_models | res2net50_14w_8s                        | 0.881    | 0.9327                 |
| timm_models | regnety_002                             | 0.8617   | 0.8993                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.8702                 |
| timm_models | pit_b_224                               | 0.8417   | 1.0633                 |
| timm_models | fbnetc_100                              | 0.8416   | 0.8498                 |
| timm_models | sebotnet33ts_256                        | 0.841    | 0.9711                 |
| timm_models | coat_lite_mini                          | 0.8404   | 1.0528                 |
| timm_models | resmlp_12_224                           | 0.8169   | 0.8253                 |
| timm_models | gernet_l                                | 0.7928   | 0.8234                 |
| timm_models | repvgg_a2                               | 0.7684   | 0.8011                 |
| timm_models | convit_base                             | 0.7463   | 0.9008                 |
| timm_models | crossvit_9_240                          | 0.6496   | 0.8704                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.8623                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
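
The rows listed in the warning tables are consistent with a simple threshold filter: speedup below about 0.95x, compilation latency above about 120 seconds, or compression ratio below about 0.9x, with `nan` entries always flagged. A minimal sketch under those assumptions; the threshold values are inferred from the listed rows rather than taken from the runner, and `flag` is a hypothetical helper:

```python
import math

# Assumed thresholds, inferred from the warning tables rather than the runner.
THRESHOLDS = {
    "speedup": (0.95, "below"),
    "compilation_latency": (120.0, "above"),
    "compression_ratio": (0.9, "below"),
}


def flag(metric, value):
    """Return True if a per-model value would land in a warning table."""
    limit, direction = THRESHOLDS[metric]
    if math.isnan(value):
        return True  # a missing measurement is always flagged
    return value < limit if direction == "below" else value > limit


# lennard_jones under inductor_no_cudagraphs is listed as a speedup warning.
print(flag("speedup", 0.9452))
# pytorch_stargan under inductor is not listed as a latency warning.
print(flag("compilation_latency", 107.0355))
```

Under this reading, a model appears in a warning table when either the `inductor` or the `inductor_no_cudagraphs` value trips the filter, which is why a row can show one healthy number next to a flagged one.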

Metrics over time

../test-dynamo-runner-logs/passrate_over_time.png : ![](https://i.imgur.com/FJngzNP.png)

../test-dynamo-runner-logs/geomean_over_time.png : ![](https://i.imgur.com/Ua0P4PA.png)

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | densenet121 | 4 | 1.0028 | 0.9993 | 2.3219 | 1.443 | 5.4438 | 1.3058 | | timm_efficientdet | 1 | 0.9824 | 0.8845 | 0.0 | 0.0 | 4.2758 | 1.526 | | functorch_dp_cifar10 | 64 | 1.0024 | 0.9777 | 2.1532 | 1.1969 | 3.6923 | 1.2407 | | timm_vision_transformer | 8 | 1.0068 | 0.9447 | 1.5339 | 1.3578 | 2.5716 | 1.4121 | | drq | 1 | 1.0315 | 0.8503 | 1.3708 | 1.0638 | 2.4195 | 1.0737 | | resnext50_32x4d | 8 | 1.0007 | 1.079 | 1.2092 | 1.3669 | 2.0959 | 1.2162 | | mobilenet_v3_large | 32 | 1.0078 | 1.1087 | 1.0365 | 1.3781 | 1.9864 | 1.3795 | | BERT_pytorch | 16 | 1.0104 | 0.8854 | 0.0 | 0.0 | 1.9168 | 1.9012 | | resnet18 | 16 | 1.006 | 1.1021 | 1.168 | 1.3958 | 1.8428 | 1.2045 | | pytorch_struct | 200 | 0.9977 | 0.7381 | 0.8734 | 0.8906 | 1.827 | 1.1633 | | lennard_jones | 1000 | 0.976 | 0.8293 | 1.0524 | 1.0142 | 1.818 | 0.9452 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9968 | 0.9377 | 1.2471 | 1.1785 | 1.7636 | 1.3013 | | squeezenet1_1 | 32 | 0.9979 | 0.9923 | 1.0527 | 1.1557 | 1.7406 | 1.2709 | | hf_Albert | 8 | 1.0015 | 0.9976 | 0.752 | 0.0 | 1.6466 | 1.6414 | | dcgan | 32 | 0.9829 | 1.0102 | 1.2585 | 1.1788 | 1.6306 | 1.0725 | | hf_T5_large | 2 | 1.0248 | 0.9068 | 0.0 | 0.0 | 1.5833 | 1.5731 | | speech_transformer | 32 | 1.0038 | 0.9068 | 0.0 | 0.0 | 1.5684 | 1.544 | | shufflenet_v2_x1_0 | 128 | 1.0005 | 1.0532 | 0.8062 | 1.1931 | 1.53 | 1.3689 | | timm_resnest | 32 | 0.9996 | 1.0027 | 0.8044 | 1.1815 | 1.5191 | 1.4517 | | timm_nfnet | 128 | 0.9993 | 0.9999 | 0.0 | 1.2122 | 1.4726 | 1.4222 | | mnasnet1_0 | 32 | 0.9993 | 1.0945 | 0.8568 | 1.2932 | 1.4577 | 1.2734 | | 
mobilenet_v2_quantized_qat | 96 | 1.0016 | 0.978 | 0.0 | 0.0 | 1.4527 | 1.4479 | | mobilenet_v2 | 96 | 0.9998 | 1.0003 | 0.7313 | 1.0443 | 1.4287 | 1.4088 | | hf_GPT2 | 4 | 1.0046 | 0.9827 | 0.738 | 0.0 | 1.4239 | 1.4306 | | soft_actor_critic | 256 | 0.9921 | 0.7715 | 1.1241 | 0.9985 | 1.4185 | 0.9565 | | resnet50_quantized_qat | 32 | 1.0019 | 0.9619 | 0.0 | 0.0 | 1.401 | 1.3947 | | fastNLP_Bert | 6 | 0.9997 | 0.9761 | 0.7528 | 0.0 | 1.3686 | 1.3445 | | timm_efficientnet | 32 | 0.9551 | 0.8076 | 0.7031 | 1.0629 | 1.3353 | 1.2011 | | LearningToPaint | 96 | 1.0048 | 1.0586 | 0.8687 | 1.2057 | 1.2627 | 1.2074 | | pytorch_unet | 1 | 1.0001 | 0.9982 | 0.8464 | 1.0765 | 1.2042 | 1.1861 | | resnet50 | 32 | 0.9994 | 0.9937 | 0.7608 | 1.1612 | 1.204 | 1.1695 | | Super_SloMo | 6 | 1.0003 | 0.9974 | 0.8669 | 0.0 | 1.18 | 1.1645 | | hf_Bart | 4 | 1.0127 | 0.9757 | 0.0 | 0.0 | 1.1721 | 1.1653 | | vgg16 | 64 | 1.0 | 0.999 | 0.859 | 0.9973 | 1.1707 | 1.1652 | | alexnet | 128 | 0.9991 | 0.998 | 0.8031 | 1.0004 | 1.163 | 1.1651 | | hf_Bert | 4 | 1.0214 | 0.944 | 0.7306 | 0.0 | 1.1575 | 1.1396 | | hf_DistilBert | 8 | 0.9999 | 0.9569 | 0.6872 | 0.0 | 1.1481 | 1.1546 | | timm_regnet | 32 | 0.9653 | 0.9617 | 0.7795 | 1.096 | 1.1283 | 1.0941 | | pytorch_stargan | 16 | 0.9997 | 0.983 | 0.866 | 0.9896 | 1.1189 | 1.0913 | | Background_Matting | 4 | 1.0006 | 1.0218 | 0.866 | 1.0816 | 1.1153 | 1.1069 | | hf_Reformer | 4 | 0.9961 | 0.0 | 0.9267 | 0.0 | 1.1095 | 1.1343 | | hf_BigBird | 2 | 0.9915 | 0.939 | 0.9612 | 0.0 | 1.0921 | 1.0042 | | yolov3 | 16 | 1.0 | 0.9954 | 0.7893 | 1.1839 | 1.0795 | 1.0647 | | attention_is_all_you_need_pytorch | 256 | 0.9999 | 0.9726 | 0.0 | 0.0 | 1.047 | 1.033 | | timm_vision_transformer_large | 8 | 0.9982 | 0.9912 | 0.0 | 0.9805 | 1.044 | 1.0331 | | tts_angular | 64 | 0.9937 | 0.964 | 0.9933 | 1.0231 | 1.0136 | 1.0218 | | timm_vovnet | 32 | 0.9102 | 0.9045 | 0.7132 | 0.9774 | 1.0069 | 1.0176 | | dlrm | 2048 | 1.0064 | 1.0734 | 0.0 | 0.0 | 1.0006 | 0.0 | | demucs 
| 4 | 0.9997 | 0.9998 | 0.999 | 0.9999 | 1.0 | 1.0007 | | nvidia_deeprecommender | 256 | 0.9994 | 0.9628 | 0.585 | 0.942 | 0.904 | 0.9643 | | hf_GPT2_large | 4 | 1.0004 | 0.9805 | 0.0 | 0.0 | 0.0 | 1.3706 | | hf_T5 | 8 | 1.0002 | 0.9932 | 0.0 | 0.0 | 0.0 | 1.5515 | | tacotron2 | 64 | 0.981 | 0.8581 | 0.0 | 0.0 | 0.0 | 0.9362 | | hf_Longformer | 2 | 0.9701 | 0.9013 | 0.8196 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | 
fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | 
pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_to_run | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | 0.0000 | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | timm_efficientdet | 1 | 19.5344 | 38.4011 | nan | nan | 484.0577 | 488.767 | | yolov3 | 16 | 2.7711 | 8.6894 | 11.9084 | 43.4046 | 419.4861 | 419.8955 | | hf_T5_large | 
2 | 13.2998 | 41.15 | nan | nan | 205.3317 | 202.2279 | | timm_vision_transformer | 8 | 0.7808 | 4.1474 | 5.8215 | 9.3655 | 153.43 | 160.5928 | | speech_transformer | 32 | 1.5424 | 8.2938 | nan | nan | 152.3735 | 147.9389 | | timm_resnest | 32 | 0.5383 | 2.6812 | 3.7424 | 35.1306 | 150.1654 | 145.0659 | | attention_is_all_you_need_pytorch | 256 | 1.0734 | 7.1292 | nan | nan | 137.7387 | 139.7203 | | timm_vision_transformer_large | 8 | 2.223 | 13.8751 | nan | 24.351 | 126.2802 | 123.9619 | | pytorch_stargan | 16 | 0.3789 | 2.3643 | 3.1326 | 3.9188 | 107.0355 | 104.0851 | | pytorch_struct | 200 | 0.2366 | 0.7827 | 1.3456 | 4.0715 | 99.505 | 98.1575 | | BERT_pytorch | 16 | 1.4194 | 7.614 | nan | nan | 92.0393 | 92.0811 | | fastNLP_Bert | 6 | 1.4306 | 6.6169 | 10.0451 | nan | 65.652 | 63.418 | | hf_GPT2 | 4 | 1.2488 | 6.1179 | 8.8738 | nan | 63.5447 | 63.521 | | hf_Bart | 4 | 1.3924 | 8.089 | nan | nan | 49.9676 | 49.9717 | | densenet121 | 4 | 1.9897 | 13.3477 | 20.1678 | 88.3763 | 45.0957 | 43.7205 | | mobilenet_v3_large | 32 | 0.8275 | 4.8204 | 6.7604 | 53.5764 | 44.9158 | 46.9735 | | hf_Albert | 8 | 1.0066 | 5.8746 | 8.5532 | nan | 41.987 | 41.132 | | hf_BigBird | 2 | 7.3861 | 13.5387 | 29.953 | nan | 41.2734 | 26.6352 | | resnet50_quantized_qat | 32 | 1.061 | 9.0448 | nan | nan | 39.8902 | 40.3176 | | hf_Bert | 4 | 1.312 | 6.2693 | 8.8293 | nan | 39.8395 | 38.7377 | | timm_regnet | 32 | 2.173 | 8.4238 | 20.7651 | 47.6157 | 37.2439 | 35.16 | | hf_Reformer | 4 | 2.3483 | nan | 9.1124 | nan | 36.065 | 30.7238 | | timm_efficientnet | 32 | 1.6787 | 6.665 | 16.1146 | 52.4346 | 34.2419 | 34.4653 | | mnasnet1_0 | 32 | 0.7461 | 4.4921 | 6.4014 | 30.714 | 31.0909 | 30.7546 | | resnet50 | 32 | 0.7937 | 4.9477 | 6.925 | 32.2699 | 31.0875 | 29.832 | | hf_DistilBert | 8 | 0.4278 | 3.0834 | 6.0696 | nan | 30.4362 | 29.5285 | | resnext50_32x4d | 8 | 0.8239 | 4.9203 | 6.8365 | 28.5464 | 30.2931 | 30.0266 | | timm_vovnet | 32 | 1.4222 | 4.5909 | 10.441 | 23.5649 | 30.0127 | 29.7463 
| | timm_nfnet | 128 | 1.8844 | 7.7171 | nan | 29.8502 | 29.8712 | 28.8763 | | mobilenet_v2_quantized_qat | 96 | 1.1759 | 8.8754 | nan | nan | 27.0997 | 27.2946 | | functorch_dp_cifar10 | 64 | 0.3232 | 1.9699 | 2.8309 | 5.5366 | 26.1947 | 24.9937 | | resnet18 | 16 | 0.3858 | 1.8912 | 2.6752 | 17.5591 | 23.2902 | 20.4971 | | shufflenet_v2_x1_0 | 128 | 0.8656 | 5.4261 | 7.6883 | 26.8524 | 18.5748 | 17.9867 | | Super_SloMo | 6 | 0.9695 | 5.0542 | 6.7627 | nan | 17.3419 | 16.4668 | | Background_Matting | 4 | 0.6979 | 4.5367 | 6.7144 | 29.2894 | 16.7635 | 16.0163 | | mobilenet_v2 | 96 | 0.7343 | 4.4782 | 6.6781 | 37.1045 | 16.669 | 16.3002 | | pytorch_unet | 1 | 0.4223 | 2.1063 | 2.9975 | 19.6418 | 8.2272 | 7.7305 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.3535 | 2.202 | 3.0539 | 3.8439 | 8.1719 | 8.0926 | | LearningToPaint | 96 | 0.4124 | 1.9651 | 2.8324 | 23.8303 | 7.2019 | 6.8944 | | squeezenet1_1 | 32 | 0.2563 | 0.9557 | 1.3863 | 4.5328 | 4.0598 | 3.8616 | | nvidia_deeprecommender | 256 | 0.1895 | 0.4298 | 0.6854 | 2.4393 | 4.0142 | 3.7143 | | drq | 1 | 0.1402 | 0.4424 | 0.8198 | 3.4662 | 3.7694 | 3.1945 | | vgg16 | 64 | 0.1869 | 0.6441 | 1.0464 | 2.4609 | 3.6811 | 3.2422 | | dlrm | 2048 | 0.4444 | 0.8198 | nan | nan | 3.4517 | nan | | soft_actor_critic | 256 | 0.2031 | 0.3372 | 0.4948 | 1.5206 | 3.0611 | 2.6231 | | alexnet | 128 | 0.1421 | 0.4161 | 0.6606 | 2.3558 | 2.9654 | 2.6911 | | dcgan | 32 | 0.1641 | 0.4494 | 0.6683 | 3.7309 | 2.678 | 2.4053 | | lennard_jones | 1000 | 0.1381 | 0.289 | 0.4429 | 1.0648 | 1.9631 | 1.736 | | tts_angular | 64 | 0.2061 | 0.2786 | 0.3976 | 1.0162 | 1.8605 | 1.6749 | | demucs | 4 | 0.2929 | 0.2934 | 0.2977 | 0.2969 | 0.2011 | 0.1967 | | hf_GPT2_large | 4 | 4.9818 | 19.3363 | nan | nan | nan | 143.1625 | | tacotron2 | 64 | 16.7009 | 28.6252 | nan | nan | nan | 106.378 | | hf_T5 | 8 | 2.1787 | 9.4406 | nan | nan | nan | 44.804 | | hf_Longformer | 2 | 5.7342 | 13.862 | 78.3703 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan 
| nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| name                              | bs   | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| resnet50_quantized_qat            | 32   | 0.9967 | 0.9152    | nan            | nan         | 1.4314   | 1.4314                 |
| mobilenet_v2_quantized_qat        | 96   | 0.9957 | 0.8276    | nan            | nan         | 1.4036   | 1.4036                 |
| timm_efficientnet                 | 32   | 0.9937 | 0.7666    | 0.2637         | 0.7837      | 1.3107   | 1.3377                 |
| Super_SloMo                       | 6    | 1.0024 | 0.9527    | 0.363          | nan         | 1.1858   | 1.1912                 |
| timm_efficientdet                 | 1    | 1.0111 | 0.823     | nan            | nan         | 1.1165   | 1.1428                 |
| mobilenet_v2                      | 96   | 0.9928 | 0.7624    | 0.3062         | 0.7638      | 1.1005   | 1.1105                 |
| squeezenet1_1                     | 32   | 0.9749 | 0.8159    | 0.3374         | 0.9742      | 1.0823   | 1.1267                 |
| timm_nfnet                        | 128  | 0.9358 | 0.8936    | nan            | 0.9478      | 1.0219   | 1.0495                 |
| demucs                            | 4    | 0.9886 | 0.9886    | 0.9886         | 0.9886      | 0.9886   | 0.9886                 |
| tts_angular                       | 64   | 0.9884 | 0.9884    | 0.9829         | 0.9884      | 0.983    | 0.9884                 |
| shufflenet_v2_x1_0                | 128  | 0.9739 | 0.8944    | 0.35           | 0.8662      | 0.9791   | 1.0072                 |
| hf_GPT2                           | 4    | 0.9548 | 0.906     | 0.3701         | nan         | 0.9703   | 1.1094                 |
| timm_regnet                       | 32   | 0.9985 | 0.8614    | 0.3327         | 0.8784      | 0.9284   | 0.9323                 |
| Background_Matting                | 4    | 0.9998 | 0.9492    | 0.3596         | 0.9749      | 0.9212   | 0.9238                 |
| yolov3                            | 16   | 0.9957 | 0.844     | 0.334          | 0.8814      | 0.9151   | 0.919                  |
| pytorch_stargan                   | 16   | 0.9975 | 1.0179    | 0.4129         | 1.0085      | 0.9023   | 0.9928                 |
| timm_resnest                      | 32   | 0.9935 | 0.8793    | 0.3235         | 0.8021      | 0.8982   | 0.9697                 |
| speech_transformer                | 32   | 0.9982 | 0.9159    | nan            | nan         | 0.896    | 0.8996                 |
| pytorch_CycleGAN_and_pix2pix      | 1    | 0.9986 | 0.9173    | 0.3919         | 0.9169      | 0.8848   | 0.9654                 |
| hf_Albert                         | 8    | 0.9333 | 0.9333    | 0.2846         | nan         | 0.8836   | 1.2215                 |
| mobilenet_v3_large                | 32   | 0.9878 | 0.8563    | 0.3277         | 0.8681      | 0.8829   | 0.8964                 |
| hf_T5_large                       | 2    | 0.922  | 0.8673    | nan            | nan         | 0.8737   | 0.922                  |
| timm_vision_transformer_large     | 8    | 0.9997 | 0.8415    | nan            | 0.801       | 0.8616   | 1.0285                 |
| pytorch_unet                      | 1    | 0.9985 | 0.8521    | 0.3441         | 0.8496      | 0.859    | 0.8608                 |
| resnet50                          | 32   | 0.9942 | 0.8719    | 0.3368         | 0.797       | 0.8564   | 0.8913                 |
| densenet121                       | 4    | 0.9904 | 0.8812    | 0.3435         | 0.8551      | 0.8562   | 0.9307                 |
| mnasnet1_0                        | 32   | 0.9869 | 0.8985    | 0.3331         | 0.8263      | 0.8531   | 0.8659                 |
| hf_Bart                           | 4    | 0.9617 | 0.8598    | nan            | nan         | 0.8503   | 1.1284                 |
| fastNLP_Bert                      | 6    | 1.0011 | 0.9152    | 0.3385         | nan         | 0.8354   | 1.0952                 |
| resnext50_32x4d                   | 8    | 0.9954 | 0.8671    | 0.3596         | 0.8203      | 0.8303   | 0.8352                 |
| BERT_pytorch                      | 16   | 1.0    | 0.8995    | nan            | nan         | 0.825    | 1.0689                 |
| hf_BigBird                        | 2    | 0.9604 | 0.9604    | 0.4301         | nan         | 0.8211   | 1.0393                 |
| dcgan                             | 32   | 0.9754 | 0.7634    | 0.4581         | 0.7634      | 0.767    | 0.7903                 |
| drq                               | 1    | 0.987  | 0.8777    | 0.4252         | 0.8772      | 0.7632   | 0.8778                 |
| soft_actor_critic                 | 256  | 0.9997 | 0.9637    | 0.4355         | 0.9555      | 0.75     | 0.9991                 |
| timm_vision_transformer           | 8    | 0.9943 | 0.8835    | 0.3305         | 0.8104      | 0.7478   | 0.8187                 |
| alexnet                           | 128  | 0.9542 | 0.745     | 0.4163         | 0.7455      | 0.743    | 0.8332                 |
| timm_vovnet                       | 32   | 0.9933 | 0.7603    | 0.3201         | 0.7741      | 0.7286   | 0.7339                 |
| LearningToPaint                   | 96   | 0.9442 | 0.6896    | 0.3385         | 0.6503      | 0.7133   | 0.7462                 |
| hf_Bert                           | 4    | 0.9683 | 0.9011    | 0.3525         | nan         | 0.7048   | 0.985                  |
| dlrm                              | 2048 | 0.7302 | 0.7305    | nan            | nan         | 0.7035   | nan                    |
| resnet18                          | 16   | 0.9831 | 0.7792    | 0.3593         | 0.6971      | 0.6902   | 0.7049                 |
| hf_DistilBert                     | 8    | 0.9211 | 0.9047    | 0.3212         | nan         | 0.6596   | 0.9466                 |
| vgg16                             | 64   | 0.9944 | 0.6638    | 0.3214         | 0.6639      | 0.6471   | 0.6497                 |
| lennard_jones                     | 1000 | 0.9995 | 0.9995    | 0.3711         | 1.0947      | 0.5646   | 0.9989                 |
| nvidia_deeprecommender            | 256  | 0.5598 | 0.5598    | 0.4624         | 0.5598      | 0.5598   | 0.5598                 |
| attention_is_all_you_need_pytorch | 256  | 0.9476 | 0.9243    | nan            | nan         | 0.4682   | 0.6183                 |
| pytorch_struct                    | 200  | 1.0    | 0.5079    | 0.4824         | 0.5079      | 0.4222   | 0.429                  |
| functorch_dp_cifar10              | 64   | 0.9961 | 0.8224    | 0.4456         | 0.8227      | 0.4056   | 0.4212                 |
| hf_Reformer                       | 4    | 0.3011 | nan       | 0.2397         | nan         | 0.299    | 0.9882                 |
| hf_T5                             | 8    | 0.9527 | 0.9415    | nan            | nan         | nan      | 1.1507                 |
| tacotron2                         | 64   | 0.9906 | 1.093     | nan            | nan         | nan      | 1.1496                 |
| hf_GPT2_large                     | 4    | 0.936  | 0.8833    | nan            | nan         | nan      | 1.1258                 |
| hf_Longformer                     | 2    | 0.9603 | 0.9603    | 0.2945         | nan         | nan      | nan                    |
| moco                              | 0    | nan    | nan       | nan            | nan         | nan      | nan                    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~
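
For readers interpreting the ratios above, here is a minimal sketch of how a compression ratio maps to a relative peak-memory figure. It assumes the ratio is defined as eager peak memory divided by the backend's peak memory (so values above 1 mean the backend used less memory than eager); that definition is an assumption on my part, and the helper name `relative_peak_memory` is hypothetical rather than part of the benchmark scripts.

```python
import math


def relative_peak_memory(compression_ratio):
    """Backend peak memory as a fraction of eager peak memory.

    Assumes compression_ratio = eager_peak / backend_peak (an assumption,
    not confirmed by the dashboard). Returns nan for missing (nan) or
    non-positive ratios, which the tables use for runs without a measurement.
    """
    if math.isnan(compression_ratio) or compression_ratio <= 0:
        return float("nan")
    return 1.0 / compression_ratio


# Example with the inductor entry for resnet50_quantized_qat (ratio 1.4314).
fraction = relative_peak_memory(1.4314)
print(f"backend peak is {fraction:.1%} of eager peak")
```

Under that assumption, a ratio below 1 (for example the `aot_cudagraphs` column) means the backend's peak footprint is larger than eager's.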

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name                                    | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| YituTechConvBert                        | 1   | 1.0285 | 0.9414    | 0.0            | 0.0         | 3.7345   | 1.5254                 |
| CamemBert                               | 1   | 1.0493 | 0.9732    | 1.3251         | 0.0         | 2.3889   | 1.5405                 |
| MT5ForConditionalGeneration             | 8   | 1.0272 | 0.9263    | 0.0            | 0.0         | 2.2531   | 1.9848                 |
| DistillGPT2                             | 1   | 1.0322 | 0.9458    | 1.0657         | 0.0         | 2.099    | 1.9009                 |
| MobileBertForMaskedLM                   | 32  | 1.023  | 0.9232    | 0.0            | 0.0         | 1.9829   | 1.574                  |
| GoogleFnet                              | 1   | 0.9985 | 0.8173    | 0.9815         | 1.1247      | 1.9188   | 1.1214                 |
| GPT2ForSequenceClassification           | 4   | 1.0002 | 0.9779    | 0.0            | 0.0         | 1.6662   | 1.6568                 |
| T5ForConditionalGeneration              | 4   | 1.0029 | 0.9667    | 0.0            | 0.0         | 1.4388   | 1.4275                 |
| M2M100ForConditionalGeneration          | 8   | 1.0412 | 0.8942    | 1.0013         | 0.0         | 1.4178   | 1.4085                 |
| MobileBertForQuestionAnswering          | 64  | 1.024  | 0.9187    | 0.0            | 0.0         | 1.4036   | 1.2789                 |
| ElectraForCausalLM                      | 32  | 1.0004 | 0.9312    | 0.0            | 0.0         | 1.3702   | 1.4028                 |
| ElectraForQuestionAnswering             | 64  | 1.0005 | 0.9844    | 0.0            | 0.0         | 1.3541   | 1.3368                 |
| AlbertForQuestionAnswering              | 4   | 1.0002 | 1.0018    | 0.0            | 0.0         | 1.2567   | 1.2522                 |
| AlbertForMaskedLM                       | 4   | 0.9993 | 0.9996    | 0.0            | 0.0         | 1.25     | 1.2519                 |
| LayoutLMForSequenceClassification       | 16  | 1.0001 | 0.9892    | 0.7379         | 0.0         | 1.2473   | 1.2318                 |
| T5Small                                 | 1   | 1.0191 | 0.9543    | 0.0            | 0.0         | 1.2442   | 1.2308                 |
| PLBartForConditionalGeneration          | 16  | 1.0124 | 0.9613    | 0.0            | 0.0         | 1.1874   | 1.188                  |
| OPTForCausalLM                          | 32  | 1.0037 | 0.932     | 0.0            | 0.0         | 1.1825   | 1.1983                 |
| XGLMForCausalLM                         | 8   | 1.0128 | 0.9394    | 0.0            | 0.0         | 1.1706   | 1.1753                 |
| LayoutLMForMaskedLM                     | 16  | 1.0002 | 0.971     | 0.0            | 0.0         | 1.1633   | 1.1716                 |
| DistilBertForQuestionAnswering          | 64  | 0.9997 | 0.985     | 0.7131         | 0.0         | 1.1444   | 1.1262                 |
| RobertaForCausalLM                      | 64  | 1.0004 | 0.9637    | 0.7465         | 0.0         | 1.1133   | 1.1212                 |
| Speech2Text2ForCausalLM                 | 128 | 0.9989 | 0.9259    | 0.6593         | 0.0         | 1.11     | 1.1484                 |
| BigBird                                 | 1   | 0.9894 | 0.937     | 0.991          | 0.0         | 1.1023   | 1.0034                 |
| BartForCausalLM                         | 4   | 1.0007 | 0.9668    | 0.0            | 0.0         | 1.0962   | 1.1067                 |
| BartForConditionalGeneration            | 2   | 1.0009 | 0.9887    | 0.0            | 0.0         | 1.0962   | 1.0896                 |
| MegatronBertForQuestionAnswering        | 16  | 1.038  | 1.0104    | 0.7572         | 0.0         | 1.0947   | 1.0716                 |
| MBartForConditionalGeneration           | 16  | 1.0102 | 0.9766    | 0.0            | 0.0         | 1.0887   | 1.0775                 |
| DebertaForMaskedLM                      | 4   | 0.9321 | 0.8111    | 0.7317         | 0.0         | 1.0885   | 1.0732                 |
| MegatronBertForCausalLM                 | 16  | 1.0332 | 1.0027    | 0.7578         | 0.0         | 1.087    | 1.0785                 |
| PegasusForConditionalGeneration         | 16  | 1.0101 | 0.9819    | 0.7569         | 0.0         | 1.0857   | 1.0825                 |
| BertForQuestionAnswering                | 128 | 0.9997 | 0.9882    | 0.0            | 0.0         | 1.0722   | 1.0661                 |
| RobertaForQuestionAnswering             | 128 | 1.0002 | 0.9942    | 0.0            | 0.0         | 1.0696   | 1.0709                 |
| BlenderbotSmallForConditionalGeneration | 64  | 1.0005 | 0.9265    | 0.0            | 0.0         | 1.0628   | 1.0696                 |
| DebertaForQuestionAnswering             | 8   | 0.9976 | 0.9917    | 0.6821         | 0.0         | 1.0623   | 1.2025                 |
| DistilBertForMaskedLM                   | 64  | 1.0    | 0.9519    | 0.7122         | 0.0         | 1.0362   | 1.0546                 |
| BertForMaskedLM                         | 64  | 1.0003 | 0.9524    | 0.7302         | 0.0         | 1.0338   | 1.0381                 |
| PLBartForCausalLM                       | 32  | 1.0055 | 0.9348    | 0.7321         | 0.0         | 1.0224   | 1.0494                 |
| BlenderbotSmallForCausalLM              | 64  | 1.0022 | 0.9105    | 0.6827         | 0.0         | 1.0131   | 1.0345                 |
| TrOCRForCausalLM                        | 32  | 1.0017 | 0.9556    | 0.0            | 0.0         | 0.9981   | 1.0096                 |
| MBartForCausalLM                        | 32  | 1.0013 | 0.9555    | 0.0            | 0.0         | 0.9967   | 1.0069                 |
| PegasusForCausalLM                      | 32  | 0.9998 | 0.953     | 0.7325         | 0.0         | 0.9888   | 1.0008                 |
| AllenaiLongformerBase                   | 1   | 0.953  | 0.7915    | 0.7884         | 0.0         | 0.0      | 0.0                    |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| name                                    | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor    | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| GoogleFnet                              | 1  | pass  | pass      | pass           | pass        | pass        | pass                   |
| MT5ForConditionalGeneration             | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| Speech2Text2ForCausalLM                 | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| AlbertForMaskedLM                       | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| AlbertForQuestionAnswering              | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| BartForCausalLM                         | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| BartForConditionalGeneration            | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| BlenderbotSmallForConditionalGeneration | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| GPT2ForSequenceClassification           | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| MBartForCausalLM                        | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| MobileBertForMaskedLM                   | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| RobertaForCausalLM                      | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| MobileBertForQuestionAnswering          | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| OPTForCausalLM                          | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| T5ForConditionalGeneration              | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| T5Small                                 | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| TrOCRForCausalLM                        | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| XGLMForCausalLM                         | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| XLNetLMHeadModel                        | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| YituTechConvBert                        | 1  | pass  | pass      | fail_to_run    | fail_to_run | pass        | pass                   |
| BertForMaskedLM                         | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| RobertaForQuestionAnswering             | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| PegasusForConditionalGeneration         | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| DistillGPT2                             | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| BertForQuestionAnswering                | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| BigBird                                 | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| BlenderbotSmallForCausalLM              | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| CamemBert                               | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| DebertaForMaskedLM                      | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| DebertaForQuestionAnswering             | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| DistilBertForMaskedLM                   | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| PegasusForCausalLM                      | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| DistilBertForQuestionAnswering          | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| ElectraForCausalLM                      | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| ElectraForQuestionAnswering             | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| LayoutLMForMaskedLM                     | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| LayoutLMForSequenceClassification       | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| M2M100ForConditionalGeneration          | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| MegatronBertForCausalLM                 | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| MegatronBertForQuestionAnswering        | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| PLBartForCausalLM                       | 1  | pass  | pass      | pass           | fail_to_run | pass        | pass                   |
| AllenaiLongformerBase                   | 1  | pass  | pass      | pass           | fail_to_run | fail_to_run | fail_to_run            |
| MBartForConditionalGeneration           | 1  | pass  | pass      | fail_to_run    | fail_to_run | fail_to_run | fail_to_run            |
| PLBartForConditionalGeneration          | 1  | pass  | pass      | fail_to_run    | fail_to_run | fail_to_run | fail_to_run            |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name                                    | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| XGLMForCausalLM                         | 8   | 2.2364 | 12.2125   | nan            | nan         | 203.4086 | 201.0863               |
| DebertaForMaskedLM                      | 4   | 4.684  | 11.0814   | 44.7781        | nan         | 163.7151 | 106.9608               |
| DebertaForQuestionAnswering             | 8   | 4.5483 | 11.6349   | 43.993         | nan         | 152.0741 | 118.2059               |
| M2M100ForConditionalGeneration          | 8   | 2.7543 | 15.4794   | 23.643         | nan         | 128.0751 | 124.2115               |
| YituTechConvBert                        | 1   | 2.0946 | 9.5284    | nan            | nan         | 115.4649 | 119.3641               |
| MT5ForConditionalGeneration             | 8   | 3.4744 | 13.6659   | nan            | nan         | 90.4534  | 91.1223                |
| MobileBertForMaskedLM                   | 32  | 7.7855 | 27.1609   | nan            | nan         | 88.9601  | 85.7795                |
| MobileBertForQuestionAnswering          | 64  | 7.9327 | 27.5186   | nan            | nan         | 74.7874  | 71.876                 |
| MegatronBertForCausalLM                 | 16  | 3.0219 | 12.5327   | 19.6699        | nan         | 61.5191  | 59.8845                |
| MegatronBertForQuestionAnswering        | 16  | 3.0691 | 13.2977   | 19.1034        | nan         | 60.2609  | 58.2808                |
| LayoutLMForSequenceClassification       | 16  | 1.6734 | 6.6917    | 10.1343        | nan         | 59.7267  | 60.187                 |
| T5ForConditionalGeneration              | 4   | 2.1399 | 8.8895    | nan            | nan         | 58.3394  | 57.0848                |
| PegasusForConditionalGeneration         | 16  | 2.6227 | 14.7158   | 24.2283        | nan         | 58.1897  | 54.3056                |
| BartForConditionalGeneration            | 2   | 2.8248 | 15.0065   | nan            | nan         | 57.0652  | 54.7753                |
| T5Small                                 | 1   | 2.1902 | 8.9903    | nan            | nan         | 55.4364  | 53.2137                |
| MBartForConditionalGeneration           | 16  | 2.7868 | 15.512    | nan            | nan         | 54.3119  | 53.1455                |
| PLBartForConditionalGeneration          | 16  | 1.3887 | 8.298     | nan            | nan         | 47.5246  | 46.3964                |
| BlenderbotSmallForConditionalGeneration | 64  | 1.7139 | 10.0168   | nan            | nan         | 43.6075  | 41.5748                |
| BigBird                                 | 1   | 7.296  | 13.5333   | 29.6711        | nan         | 40.7238  | 26.8699                |
| ElectraForCausalLM                      | 32  | 1.2891 | 6.2441    | nan            | nan         | 40.6712  | 39.969                 |
| DistillGPT2                             | 1   | 0.6422 | 3.1221    | 4.4918         | nan         | 33.8479  | 32.6814                |
| LayoutLMForMaskedLM                     | 16  | 1.6131 | 6.6316    | nan            | nan         | 32.8126  | 32.5964                |
| BertForMaskedLM                         | 64  | 1.2973 | 6.3901    | 9.4361         | nan         | 32.777   | 31.6779                |
| ElectraForQuestionAnswering             | 64  | 1.3222 | 6.4111    | nan            | nan         | 32.5117  | 31.4854                |
| GPT2ForSequenceClassification           | 4   | 1.2751 | 6.1953    | nan            | nan         | 32.0765  | 31.1399                |
| RobertaForCausalLM                      | 64  | 1.3104 | 6.1902    | 9.2915         | nan         | 28.0396  | 27.4422                |
| BertForQuestionAnswering                | 128 | 1.3166 | 6.2802    | nan            | nan         | 27.7294  | 27.1936                |
| PegasusForCausalLM                      | 32  | 1.0161 | 5.707     | 8.775          | nan         | 27.1087  | 25.1376                |
| MBartForCausalLM                        | 32  | 0.9522 | 5.5767    | nan            | nan         | 25.4243  | 24.6154                |
| RobertaForQuestionAnswering             | 128 | 1.3205 | 6.387     | nan            | nan         | 24.5494  | 23.8515                |
| TrOCRForCausalLM                        | 32  | 0.9241 | 5.5701    | nan            | nan         | 24.4333  | 24.1797                |
| BartForCausalLM                         | 4   | 1.0079 | 5.6176    | nan            | nan         | 24.3593  | 23.6588                |
| AlbertForMaskedLM                       | 4   | 1.1157 | 5.8703    | nan            | nan         | 23.8611  | 23.0601                |
| GoogleFnet                              | 1   | 0.7904 | 3.3495    | 10.4595        | 9.6049      | 23.8114  | 16.1369                |
| BlenderbotSmallForCausalLM              | 64  | 0.6439 | 3.7467    | 5.6889         | nan         | 23.625   | 22.6972                |
| DistilBertForMaskedLM                   | 64  | 0.4729 | 2.9552    | 5.8879         | nan         | 23.0127  | 22.634                 |
| AlbertForQuestionAnswering              | 4   | 1.1461 | 5.9483    | nan            | nan         | 22.7287  | 21.5179                |
| OPTForCausalLM                          | 32  | 1.0353 | 5.881     | nan            | nan         | 21.8562  | 20.7457                |
| DistilBertForQuestionAnswering          | 64  | 0.4816 | 3.0171    | 5.9235         | nan         | 21.8186  | 22.1039                |
| CamemBert                               | 1   | 1.38   | 6.1479    | 8.5874         | nan         | 21.7413  | 21.2151                |
| Speech2Text2ForCausalLM                 | 128 | 0.577  | 2.9045    | 4.6098         | nan         | 19.6271  | 18.24                  |
| PLBartForCausalLM                       | 32  | 0.4938 | 2.9552    | 4.3734         | nan         | 18.8954  | 18.2071                |
| AllenaiLongformerBase                   | 1   | 5.9078 | 14.4262   | 80.0409        | nan         | nan      | nan                    |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name                                    | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| GPT2ForSequenceClassification           | 4   | 0.9343 | 0.9093    | nan            | nan         | 1.0596   | 1.1223                 |
| AlbertForQuestionAnswering              | 4   | 1.0    | 0.9425    | nan            | nan         | 0.8646   | 1.4039                 |
| T5Small                                 | 1   | 1.0    | 0.9155    | nan            | nan         | 0.8564   | 1.0758                 |
| PegasusForConditionalGeneration         | 16  | 0.9985 | 0.9629    | 0.3704         | nan         | 0.8436   | 1.0204                 |
| AlbertForMaskedLM                       | 4   | 1.0    | 0.9255    | nan            | nan         | 0.842    | 1.3737                 |
| BigBird                                 | 1   | 0.999  | 0.9542    | 0.4215         | nan         | 0.8224   | 1.0108                 |
| T5ForConditionalGeneration              | 4   | 1.0    | 0.9597    | nan            | nan         | 0.8215   | 1.1049                 |
| DistillGPT2                             | 1   | 0.9984 | 0.8218    | 0.3795         | nan         | 0.8173   | 0.9383                 |
| XGLMForCausalLM                         | 8   | 0.9848 | 0.9137    | nan            | nan         | 0.8157   | 0.9642                 |
| YituTechConvBert                        | 1   | 0.9858 | 0.8198    | nan            | nan         | 0.808    | 0.8738                 |
| BartForConditionalGeneration            | 2   | 1.0    | 0.893     | nan            | nan         | 0.7817   | 0.9515                 |
| PegasusForCausalLM                      | 32  | 0.9593 | 0.9232    | 0.3909         | nan         | 0.7774   | 0.9692                 |
| M2M100ForConditionalGeneration          | 8   | 1.007  | 0.9507    | 0.3799         | nan         | 0.7712   | 1.016                  |
| GoogleFnet                              | 1   | 0.9983 | 0.9453    | 0.3715         | 1.0813      | 0.7698   | 0.9373                 |
| MT5ForConditionalGeneration             | 8   | 1.0034 | 0.8861    | nan            | nan         | 0.7623   | 0.9396                 |
| MegatronBertForQuestionAnswering        | 16  | 1.0    | 0.8671    | 0.3483         | nan         | 0.7528   | 0.9646                 |
| CamemBert                               | 1   | 0.998  | 0.8252    | 0.3614         | nan         | 0.7492   | 0.9186                 |
| PLBartForConditionalGeneration          | 16  | 1.0    | 0.8743    | nan            | nan         | 0.7397   | 0.9638                 |
| PLBartForCausalLM                       | 32  | 0.9999 | 0.861     | 0.3948         | nan         | 0.7381   | 0.9055                 |
| MBartForConditionalGeneration           | 16  | 1.0    | 0.8583    | nan            | nan         | 0.7209   | 0.9059                 |
| LayoutLMForSequenceClassification       | 16  | 1.0    | 0.9348    | 0.3324         | nan         | 0.7189   | 1.0246                 |
| MegatronBertForCausalLM                 | 16  | 0.9995 | 0.8826    | 0.352          | nan         | 0.7161   | 0.9248                 |
| BartForCausalLM                         | 4   | 1.0    | 0.9121    | nan            | nan         | 0.7149   | 0.9466                 |
| BlenderbotSmallForCausalLM              | 64  | 1.0    | 0.8401    | 0.3879         | nan         | 0.7147   | 0.8647                 |
| ElectraForQuestionAnswering             | 64  | 1.0    | 0.9524    | nan            | nan         | 0.7054   | 1.0298                 |
| DistilBertForQuestionAnswering          | 64  | 1.0    | 0.9373    | 0.3178         | nan         | 0.6981   | 0.9303                 |
| BlenderbotSmallForConditionalGeneration | 64  | 1.0    | 0.8975    | nan            | nan         | 0.6977   | 0.946                  |
| LayoutLMForMaskedLM                     | 16  | 1.0    | 0.9409    | nan            | nan         | 0.695    | 0.9772                 |
| MBartForCausalLM                        | 32  | 0.9999 | 0.89      | nan            | nan         | 0.6836   | 0.8978                 |
| TrOCRForCausalLM                        | 32  | 0.9999 | 0.8898    | nan            | nan         | 0.6827   | 0.8876                 |
| Speech2Text2ForCausalLM                 | 128 | 0.9552 | 0.8765    | 0.3524         | nan         | 0.6775   | 0.8801                 |
| OPTForCausalLM                          | 32  | 0.9982 | 0.8655    | nan            | nan         | 0.6761   | 0.8847                 |
| ElectraForCausalLM                      | 32  | 0.9994 | 0.883     | nan            | nan         | 0.6731   | 0.905                  |
| DistilBertForMaskedLM                   | 64  | 1.0    | 0.8899    | 0.3665         | nan         | 0.6531   | 0.9124                 |
| BertForMaskedLM                         | 64  | 1.0    | 0.9219    | 0.3646         | nan         | 0.6385   | 0.8993                 |
| RobertaForCausalLM                      | 64  | 0.9986 | 0.9206    | 0.3641         | nan         | 0.6375   | 0.8975                 |
| RobertaForQuestionAnswering             | 128 | 1.0    | 0.968     | nan            | nan         | 0.6329   | 0.8939                 |
| BertForQuestionAnswering                | 128 | 1.0    | 0.968     | nan            | nan         | 0.6329   | 0.8939                 |
| MobileBertForMaskedLM                   | 32  | 0.9998 | 0.9103    | nan            | nan         | 0.5256   | 0.7111                 |
| MobileBertForQuestionAnswering          | 64  | 1.0    | 0.984     | nan            | nan         | 0.4536   | 0.5968                 |
| DebertaForMaskedLM                      | 4   | 1.0    | 0.9851    | 0.3553         | nan         | 0.4267   | 1.0347                 |
| DebertaForQuestionAnswering             | 8   | 0.9816 | 1.063     | 0.3072         | nan         | 0.3264   | 1.1588                 |
| AllenaiLongformerBase                   | 1   | 0.9981 | 0.9515    | 0.3209         | nan         | nan      | nan                    |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
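
The summary at the top of this comment reports a geometric mean speedup per backend after removing failing models. A minimal sketch of that aggregation over one column of a speedup table follows. It assumes that a `0.0` entry marks a model that failed to run or failed accuracy and should be skipped; that reading, and the helper name `geomean_speedup`, are assumptions for illustration and not taken from the benchmark scripts.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive speedups in a column.

    Entries that are 0.0 or nan are skipped; the `s > 0` test is False for
    nan, so it filters both. Returns nan if nothing remains.
    """
    valid = [s for s in speedups if s > 0]
    if not valid:
        return float("nan")
    # Average in log space, then exponentiate, to avoid overflow on long columns.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Three inductor speedups from the huggingface table plus one failed model (0.0).
sample = [3.7345, 2.3889, 2.2531, 0.0]
print(f"{geomean_speedup(sample):.2f}x")
```

Skipping failures rather than counting them as zero matters here: a single zero would force the geometric mean of the whole column to zero.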

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| ghostnet_100                    | 128 | 0.9992 | 0.9956    | 0.8421         | 1.2485      | 1.8144   | 1.7733                 |
| lcnet_050                       | 128 | 0.9568 | 0.9489    | 0.7675         | 1.4962      | 1.6425   | 1.6316                 |
| coat_lite_mini                  | 128 | 1.0    | 1.0       | 0.8447         | 1.0566      | 1.6056   | 1.5895                 |
| regnety_002                     | 128 | 0.9778 | 0.9844    | 0.8615         | 1.3561      | 1.4813   | 1.3447                 |
| dm_nfnet_f0                     | 128 | 1.0    | 1.0003    | 0.0            | 1.2124      | 1.4725   | 1.422                  |
| xcit_large_24_p8_224            | 5   | 1.003  | 1.0032    | 0.0            | 0.0         | 1.4529   | 1.4094                 |
| hrnet_w18                       | 128 | 0.9999 | 0.9985    | 0.0            | 1.3201      | 1.418    | 1.3775                 |
| volo_d1_224                     | 64  | 0.9999 | 0.9959    | 0.0            | 1.1295      | 1.3859   | 1.3634                 |
| dla102                          | 128 | 1.0002 | 1.0008    | 0.0            | 1.2853      | 1.3821   | 1.3693                 |
| nfnet_l0                        | 128 | 0.9997 | 0.7891    | 0.0            | 1.0518      | 1.3733   | 1.3288                 |
| res2net50_14w_8s                | 128 | 0.9999 | 1.0       | 0.0            | 1.2307      | 1.3564   | 1.3208                 |
| mobilenetv2_100                 | 128 | 0.9662 | 0.9648    | 0.7065         | 1.0145      | 1.3373   | 1.3526                 |
| mobilenetv3_large_100           | 128 | 0.9664 | 0.9632    | 0.7654         | 1.1624      | 1.3356   | 1.3413                 |
| crossvit_9_240                  | 128 | 0.9999 | 0.9988    | 0.0            | 1.0243      | 1.3305   | 1.3051                 |
| adv_inception_v3                | 128 | 1.0    | 0.999     | 0.0            | 1.1253      | 1.328    | 1.3083                 |
| gluon_inception_v3              | 128 | 1.0    | 0.9988    | 0.0            | 1.1224      | 1.3249   | 1.3075                 |
| inception_v3                    | 128 | 1.0    | 0.999     | 0.0            | 1.1257      | 1.3244   | 1.3076                 |
| res2next50                      | 128 | 1.0    | 1.0009    | 0.0            | 1.166       | 1.3121   | 1.2748                 |
| resnest101e                     | 64  | 1.0001 | 1.0035    | 0.0            | 1.1963      | 1.3115   | 1.2714                 |
| gmixer_24_224                   | 128 | 0.9999 | 0.8348    | 0.0            | 0.98        | 1.2974   | 1.2696                 |
| fbnetv3_b                       | 128 | 0.9642 | 0.9614    | 0.7623         | 1.1326      | 1.283    | 1.2951                 |
| botnet26t_256                   | 128 | 0.9851 | 0.9857    | 0.7892         | 1.2271      | 1.2742   | 1.2801                 |
| jx_nest_base                    | 32  | 0.9998 | 0.9926    | 0.0            | 1.217       | 1.2725   | 1.2481                 |
| sebotnet33ts_256                | 64  | 0.9753 | 0.8072    | 0.0            | 1.0528      | 1.2706   | 1.2762                 |
| eca_botnext26ts_256             | 128 | 0.9867 | 0.7721    | 0.0            | 1.0301      | 1.2706   | 1.2477                 |
| selecsls42b                     | 128 | 0.9998 | 0.9991    | 0.8157         | 1.2083      | 1.2671   | 1.2514                 |
| tf_efficientnet_b0              | 128 | 0.9776 | 0.7843    | 0.0            | 0.9848      | 1.2613   | 1.2686                 |
| mnasnet_100                     | 128 | 0.9663 | 0.9639    | 0.7855         | 1.1575      | 1.2598   | 1.2787                 |
| eca_halonext26ts                | 128 | 0.9877 | 0.7787    | 0.0            | 1.0289      | 1.2502   | 1.2494                 |
| fbnetc_100                      | 128 | 0.967  | 0.9622    | 0.7908         | 1.1879      | 1.2497   | 1.2635                 |
| ese_vovnet19b_dw                | 128 | 0.9795 | 0.9777    | 0.7445         | 1.1452      | 1.2404   | 1.2461                 |
| spnasnet_100                    | 128 | 0.9605 | 0.9573    | 0.7734         | 1.1366      | 1.2375   | 1.2543                 |
| cspdarknet53                    | 64  | 0.9581 | 0.9526    | 0.7322         | 1.1835      | 1.2287   | 1.2391                 |
| res2net101_26w_4s               | 64  | 0.9997 | 0.9972    | 0.7705         | 1.1739      | 1.2283   | 1.1885                 |
| convit_base                     | 64  | 0.9998 | 0.9992    | 0.0            | 1.195       | 1.2216   | 1.2164                 |
| pit_b_224                       | 64  | 1.0001 | 0.9996    | 0.0            | 1.055       | 1.221    | 1.211                  |
| gmlp_s16_224                    | 128 | 1.0    | 0.9994    | 0.0            | 0.9989      | 1.2164   | 1.2053                 |
| rexnet_100                      | 128 | 0.9723 | 0.8169    | 0.0            | 0.9835      | 1.2142   | 1.2193                 |
| pnasnet5large                   | 16  | 0.9998 | 0.9985    | 0.0            | 1.0838      | 1.2112   | 1.1932                 |
| tinynet_a                       | 128 | 0.9659 | 0.7757    | 0.6205         | 0.9713      | 1.1925   | 1.1949                 |
| cait_m36_384                    | 4   | 0.9998 | 0.0       | 0.0            | 0.0         | 1.1826   | 1.158                  |
| tf_mixnet_l                     | 128 | 0.9853 | 0.8897    | 0.0            | 1.0177      | 1.173    | 1.1697                 |
| dpn107                          | 32  | 0.958  | 0.9367    | 0.7817         | 1.0288      | 1.1726   | 1.202                  |
| mobilevit_s                     | 64  | 0.9792 | 0.762     | 0.0            | 0.9468      | 1.1702   | 1.1666                 |
| repvgg_a2                       | 128 | 0.9641 | 0.9623    | 0.8288         | 1.1224      | 1.1692   | 1.1652                 |
| poolformer_m36                  | 64  | 0.9998 | 0.9993    | 0.0            | 0.0         | 1.1661   | 1.1475                 |
| mixnet_l                        | 128 | 0.9849 | 0.8858    | 0.0            | 1.0185      | 1.1534   | 1.1505                 |
| twins_pcpvt_base                | 64  | 1.0001 | 0.9974    | 0.75           | 1.0624      | 1.148    | 1.1172                 |
| swin_base_patch4_window7_224    | 64  | 0.9999 | 0.9785    | 0.0            | 0.9932      | 1.1469   | 1.1322                 |
| convnext_base                   | 64  | 0.9999 | 0.9988    | 0.0            | 1.0441      | 1.1157   | 1.1262                 |
| beit_base_patch16_224           | 64  | 0.9998 | 0.9801    | 0.0            | 0.9504      | 1.1141   | 1.1053                 |
| swsl_resnext101_32x16d          | 32  | 1.0001 | 0.9988    | 0.0            | 1.1071      | 1.1068   | 1.0712                 |
| deit_base_distilled_patch16_224 | 64  | 1.0    | 0.9995    | 0.7673         | 1.0156      | 1.0955   | 1.0834                 |
| gluon_xception65                | 32  | 0.9998 | 0.9975    | 0.0            | 1.0403      | 1.0871   | 1.0759                 |
| vit_base_patch16_224            | 64  | 1.0002 | 0.999     | 0.7662         | 0.9763      | 1.0855   | 1.0734                 |
| mixer_b16_224                   | 128 | 1.0006 | 1.0001    | 0.0            | 0.9771      | 1.0808   | 1.0736                 |
| convmixer_768_32                | 32  | 0.9999 | 1.0002    | 0.0            | 1.0615      | 1.0783   | 1.0744                 |
| gernet_l                        | 128 | 0.9744 | 0.9723    | 0.8239         | 1.0992      | 1.075    | 1.0704                 |
| visformer_small                 | 128 | 1.0001 | 1.0022    | 0.797          | 1.0217      | 1.0495   | 1.0162                 |
| resmlp_12_224                   | 128 | 0.9999 | 1.001     | 0.6956         | 0.0         | 0.9499   | 0.9719                 |
| tnt_s_patch16_224               | 128 | 1.0    | 0.9992    | 0.0            | 1.6263      | 0.0      | 1.5436                 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
Accuracy
~~~
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
| name                            | bs | eager | aot_eager   | aot_cudagraphs | aot_nvfuser   | inductor      | inductor_no_cudagraphs |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
| adv_inception_v3                | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| botnet26t_256                   | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| sebotnet33ts_256                | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| selecsls42b                     | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| spnasnet_100                    | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| swsl_resnext101_32x16d          | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| tf_efficientnet_b0              | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| tf_mixnet_l                     | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| tinynet_a                       | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| twins_pcpvt_base                | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| visformer_small                 | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| vit_base_patch16_224            | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| beit_base_patch16_224           | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| convnext_base                   | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| crossvit_9_240                  | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| dm_nfnet_f0                     | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| gmixer_24_224                   | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| gmlp_s16_224                    | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| jx_nest_base                    | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| swin_base_patch4_window7_224    | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| tnt_s_patch16_224               | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| volo_d1_224                     | 2  | pass  | pass        | fail_to_run    | pass          | pass          | pass                   |
| resmlp_12_224                   | 2  | pass  | pass        | pass           | fail_to_run   | pass          | pass                   |
| convit_base                     | 2  | pass  | pass        | fail_to_run    | fail_to_run   | pass          | pass                   |
| xcit_large_24_p8_224            | 2  | pass  | pass        | fail_to_run    | fail_to_run   | pass          | pass                   |
| cait_m36_384                    | 2  | pass  | fail_to_run | fail_to_run    | fail_to_run   | pass          | pass                   |
| gluon_xception65                | 2  | pass  | pass        | pass           | fail_accuracy | pass          | pass                   |
| poolformer_m36                  | 2  | pass  | pass        | pass           | fail_accuracy | pass          | pass                   |
| deit_base_distilled_patch16_224 | 2  | pass  | pass        | pass           | pass          | pass          | fail_accuracy          |
| rexnet_100                      | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| res2next50                      | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| res2net50_14w_8s                | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| res2net101_26w_4s               | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| coat_lite_mini                  | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| convmixer_768_32                | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| cspdarknet53                    | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| dla102                          | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| dpn107                          | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| eca_botnext26ts_256             | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| eca_halonext26ts                | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| ese_vovnet19b_dw                | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| fbnetc_100                      | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| gernet_l                        | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| ghostnet_100                    | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| gluon_inception_v3              | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| hrnet_w18                       | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| inception_v3                    | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| lcnet_050                       | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| mixer_b16_224                   | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| mixnet_l                        | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| mnasnet_100                     | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| mobilenetv2_100                 | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| mobilenetv3_large_100           | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| mobilevit_s                     | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| nfnet_l0                        | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| pit_b_224                       | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| pnasnet5large                   | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| regnety_002                     | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| repvgg_a2                       | 2  | pass  | pass        | pass           | pass          | pass          | pass                   |
| fbnetv3_b                       | 2  | pass  | pass        | pass           | pass          | fail_accuracy | fail_accuracy          |
| resnest101e                     | 2  | pass  | pass        | pass           | fail_accuracy | fail_accuracy | fail_accuracy          |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
~~~
Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| twins_pcpvt_base                | 64  | 2.064  | 13.0072   | 21.5012        | 42.855      | 431.1592 | 426.4103               |
| coat_lite_mini                  | 128 | 1.0194 | 5.4653    | 7.961          | 14.7686     | 362.4216 | 372.6703               |
| mobilevit_s                     | 64  | 1.5683 | 7.1641    | nan            | 42.4621     | 233.8428 | 237.9062               |
| eca_halonext26ts                | 128 | 1.4144 | 5.4751    | nan            | 55.2357     | 204.8437 | 207.0974               |
| sebotnet33ts_256                | 64  | 1.7651 | 6.6709    | nan            | 51.039      | 185.8238 | 191.2608               |
| eca_botnext26ts_256             | 128 | 1.3797 | 5.2911    | nan            | 52.9221     | 179.8768 | 176.7545               |
| swin_base_patch4_window7_224    | 64  | 2.5123 | 12.7354   | nan            | 58.0591     | 177.0112 | 174.7488               |
| xcit_large_24_p8_224            | 5   | 2.603  | 17.1709   | nan            | nan         | 172.3324 | 164.8544               |
| jx_nest_base                    | 32  | 1.6708 | 9.2321    | nan            | 57.8786     | 155.4547 | 156.5451               |
| convnext_base                   | 64  | 1.2341 | 5.9929    | nan            | 20.8438     | 133.0295 | 129.8216               |
| cait_m36_384                    | 4   | 2.6486 | nan       | nan            | nan         | 132.7509 | 130.12                 |
| hrnet_w18                       | 128 | 5.6217 | 31.9848   | nan            | 251.7181    | 106.8258 | 100.7524               |
| botnet26t_256                   | 128 | 1.3057 | 4.4635    | 10.0598        | 40.2751     | 106.2411 | 103.5341               |
| crossvit_9_240                  | 128 | 1.3396 | 7.9862    | nan            | 27.0701     | 97.9064  | 96.8689                |
| resnest101e                     | 64  | 2.998  | 16.9945   | nan            | 78.2291     | 93.9541  | 89.7619                |
| pnasnet5large                   | 16  | 4.1626 | 22.9703   | nan            | 123.7628    | 87.4338  | 84.1545                |
| volo_d1_224                     | 64  | 1.1595 | 7.6273    | nan            | 28.0879     | 85.2424  | 83.6849                |
| gmlp_s16_224                    | 128 | 0.9511 | 6.2939    | nan            | 13.365      | 71.7498  | 69.4367                |
| visformer_small                 | 128 | 0.9009 | 4.189     | 6.2793         | 24.3038     | 71.1462  | 69.6831                |
| pit_b_224                       | 64  | 0.9339 | 4.8631    | nan            | 12.5251     | 66.2774  | 65.1378                |
| res2net101_26w_4s               | 64  | 2.9852 | 17.3432   | 28.4155        | 80.897      | 55.6027  | 52.0513                |
| gmixer_24_224                   | 128 | 1.0133 | 7.3092    | nan            | 16.5474     | 51.9895  | 50.5586                |
| convit_base                     | 64  | 0.9843 | 5.9421    | nan            | 18.0525     | 50.9922  | 49.952                 |
| res2net50_14w_8s                | 128 | 2.5693 | 15.6494   | nan            | 98.8662     | 50.8157  | 49.7271                |
| gluon_xception65                | 32  | 1.6885 | 11.1965   | nan            | 41.7582     | 49.2318  | 45.5937                |
| poolformer_m36                  | 64  | 1.8121 | 9.7062    | nan            | nan         | 47.0371  | 44.6651                |
| resmlp_12_224                   | 128 | 0.6088 | 2.794     | 5.5064         | nan         | 42.3381  | 38.0426                |
| swsl_resnext101_32x16d          | 32  | 1.6289 | 10.0288   | nan            | 39.6141     | 41.9677  | 41.3616                |
| dpn107                          | 32  | 3.7727 | 14.7274   | 45.6394        | 76.1359     | 40.3245  | 37.6555                |
| mixer_b16_224                   | 128 | 0.6548 | 3.2155    | nan            | 10.7856     | 37.0102  | 35.4768                |
| deit_base_distilled_patch16_224 | 64  | 0.8289 | 4.303     | 6.6094         | 10.4203     | 36.0592  | 34.6956                |
| convmixer_768_32                | 32  | 1.0862 | 6.4498    | nan            | 13.7196     | 35.8067  | 33.0945                |
| fbnetv3_b                       | 128 | 3.0734 | 11.1026   | 29.9803        | 76.0043     | 35.7771  | 33.8855                |
| vit_base_patch16_224            | 64  | 0.8583 | 4.1826    | 6.5315         | 9.6845      | 35.7583  | 35.0589                |
| gluon_inception_v3              | 128 | 1.4815 | 8.9849    | nan            | 66.9443     | 35.0345  | 32.4497                |
| inception_v3                    | 128 | 1.4787 | 9.0238    | nan            | 67.1459     | 34.8548  | 32.5473                |
| adv_inception_v3                | 128 | 1.4876 | 8.9769    | nan            | 66.9311     | 34.3905  | 32.5332                |
| tf_mixnet_l                     | 128 | 5.7484 | 13.3541   | nan            | 68.7911     | 33.8729  | 32.1963                |
| ghostnet_100                    | 128 | 2.6432 | 9.6507    | 13.7666        | 58.927      | 32.695   | 30.8681                |
| beit_base_patch16_224           | 64  | 1.0871 | 5.6134    | nan            | 13.7621     | 32.6318  | 30.8008                |
| mixnet_l                        | 128 | 5.3204 | 12.7271   | nan            | 67.9763     | 32.5983  | 31.893                 |
| dm_nfnet_f0                     | 128 | 2.0094 | 7.6042    | nan            | 29.9754     | 32.3805  | 29.3454                |
| dla102                          | 128 | 1.6603 | 10.0975   | nan            | 63.1714     | 32.1124  | 30.2312                |
| res2next50                      | 128 | 1.4989 | 8.7791    | nan            | 66.7002     | 29.6202  | 27.9053                |
| rexnet_100                      | 128 | 1.8062 | 7.4568    | nan            | 102.1027    | 26.5523  | 25.3591                |
| tinynet_a                       | 128 | 1.9614 | 8.2078    | 20.2872        | 61.7507     | 25.7941  | 24.6542                |
| cspdarknet53                    | 64  | 2.2264 | 7.7188    | 20.8213        | 48.0307     | 23.2515  | 22.0433                |
| nfnet_l0                        | 128 | 1.7245 | 7.5828    | nan            | 27.3095     | 23.1165  | 21.8966                |
| tf_efficientnet_b0              | 128 | 1.7202 | 6.9673    | nan            | 61.9316     | 22.7574  | 21.5149                |
| fbnetc_100                      | 128 | 1.9567 | 6.9499    | 18.078         | 45.3002     | 21.9517  | 20.7368                |
| spnasnet_100                    | 128 | 1.9161 | 6.665     | 17.4815        | 43.4797     | 21.4795  | 20.4556                |
| mobilenetv3_large_100           | 128 | 1.5899 | 5.5688    | 13.4352        | 64.4429     | 19.9372  | 19.5642                |
| mnasnet_100                     | 128 | 1.6356 | 5.5127    | 14.0767        | 37.4665     | 18.8558  | 18.0133                |
| mobilenetv2_100                 | 128 | 1.6442 | 5.4933    | 13.7945        | 37.5793     | 18.5669  | 17.7858                |
| gernet_l                        | 128 | 1.8816 | 6.4469    | 16.2236        | 35.9904     | 18.4345  | 17.2115                |
| repvgg_a2                       | 128 | 1.8567 | 6.1905    | 15.7371        | 43.751      | 17.9569  | 16.9557                |
| regnety_002                     | 128 | 1.4855 | 5.8417    | 13.8786        | 46.2472     | 17.8219  | 17.3541                |
| selecsls42b                     | 128 | 0.7717 | 4.0352    | 5.8995         | 39.8612     | 16.4046  | 15.3492                |
| lcnet_050                       | 128 | 0.9705 | 3.4278    | 7.1291         | 31.167      | 13.6937  | 12.51                  |
| ese_vovnet19b_dw                | 128 | 0.9768 | 3.251     | 6.9304         | 30.8107     | 12.7375  | 11.8284                |
| tnt_s_patch16_224               | 128 | 1.4723 | 10.2065   | nan            | 22.8828     | nan      | 50.0197                |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name                            | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| gmixer_24_224                   | 128 | 0.9951 | 0.9716    | nan            | 0.9859      | 1.5612   | 1.6333                 |
| tinynet_a                       | 128 | 0.9942 | 0.7796    | 0.2617         | 0.7823      | 1.351    | 1.3692                 |
| nfnet_l0                        | 128 | 0.993  | 0.8272    | nan            | 0.8084      | 1.2908   | 1.3392                 |
| rexnet_100                      | 128 | 0.9935 | 0.7843    | nan            | 0.8682      | 1.2619   | 1.2765                 |
| tf_efficientnet_b0              | 128 | 0.9935 | 0.7688    | nan            | 0.8401      | 1.1889   | 1.199                  |
| pnasnet5large                   | 16  | 1.069  | 1.011     | nan            | 1.2062      | 1.1876   | 1.3282                 |
| mobilevit_s                     | 64  | 0.9959 | 0.7668    | nan            | 0.7405      | 1.1793   | 1.2286                 |
| eca_botnext26ts_256             | 128 | 0.9938 | 0.7675    | nan            | 0.7612      | 1.1378   | 1.2076                 |
| eca_halonext26ts                | 128 | 0.9937 | 0.7687    | nan            | 0.7643      | 1.1375   | 1.2068                 |
| cait_m36_384                    | 4   | 0.9994 | nan       | nan            | nan         | 1.1185   | 1.1745                 |
| mobilenetv2_100                 | 128 | 0.9925 | 0.7621    | 0.3063         | 0.7635      | 1.1003   | 1.1104                 |
| poolformer_m36                  | 64  | 0.998  | 0.9512    | nan            | nan         | 1.0527   | 1.069                  |
| dm_nfnet_f0                     | 128 | 0.9358 | 0.8936    | nan            | 0.9479      | 1.0218   | 1.0495                 |
| beit_base_patch16_224           | 64  | 0.9966 | 0.9545    | nan            | 0.8606      | 1.0038   | 1.0607                 |
| resnest101e                     | 64  | 0.9971 | 0.9519    | nan            | 0.95        | 0.9994   | 1.0025                 |
| vit_base_patch16_224            | 64  | 0.9963 | 0.9434    | 0.3153         | 0.8229      | 0.997    | 1.0835                 |
| deit_base_distilled_patch16_224 | 64  | 0.9964 | 0.9442    | 0.3138         | 0.8242      | 0.9925   | 1.0805                 |
| twins_pcpvt_base                | 64  | 0.9976 | 0.9195    | 0.3131         | 0.8403      | 0.9888   | 1.0866                 |
| ghostnet_100                    | 128 | 0.9865 | 0.8768    | 0.3273         | 0.9345      | 0.9853   | 1.0102                 |
| mixer_b16_224                   | 128 | 0.9952 | 0.9661    | nan            | 0.8571      | 0.985    | 1.0538                 |
| convmixer_768_32                | 32  | 0.9986 | 0.9854    | nan            | 0.9793      | 0.9836   | 0.9853                 |
| volo_d1_224                     | 64  | 0.996  | 0.9213    | nan            | 0.7472      | 0.9799   | 0.9971                 |
| gmlp_s16_224                    | 128 | 0.9959 | 0.9783    | nan            | 0.9704      | 0.9766   | 0.9827                 |
| tf_mixnet_l                     | 128 | 0.9953 | 0.857     | nan            | 0.8574      | 0.9711   | 1.0812                 |
| fbnetv3_b                       | 128 | 0.9932 | 0.7828    | 0.3095         | 0.784       | 0.9696   | 0.977                  |
| xcit_large_24_p8_224            | 5   | 0.9981 | 0.9194    | nan            | nan         | 0.9611   | 1.0549                 |
| convnext_base                   | 64  | 0.9975 | 0.9169    | nan            | 0.7604      | 0.9576   | 0.9855                 |
| dla102                          | 128 | 0.9831 | 0.917     | nan            | 0.9529      | 0.9496   | 0.9538                 |
| hrnet_w18                       | 128 | 0.9954 | 0.9252    | nan            | 0.8649      | 0.9376   | 0.9419                 |
| gluon_xception65                | 32  | 0.9975 | 0.9365    | nan            | 0.8982      | 0.9351   | 0.9376                 |
| res2net101_26w_4s               | 64  | 0.9968 | 0.9278    | 0.3243         | 0.8932      | 0.9269   | 0.9548                 |
| jx_nest_base                    | 32  | 1.0002 | 0.8966    | nan            | 0.7112      | 0.9187   | 1.0509                 |
| ese_vovnet19b_dw                | 128 | 0.9923 | 0.8877    | 0.3261         | 0.9302      | 0.9095   | 0.9161                 |
| swin_base_patch4_window7_224    | 64  | 0.9976 | 0.9288    | nan            | 0.83        | 0.9068   | 1.0518                 |
| dpn107                          | 32  | 0.9985 | 0.9271    | 0.3392         | 0.8941      | 0.9058   | 0.956                  |
| res2next50                      | 128 | 0.9951 | 0.9153    | nan            | 0.8618      | 0.9051   | 0.9312                 |
| spnasnet_100                    | 128 | 0.989  | 0.9109    | 0.3309         | 0.8412      | 0.9047   | 0.9157                 |
| mixnet_l                        | 128 | 0.9951 | 0.845     | nan            | 0.7911      | 0.9014   | 1.0067                 |
| mobilenetv3_large_100           | 128 | 0.9876 | 0.8589    | 0.3244         | 0.8745      | 0.9007   | 0.9126                 |
| visformer_small                 | 128 | 0.9943 | 0.9381    | 0.3293         | 0.9475      | 0.9006   | 0.951                  |
| selecsls42b                     | 128 | 0.9883 | 0.8896    | 0.337          | 0.8954      | 0.899    | 0.9192                 |
| adv_inception_v3                | 128 | 0.9901 | 0.8617    | nan            | 0.8724      | 0.8983   | 0.9073                 |
| gluon_inception_v3              | 128 | 0.9901 | 0.8617    | nan            | 0.8724      | 0.8983   | 0.9073                 |
| inception_v3                    | 128 | 0.9901 | 0.8617    | nan            | 0.8724      | 0.8983   | 0.9073                 |
| mnasnet_100                     | 128 | 0.9877 | 0.9019    | 0.3306         | 0.8279      | 0.8961   | 0.9077                 |
| swsl_resnext101_32x16d          | 32  | 0.9991 | 0.8972    | nan            | 0.8675      | 0.8931   | 0.9249                 |
| lcnet_050                       | 128 | 0.9672 | 0.7521    | 0.3171         | 0.7524      | 0.8921   | 0.923                  |
| cspdarknet53                    | 64  | 0.9954 | 0.8528    | 0.316          | 0.8762      | 0.8835   | 0.8875                 |
| res2net50_14w_8s                | 128 | 0.9952 | 0.9049    | nan            | 0.8611      | 0.881    | 0.9327                 |
| regnety_002                     | 128 | 0.9717 | 0.8104    | 0.3283         | 0.7599      | 0.8617   | 0.8993                 |
| botnet26t_256                   | 128 | 0.9915 | 0.8434    | 0.3165         | 0.745       | 0.8605   | 0.8702                 |
| pit_b_224                       | 64  | 0.9968 | 0.7947    | nan            | 0.6417      | 0.8417   | 1.0633                 |
| fbnetc_100                      | 128 | 0.9891 | 0.8518    | 0.3236         | 0.7446      | 0.8416   | 0.8498                 |
| sebotnet33ts_256                | 64  | 0.9952 | 0.7084    | nan            | 0.6831      | 0.841    | 0.9711                 |
| coat_lite_mini                  | 128 | 1.0049 | 0.8777    | 0.3262         | 0.7873      | 0.8404   | 1.0528                 |
| resmlp_12_224                   | 128 | 0.9893 | 0.943     | 0.2472         | nan         | 0.8169   | 0.8253                 |
| gernet_l                        | 128 | 0.9884 | 0.7892    | 0.32           | 0.7938      | 0.7928   | 0.8234                 |
| repvgg_a2                       | 128 | 0.9867 | 0.8054    | 0.3277         | 0.6573      | 0.7684   | 0.8011                 |
| convit_base                     | 64  |
0.9977 | 0.8838 | nan | 0.9506 | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8657 | nan | 0.7297 | 0.6496 | 0.8704 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | 0.8539 | nan | 0.8623 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

../test-dynamo-runner-logs/huggingface_float32.png : ![](https://i.imgur.com/RAknBx4.png)

../test-dynamo-runner-logs/timm_models_float32.png : ![](https://i.imgur.com/0lqb4db.png)

../test-dynamo-runner-logs/torchbench_float32.png : ![](https://i.imgur.com/Zf2sNxy.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report the mean compilation latency and the peak memory footprint compression ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 54/55 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 95%, 52/55 | 100%, 43/43 | 98%, 60/61  |
|     aot_cudagraphs     | 73%, 40/55 | 47%, 20/43  | 39%, 24/61  |
|      aot_nvfuser       | 58%, 32/55 |  2%, 1/43   | 89%, 54/61  |
|        inductor        | 87%, 48/55 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 91%, 50/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
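
As a rough illustration of how a cell such as `95%, 52/55` can be derived from the per-model accuracy statuses, here is a minimal sketch. The helper name `passrate` is hypothetical, and treating `pass_due_to_skip` as a pass is an assumption, not something the dashboard states.

```python
# Status strings as they appear in the per-suite Accuracy tables.
# Assumption: a skipped accuracy check ("pass_due_to_skip") counts as a pass.
PASSING = {"pass", "pass_due_to_skip"}


def passrate(statuses):
    """Return (passed, total, formatted cell) for one compiler column."""
    total = len(statuses)
    passed = sum(1 for s in statuses if s in PASSING)
    pct = round(100 * passed / total) if total else 0
    return passed, total, f"{pct}%, {passed}/{total}"


# Hypothetical column with five models.
example = ["pass", "pass_due_to_skip", "fail_to_run", "fail_accuracy", "pass"]
print(passrate(example)[2])
```

With 52 passing models out of 55, the same rounding gives `95%, 52/55`, which matches the format of the table above.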

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.09x    |    1.02x    |    1.00x    |
|      aot_nvfuser       |   1.13x    |    1.12x    |    1.11x    |
|        inductor        |   1.48x    |    1.28x    |    1.25x    |
| inductor_no_cudagraphs |   1.22x    |    1.21x    |    1.24x    |
+------------------------+------------+-------------+-------------+
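
For readers unfamiliar with the metric, the sketch below shows one way to compute a geometric mean speedup over a suite. The summary says failing models are removed before aggregation; treating a speedup of `0.0` or `nan` as "failed" is an assumption here, and `geomean_speedup` is a hypothetical helper rather than the dashboard's code.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean over models with a valid (positive, non-nan) speedup."""
    # Assumption: 0.0 and nan mark models that failed and should be dropped.
    valid = [s for s in speedups if not math.isnan(s) and s > 0]
    if not valid:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow on long lists.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Two valid speedups and one failed model: sqrt(2 * 8) = 4.
print(f"{geomean_speedup([2.0, 8.0, 0.0]):.2f}x")
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown cancel out to 1.0x, which an arithmetic mean would not do.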

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.08    |    2.22     |    1.88     |
|       aot_eager        |    6.92    |    9.05     |    8.70     |
|     aot_cudagraphs     |    8.23    |    18.64    |    15.25    |
|      aot_nvfuser       |   20.32    |    9.60     |    50.01    |
|        inductor        |   62.17    |    52.98    |    73.89    |
| inductor_no_cudagraphs |   64.61    |    49.17    |    72.74    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.32x    |
|      aot_nvfuser       |   0.83x    |    1.08x    |    0.84x    |
|        inductor        |   0.82x    |    0.72x    |    0.97x    |
| inductor_no_cudagraphs |   0.94x    |    0.96x    |    1.02x    |
+------------------------+------------+-------------+-------------+
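
The dashboard does not spell out how the compression ratio is defined. A minimal sketch, assuming it is the baseline peak memory divided by the backend's peak memory (so values above 1.0 mean the backend used less memory, consistent with "higher is better"):

```python
import math


def compression_ratio(baseline_peak_bytes, backend_peak_bytes):
    """Assumed definition: baseline peak / backend peak (higher is better)."""
    # A missing, zero, or nan measurement yields nan, like the table's "nan" cells.
    if not backend_peak_bytes or math.isnan(backend_peak_bytes):
        return float("nan")
    return baseline_peak_bytes / backend_peak_bytes


# Hypothetical numbers: the backend peaks at 10,000 bytes vs. 8,000 for the baseline.
print(f"{compression_ratio(8_000, 10_000):.2f}x")
```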

Warnings

Performance speedup warnings
~~~
+-------------+------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench | lennard_jones | 1.818 | 0.9452 |
| torchbench | dlrm | 1.0006 | 0.0 |
| torchbench | nvidia_deeprecommender | 0.904 | 0.9643 |
| torchbench | hf_GPT2_large | 0.0 | 1.3706 |
| torchbench | hf_T5 | 0.0 | 1.5515 |
| torchbench | tacotron2 | 0.0 | 0.9362 |
| torchbench | hf_Longformer | 0.0 | 0.0 |
| torchbench | moco | 0.0 | 0.0 |
| huggingface | AllenaiLongformerBase | 0.0 | 0.0 |
| timm_models | resmlp_12_224 | 0.9499 | 0.9719 |
| timm_models | tnt_s_patch16_224 | 0.0 | 1.5436 |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-----------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------+----------+------------------------+
| torchbench | timm_efficientdet | 484.0577 | 488.767 |
| torchbench | yolov3 | 419.4861 | 419.8955 |
| torchbench | hf_T5_large | 205.3317 | 202.2279 |
| torchbench | timm_vision_transformer | 153.43 | 160.5928 |
| torchbench | speech_transformer | 152.3735 | 147.9389 |
| torchbench | timm_resnest | 150.1654 | 145.0659 |
| torchbench | attention_is_all_you_need_pytorch | 137.7387 | 139.7203 |
| torchbench | timm_vision_transformer_large | 126.2802 | 123.9619 |
| torchbench | dlrm | 3.4517 | nan |
| torchbench | hf_GPT2_large | nan | 143.1625 |
| torchbench | tacotron2 | nan | 106.378 |
| torchbench | hf_T5 | nan | 44.804 |
| torchbench | hf_Longformer | nan | nan |
| torchbench | moco | nan | nan |
| huggingface | XGLMForCausalLM | 203.4086 | 201.0863 |
| huggingface | DebertaForMaskedLM | 163.7151 | 106.9608 |
| huggingface | DebertaForQuestionAnswering | 152.0741 | 118.2059 |
| huggingface | M2M100ForConditionalGeneration | 128.0751 | 124.2115 |
| huggingface | AllenaiLongformerBase | nan | nan |
| timm_models | twins_pcpvt_base | 431.1592 | 426.4103 |
| timm_models | coat_lite_mini | 362.4216 | 372.6703 |
| timm_models | mobilevit_s | 233.8428 | 237.9062 |
| timm_models | eca_halonext26ts | 204.8437 | 207.0974 |
| timm_models | sebotnet33ts_256 | 185.8238 | 191.2608 |
| timm_models | eca_botnext26ts_256 | 179.8768 | 176.7545 |
| timm_models | swin_base_patch4_window7_224 | 177.0112 | 174.7488 |
| timm_models | xcit_large_24_p8_224 | 172.3324 | 164.8544 |
| timm_models | jx_nest_base | 155.4547 | 156.5451 |
| timm_models | convnext_base | 133.0295 | 129.8216 |
| timm_models | cait_m36_384 | 132.7509 | 130.12 |
| timm_models | tnt_s_patch16_224 | nan | 50.0197 |
+-------------+-----------------------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench | timm_resnest | 0.8982 | 0.9697 |
| torchbench | speech_transformer | 0.896 | 0.8996 |
| torchbench | pytorch_CycleGAN_and_pix2pix | 0.8848 | 0.9654 |
| torchbench | hf_Albert | 0.8836 | 1.2215 |
| torchbench | mobilenet_v3_large | 0.8829 | 0.8964 |
| torchbench | hf_T5_large | 0.8737 | 0.922 |
| torchbench | timm_vision_transformer_large | 0.8616 | 1.0285 |
| torchbench | pytorch_unet | 0.859 | 0.8608 |
| torchbench | resnet50 | 0.8564 | 0.8913 |
| torchbench | densenet121 | 0.8562 | 0.9307 |
| torchbench | mnasnet1_0 | 0.8531 | 0.8659 |
| torchbench | hf_Bart | 0.8503 | 1.1284 |
| torchbench | fastNLP_Bert | 0.8354 | 1.0952 |
| torchbench | resnext50_32x4d | 0.8303 | 0.8352 |
| torchbench | BERT_pytorch | 0.825 | 1.0689 |
| torchbench | hf_BigBird | 0.8211 | 1.0393 |
| torchbench | dcgan | 0.767 | 0.7903 |
| torchbench | drq | 0.7632 | 0.8778 |
| torchbench | soft_actor_critic | 0.75 | 0.9991 |
| torchbench | timm_vision_transformer | 0.7478 | 0.8187 |
| torchbench | alexnet | 0.743 | 0.8332 |
| torchbench | timm_vovnet | 0.7286 | 0.7339 |
| torchbench | LearningToPaint | 0.7133 | 0.7462 |
| torchbench | hf_Bert | 0.7048 | 0.985 |
| torchbench | dlrm | 0.7035 | nan |
| torchbench | resnet18 | 0.6902 | 0.7049 |
| torchbench | hf_DistilBert | 0.6596 | 0.9466 |
| torchbench | vgg16 | 0.6471 | 0.6497 |
| torchbench | lennard_jones | 0.5646 | 0.9989 |
| torchbench | nvidia_deeprecommender | 0.5598 | 0.5598 |
| torchbench | attention_is_all_you_need_pytorch | 0.4682 | 0.6183 |
| torchbench | pytorch_struct | 0.4222 | 0.429 |
| torchbench | functorch_dp_cifar10 | 0.4056 | 0.4212 |
| torchbench | hf_Reformer | 0.299 | 0.9882 |
| torchbench | hf_T5 | nan | 1.1507 |
| torchbench | tacotron2 | nan | 1.1496 |
| torchbench | hf_GPT2_large | nan | 1.1258 |
| torchbench | hf_Longformer | nan | nan |
| torchbench | moco | nan | nan |
| huggingface | AlbertForQuestionAnswering | 0.8646 | 1.4039 |
| huggingface | T5Small | 0.8564 | 1.0758 |
| huggingface | PegasusForConditionalGeneration | 0.8436 | 1.0204 |
| huggingface | AlbertForMaskedLM | 0.842 | 1.3737 |
| huggingface | BigBird | 0.8224 | 1.0108 |
| huggingface | T5ForConditionalGeneration | 0.8215 | 1.1049 |
| huggingface | DistillGPT2 | 0.8173 | 0.9383 |
| huggingface | XGLMForCausalLM | 0.8157 | 0.9642 |
| huggingface | YituTechConvBert | 0.808 | 0.8738 |
| huggingface | BartForConditionalGeneration | 0.7817 | 0.9515 |
| huggingface | PegasusForCausalLM | 0.7774 | 0.9692 |
| huggingface | M2M100ForConditionalGeneration | 0.7712 | 1.016 |
| huggingface | GoogleFnet | 0.7698 | 0.9373 |
| huggingface | MT5ForConditionalGeneration | 0.7623 | 0.9396 |
| huggingface | MegatronBertForQuestionAnswering | 0.7528 | 0.9646 |
| huggingface | CamemBert | 0.7492 | 0.9186 |
| huggingface | PLBartForConditionalGeneration | 0.7397 | 0.9638 |
| huggingface | PLBartForCausalLM | 0.7381 | 0.9055 |
| huggingface | MBartForConditionalGeneration | 0.7209 | 0.9059 |
| huggingface | LayoutLMForSequenceClassification | 0.7189 | 1.0246 |
| huggingface | MegatronBertForCausalLM | 0.7161 | 0.9248 |
| huggingface | BartForCausalLM | 0.7149 | 0.9466 |
| huggingface | BlenderbotSmallForCausalLM | 0.7147 | 0.8647 |
| huggingface | ElectraForQuestionAnswering | 0.7054 | 1.0298 |
| huggingface | DistilBertForQuestionAnswering | 0.6981 | 0.9303 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977 | 0.946 |
| huggingface | LayoutLMForMaskedLM | 0.695 | 0.9772 |
| huggingface | MBartForCausalLM | 0.6836 | 0.8978 |
| huggingface | TrOCRForCausalLM | 0.6827 | 0.8876 |
| huggingface | Speech2Text2ForCausalLM | 0.6775 | 0.8801 |
| huggingface | OPTForCausalLM | 0.6761 | 0.8847 |
| huggingface | ElectraForCausalLM | 0.6731 | 0.905 |
| huggingface | DistilBertForMaskedLM | 0.6531 | 0.9124 |
| huggingface | BertForMaskedLM | 0.6385 | 0.8993 |
| huggingface | RobertaForCausalLM | 0.6375 | 0.8975 |
| huggingface | RobertaForQuestionAnswering | 0.6329 | 0.8939 |
| huggingface | BertForQuestionAnswering | 0.6329 | 0.8939 |
| huggingface | MobileBertForMaskedLM | 0.5256 | 0.7111 |
| huggingface | MobileBertForQuestionAnswering | 0.4536 | 0.5968 |
| huggingface | DebertaForMaskedLM | 0.4267 | 1.0347 |
| huggingface | DebertaForQuestionAnswering | 0.3264 | 1.1588 |
| huggingface | AllenaiLongformerBase | nan | nan |
| timm_models | selecsls42b | 0.899 | 0.9192 |
| timm_models | adv_inception_v3 | 0.8983 | 0.9073 |
| timm_models | gluon_inception_v3 | 0.8983 | 0.9073 |
| timm_models | inception_v3 | 0.8983 | 0.9073 |
| timm_models | mnasnet_100 | 0.8961 | 0.9077 |
| timm_models | swsl_resnext101_32x16d | 0.8931 | 0.9249 |
| timm_models | lcnet_050 | 0.8921 | 0.923 |
| timm_models | cspdarknet53 | 0.8835 | 0.8875 |
| timm_models | res2net50_14w_8s | 0.881 | 0.9327 |
| timm_models | regnety_002 | 0.8617 | 0.8993 |
| timm_models | botnet26t_256 | 0.8605 | 0.8702 |
| timm_models | pit_b_224 | 0.8417 | 1.0633 |
| timm_models | fbnetc_100 | 0.8416 | 0.8498 |
| timm_models | sebotnet33ts_256 | 0.841 | 0.9711 |
| timm_models | coat_lite_mini | 0.8404 | 1.0528 |
| timm_models | resmlp_12_224 | 0.8169 | 0.8253 |
| timm_models | gernet_l | 0.7928 | 0.8234 |
| timm_models | repvgg_a2 | 0.7684 | 0.8011 |
| timm_models | convit_base | 0.7463 | 0.9008 |
| timm_models | crossvit_9_240 | 0.6496 | 0.8704 |
| timm_models | tnt_s_patch16_224 | nan | 0.8623 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
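
The comment does not state which rule puts a model into the warning tables. The sketch below shows one plausible rule: flag a model when either inductor column is `nan` or crosses a threshold. The three thresholds are guesses inferred from which rows appear in this comment's tables, and `needs_warning` is a hypothetical helper.

```python
import math

# Assumed thresholds, inferred from the listed rows; not documented settings.
SPEEDUP_MIN = 0.95        # flag speedups below this
COMPILE_SEC_MAX = 120.0   # flag compilation times above this
MEMORY_RATIO_MIN = 0.9    # flag compression ratios below this


def needs_warning(values, minimum=None, maximum=None):
    """True if any value is nan or falls outside the allowed range."""
    for v in values:
        if math.isnan(v):
            return True
        if minimum is not None and v < minimum:
            return True
        if maximum is not None and v > maximum:
            return True
    return False


# (inductor, inductor_no_cudagraphs) pairs copied from this comment's tables.
print(needs_warning((1.818, 0.9452), minimum=SPEEDUP_MIN))              # lennard_jones speedup
print(needs_warning((107.0355, 104.0851), maximum=COMPILE_SEC_MAX))     # pytorch_stargan compile time
print(needs_warning((float("nan"), 0.8623), minimum=MEMORY_RATIO_MIN))  # tnt_s_patch16_224 memory
```

Under these assumed thresholds, the first and third examples are flagged and the second is not, which agrees with the rows that do and do not appear in the warning tables.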

Metrics over time

../test-dynamo-runner-logs/passrate_over_time.png : ![](https://i.imgur.com/w9yIcLr.png)

../test-dynamo-runner-logs/geomean_over_time.png : ![](https://i.imgur.com/y9jXqqk.png)

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | densenet121 | 4 | 1.0028 | 0.9993 | 2.3219 | 1.443 | 5.4438 | 1.3058 | | timm_efficientdet | 1 | 0.9824 | 0.8845 | 0.0 | 0.0 | 4.2758 | 1.526 | | functorch_dp_cifar10 | 64 | 1.0024 | 0.9777 | 2.1532 | 1.1969 | 3.6923 | 1.2407 | | timm_vision_transformer | 8 | 1.0068 | 0.9447 | 1.5339 | 1.3578 | 2.5716 | 1.4121 | | drq | 1 | 1.0315 | 0.8503 | 1.3708 | 1.0638 | 2.4195 | 1.0737 | | resnext50_32x4d | 8 | 1.0007 | 1.079 | 1.2092 | 1.3669 | 2.0959 | 1.2162 | | mobilenet_v3_large | 32 | 1.0078 | 1.1087 | 1.0365 | 1.3781 | 1.9864 | 1.3795 | | BERT_pytorch | 16 | 1.0104 | 0.8854 | 0.0 | 0.0 | 1.9168 | 1.9012 | | resnet18 | 16 | 1.006 | 1.1021 | 1.168 | 1.3958 | 1.8428 | 1.2045 | | pytorch_struct | 200 | 0.9977 | 0.7381 | 0.8734 | 0.8906 | 1.827 | 1.1633 | | lennard_jones | 1000 | 0.976 | 0.8293 | 1.0524 | 1.0142 | 1.818 | 0.9452 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9968 | 0.9377 | 1.2471 | 1.1785 | 1.7636 | 1.3013 | | squeezenet1_1 | 32 | 0.9979 | 0.9923 | 1.0527 | 1.1557 | 1.7406 | 1.2709 | | hf_Albert | 8 | 1.0015 | 0.9976 | 0.752 | 0.0 | 1.6466 | 1.6414 | | dcgan | 32 | 0.9829 | 1.0102 | 1.2585 | 1.1788 | 1.6306 | 1.0725 | | hf_T5_large | 2 | 1.0248 | 0.9068 | 0.0 | 0.0 | 1.5833 | 1.5731 | | speech_transformer | 32 | 1.0038 | 0.9068 | 0.0 | 0.0 | 1.5684 | 1.544 | | shufflenet_v2_x1_0 | 128 | 1.0005 | 1.0532 | 0.8062 | 1.1931 | 1.53 | 1.3689 | | timm_resnest | 32 | 0.9996 | 1.0027 | 0.8044 | 1.1815 | 1.5191 | 1.4517 | | timm_nfnet | 128 | 0.9993 | 0.9999 | 0.0 | 1.2122 | 1.4726 | 1.4222 | | mnasnet1_0 | 32 | 0.9993 | 1.0945 | 0.8568 | 1.2932 | 1.4577 | 1.2734 | | 
mobilenet_v2_quantized_qat | 96 | 1.0016 | 0.978 | 0.0 | 0.0 | 1.4527 | 1.4479 | | mobilenet_v2 | 96 | 0.9998 | 1.0003 | 0.7313 | 1.0443 | 1.4287 | 1.4088 | | hf_GPT2 | 4 | 1.0046 | 0.9827 | 0.738 | 0.0 | 1.4239 | 1.4306 | | soft_actor_critic | 256 | 0.9921 | 0.7715 | 1.1241 | 0.9985 | 1.4185 | 0.9565 | | resnet50_quantized_qat | 32 | 1.0019 | 0.9619 | 0.0 | 0.0 | 1.401 | 1.3947 | | fastNLP_Bert | 6 | 0.9997 | 0.9761 | 0.7528 | 0.0 | 1.3686 | 1.3445 | | timm_efficientnet | 32 | 0.9551 | 0.8076 | 0.7031 | 1.0629 | 1.3353 | 1.2011 | | LearningToPaint | 96 | 1.0048 | 1.0586 | 0.8687 | 1.2057 | 1.2627 | 1.2074 | | pytorch_unet | 1 | 1.0001 | 0.9982 | 0.8464 | 1.0765 | 1.2042 | 1.1861 | | resnet50 | 32 | 0.9994 | 0.9937 | 0.7608 | 1.1612 | 1.204 | 1.1695 | | Super_SloMo | 6 | 1.0003 | 0.9974 | 0.8669 | 0.0 | 1.18 | 1.1645 | | hf_Bart | 4 | 1.0127 | 0.9757 | 0.0 | 0.0 | 1.1721 | 1.1653 | | vgg16 | 64 | 1.0 | 0.999 | 0.859 | 0.9973 | 1.1707 | 1.1652 | | alexnet | 128 | 0.9991 | 0.998 | 0.8031 | 1.0004 | 1.163 | 1.1651 | | hf_Bert | 4 | 1.0214 | 0.944 | 0.7306 | 0.0 | 1.1575 | 1.1396 | | hf_DistilBert | 8 | 0.9999 | 0.9569 | 0.6872 | 0.0 | 1.1481 | 1.1546 | | timm_regnet | 32 | 0.9653 | 0.9617 | 0.7795 | 1.096 | 1.1283 | 1.0941 | | pytorch_stargan | 16 | 0.9997 | 0.983 | 0.866 | 0.9896 | 1.1189 | 1.0913 | | Background_Matting | 4 | 1.0006 | 1.0218 | 0.866 | 1.0816 | 1.1153 | 1.1069 | | hf_Reformer | 4 | 0.9961 | 0.0 | 0.9267 | 0.0 | 1.1095 | 1.1343 | | hf_BigBird | 2 | 0.9915 | 0.939 | 0.9612 | 0.0 | 1.0921 | 1.0042 | | yolov3 | 16 | 1.0 | 0.9954 | 0.7893 | 1.1839 | 1.0795 | 1.0647 | | attention_is_all_you_need_pytorch | 256 | 0.9999 | 0.9726 | 0.0 | 0.0 | 1.047 | 1.033 | | timm_vision_transformer_large | 8 | 0.9982 | 0.9912 | 0.0 | 0.9805 | 1.044 | 1.0331 | | tts_angular | 64 | 0.9937 | 0.964 | 0.9933 | 1.0231 | 1.0136 | 1.0218 | | timm_vovnet | 32 | 0.9102 | 0.9045 | 0.7132 | 0.9774 | 1.0069 | 1.0176 | | dlrm | 2048 | 1.0064 | 1.0734 | 0.0 | 0.0 | 1.0006 | 0.0 | | demucs 
| 4 | 0.9997 | 0.9998 | 0.999 | 0.9999 | 1.0 | 1.0007 | | nvidia_deeprecommender | 256 | 0.9994 | 0.9628 | 0.585 | 0.942 | 0.904 | 0.9643 | | hf_GPT2_large | 4 | 1.0004 | 0.9805 | 0.0 | 0.0 | 0.0 | 1.3706 | | hf_T5 | 8 | 1.0002 | 0.9932 | 0.0 | 0.0 | 0.0 | 1.5515 | | tacotron2 | 64 | 0.981 | 0.8581 | 0.0 | 0.0 | 0.0 | 0.9362 | | hf_Longformer | 2 | 0.9701 | 0.9013 | 0.8196 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | 
fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | 
pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_to_run | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | 0.0000 | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | timm_efficientdet | 1 | 19.5344 | 38.4011 | nan | nan | 484.0577 | 488.767 | | yolov3 | 16 | 2.7711 | 8.6894 | 11.9084 | 43.4046 | 419.4861 | 419.8955 | | hf_T5_large | 
2 | 13.2998 | 41.15 | nan | nan | 205.3317 | 202.2279 | | timm_vision_transformer | 8 | 0.7808 | 4.1474 | 5.8215 | 9.3655 | 153.43 | 160.5928 | | speech_transformer | 32 | 1.5424 | 8.2938 | nan | nan | 152.3735 | 147.9389 | | timm_resnest | 32 | 0.5383 | 2.6812 | 3.7424 | 35.1306 | 150.1654 | 145.0659 | | attention_is_all_you_need_pytorch | 256 | 1.0734 | 7.1292 | nan | nan | 137.7387 | 139.7203 | | timm_vision_transformer_large | 8 | 2.223 | 13.8751 | nan | 24.351 | 126.2802 | 123.9619 | | pytorch_stargan | 16 | 0.3789 | 2.3643 | 3.1326 | 3.9188 | 107.0355 | 104.0851 | | pytorch_struct | 200 | 0.2366 | 0.7827 | 1.3456 | 4.0715 | 99.505 | 98.1575 | | BERT_pytorch | 16 | 1.4194 | 7.614 | nan | nan | 92.0393 | 92.0811 | | fastNLP_Bert | 6 | 1.4306 | 6.6169 | 10.0451 | nan | 65.652 | 63.418 | | hf_GPT2 | 4 | 1.2488 | 6.1179 | 8.8738 | nan | 63.5447 | 63.521 | | hf_Bart | 4 | 1.3924 | 8.089 | nan | nan | 49.9676 | 49.9717 | | densenet121 | 4 | 1.9897 | 13.3477 | 20.1678 | 88.3763 | 45.0957 | 43.7205 | | mobilenet_v3_large | 32 | 0.8275 | 4.8204 | 6.7604 | 53.5764 | 44.9158 | 46.9735 | | hf_Albert | 8 | 1.0066 | 5.8746 | 8.5532 | nan | 41.987 | 41.132 | | hf_BigBird | 2 | 7.3861 | 13.5387 | 29.953 | nan | 41.2734 | 26.6352 | | resnet50_quantized_qat | 32 | 1.061 | 9.0448 | nan | nan | 39.8902 | 40.3176 | | hf_Bert | 4 | 1.312 | 6.2693 | 8.8293 | nan | 39.8395 | 38.7377 | | timm_regnet | 32 | 2.173 | 8.4238 | 20.7651 | 47.6157 | 37.2439 | 35.16 | | hf_Reformer | 4 | 2.3483 | nan | 9.1124 | nan | 36.065 | 30.7238 | | timm_efficientnet | 32 | 1.6787 | 6.665 | 16.1146 | 52.4346 | 34.2419 | 34.4653 | | mnasnet1_0 | 32 | 0.7461 | 4.4921 | 6.4014 | 30.714 | 31.0909 | 30.7546 | | resnet50 | 32 | 0.7937 | 4.9477 | 6.925 | 32.2699 | 31.0875 | 29.832 | | hf_DistilBert | 8 | 0.4278 | 3.0834 | 6.0696 | nan | 30.4362 | 29.5285 | | resnext50_32x4d | 8 | 0.8239 | 4.9203 | 6.8365 | 28.5464 | 30.2931 | 30.0266 | | timm_vovnet | 32 | 1.4222 | 4.5909 | 10.441 | 23.5649 | 30.0127 | 29.7463 
|
| timm_nfnet | 128 | 1.8844 | 7.7171 | nan | 29.8502 | 29.8712 | 28.8763 |
| mobilenet_v2_quantized_qat | 96 | 1.1759 | 8.8754 | nan | nan | 27.0997 | 27.2946 |
| functorch_dp_cifar10 | 64 | 0.3232 | 1.9699 | 2.8309 | 5.5366 | 26.1947 | 24.9937 |
| resnet18 | 16 | 0.3858 | 1.8912 | 2.6752 | 17.5591 | 23.2902 | 20.4971 |
| shufflenet_v2_x1_0 | 128 | 0.8656 | 5.4261 | 7.6883 | 26.8524 | 18.5748 | 17.9867 |
| Super_SloMo | 6 | 0.9695 | 5.0542 | 6.7627 | nan | 17.3419 | 16.4668 |
| Background_Matting | 4 | 0.6979 | 4.5367 | 6.7144 | 29.2894 | 16.7635 | 16.0163 |
| mobilenet_v2 | 96 | 0.7343 | 4.4782 | 6.6781 | 37.1045 | 16.669 | 16.3002 |
| pytorch_unet | 1 | 0.4223 | 2.1063 | 2.9975 | 19.6418 | 8.2272 | 7.7305 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3535 | 2.202 | 3.0539 | 3.8439 | 8.1719 | 8.0926 |
| LearningToPaint | 96 | 0.4124 | 1.9651 | 2.8324 | 23.8303 | 7.2019 | 6.8944 |
| squeezenet1_1 | 32 | 0.2563 | 0.9557 | 1.3863 | 4.5328 | 4.0598 | 3.8616 |
| nvidia_deeprecommender | 256 | 0.1895 | 0.4298 | 0.6854 | 2.4393 | 4.0142 | 3.7143 |
| drq | 1 | 0.1402 | 0.4424 | 0.8198 | 3.4662 | 3.7694 | 3.1945 |
| vgg16 | 64 | 0.1869 | 0.6441 | 1.0464 | 2.4609 | 3.6811 | 3.2422 |
| dlrm | 2048 | 0.4444 | 0.8198 | nan | nan | 3.4517 | nan |
| soft_actor_critic | 256 | 0.2031 | 0.3372 | 0.4948 | 1.5206 | 3.0611 | 2.6231 |
| alexnet | 128 | 0.1421 | 0.4161 | 0.6606 | 2.3558 | 2.9654 | 2.6911 |
| dcgan | 32 | 0.1641 | 0.4494 | 0.6683 | 3.7309 | 2.678 | 2.4053 |
| lennard_jones | 1000 | 0.1381 | 0.289 | 0.4429 | 1.0648 | 1.9631 | 1.736 |
| tts_angular | 64 | 0.2061 | 0.2786 | 0.3976 | 1.0162 | 1.8605 | 1.6749 |
| demucs | 4 | 0.2929 | 0.2934 | 0.2977 | 0.2969 | 0.2011 | 0.1967 |
| hf_GPT2_large | 4 | 4.9818 | 19.3363 | nan | nan | nan | 143.1625 |
| tacotron2 | 64 | 16.7009 | 28.6252 | nan | nan | nan | 106.378 |
| hf_T5 | 8 | 2.1787 | 9.4406 | nan | nan | nan | 44.804 |
| hf_Longformer | 2 | 5.7342 | 13.862 | 78.3703 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | nan | 1.4314 | 1.4314 |
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | nan | 1.4036 | 1.4036 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2637 | 0.7837 | 1.3107 | 1.3377 |
| Super_SloMo | 6 | 1.0024 | 0.9527 | 0.363 | nan | 1.1858 | 1.1912 |
| timm_efficientdet | 1 | 1.0111 | 0.823 | nan | nan | 1.1165 | 1.1428 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.7638 | 1.1005 | 1.1105 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3374 | 0.9742 | 1.0823 | 1.1267 |
| timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.9478 | 1.0219 | 1.0495 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.35 | 0.8662 | 0.9791 | 1.0072 |
| hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3701 | nan | 0.9703 | 1.1094 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9284 | 0.9323 |
| Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | 0.9749 | 0.9212 | 0.9238 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8814 | 0.9151 | 0.919 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | 1.0085 | 0.9023 | 0.9928 |
| timm_resnest | 32 | 0.9935 | 0.8793 | 0.3235 | 0.8021 | 0.8982 | 0.9697 |
| speech_transformer | 32 | 0.9982 | 0.9159 | nan | nan | 0.896 | 0.8996 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9173 | 0.3919 | 0.9169 | 0.8848 | 0.9654 |
| hf_Albert | 8 | 0.9333 | 0.9333 | 0.2846 | nan | 0.8836 | 1.2215 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8681 | 0.8829 | 0.8964 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8737 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | 0.801 | 0.8616 | 1.0285 |
| pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8496 | 0.859 | 0.8608 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.797 | 0.8564 | 0.8913 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3435 | 0.8551 | 0.8562 | 0.9307 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.3331 | 0.8263 | 0.8531 | 0.8659 |
| hf_Bart | 4 | 0.9617 | 0.8598 | nan | nan | 0.8503 | 1.1284 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3385 | nan | 0.8354 | 1.0952 |
| resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3596 | 0.8203 | 0.8303 | 0.8352 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 | 1.0689 |
| hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4301 | nan | 0.8211 | 1.0393 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.7903 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8772 | 0.7632 | 0.8778 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 | 0.9991 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3305 | 0.8104 | 0.7478 | 0.8187 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7455 | 0.743 | 0.8332 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3201 | 0.7741 | 0.7286 | 0.7339 |
| LearningToPaint | 96 | 0.9442 | 0.6896 | 0.3385 | 0.6503 | 0.7133 | 0.7462 |
| hf_Bert | 4 | 0.9683 | 0.9011 | 0.3525 | nan | 0.7048 | 0.985 |
| dlrm | 2048 | 0.7302 | 0.7305 | nan | nan | 0.7035 | nan |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3593 | 0.6971 | 0.6902 | 0.7049 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3212 | nan | 0.6596 | 0.9466 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6639 | 0.6471 | 0.6497 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 1.0947 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | nan | 0.4682 | 0.6183 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.429 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4456 | 0.8227 | 0.4056 | 0.4212 |
| hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.299 | 0.9882 |
| hf_T5 | 8 | 0.9527 | 0.9415 | nan | nan | nan | 1.1507 |
| tacotron2 | 64 | 0.9906 | 1.093 | nan | nan | nan | 1.1496 |
| hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 |
| hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2945 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~
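
For readers who want to analyze these numbers rather than read them by eye, here is a minimal Python sketch that parses one of the bordered tables above into records. The helper names `parse_ascii_table` and `to_float` are hypothetical, and the three sample rows are copied from the torchbench Peak Memory Compression Ratio table. Reading a ratio above 1.0 as "the compiled run peaked at less memory than eager" is an assumption about how the ratio is defined, not something the tables state.

```python
import math

# Three rows copied from the torchbench Peak Memory Compression Ratio table.
SAMPLE = """
+------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | nan | 1.4314 | 1.4314 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6639 | 0.6471 | 0.6497 |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+------------------------+----+--------+-----------+----------------+-------------+----------+------------------------+
"""


def parse_ascii_table(text):
    """Turn a '+---+' bordered, pipe-delimited table into a list of dicts."""
    header = None
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border lines start with '+', so they are skipped
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first pipe row is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def to_float(cell):
    """Convert a cell to float; the literal 'nan' becomes float('nan')."""
    return float(cell)


rows = parse_ascii_table(SAMPLE)

# Assumption: ratio > 1.0 means lower peak memory than eager.
# Comparisons with nan are False, so missing entries drop out naturally.
saved_memory = [r["name"] for r in rows if to_float(r["inductor"]) > 1.0]
print(saved_memory)
```

Keeping the cells as strings until they are needed avoids guessing a type for the `name` column, and `nan` entries need no special casing in the filter.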

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0285 | 0.9414 | 0.0 | 0.0 | 3.7345 | 1.5254 |
| CamemBert | 1 | 1.0493 | 0.9732 | 1.3251 | 0.0 | 2.3889 | 1.5405 |
| MT5ForConditionalGeneration | 8 | 1.0272 | 0.9263 | 0.0 | 0.0 | 2.2531 | 1.9848 |
| DistillGPT2 | 1 | 1.0322 | 0.9458 | 1.0657 | 0.0 | 2.099 | 1.9009 |
| MobileBertForMaskedLM | 32 | 1.023 | 0.9232 | 0.0 | 0.0 | 1.9829 | 1.574 |
| GoogleFnet | 1 | 0.9985 | 0.8173 | 0.9815 | 1.1247 | 1.9188 | 1.1214 |
| GPT2ForSequenceClassification | 4 | 1.0002 | 0.9779 | 0.0 | 0.0 | 1.6662 | 1.6568 |
| T5ForConditionalGeneration | 4 | 1.0029 | 0.9667 | 0.0 | 0.0 | 1.4388 | 1.4275 |
| M2M100ForConditionalGeneration | 8 | 1.0412 | 0.8942 | 1.0013 | 0.0 | 1.4178 | 1.4085 |
| MobileBertForQuestionAnswering | 64 | 1.024 | 0.9187 | 0.0 | 0.0 | 1.4036 | 1.2789 |
| ElectraForCausalLM | 32 | 1.0004 | 0.9312 | 0.0 | 0.0 | 1.3702 | 1.4028 |
| ElectraForQuestionAnswering | 64 | 1.0005 | 0.9844 | 0.0 | 0.0 | 1.3541 | 1.3368 |
| AlbertForQuestionAnswering | 4 | 1.0002 | 1.0018 | 0.0 | 0.0 | 1.2567 | 1.2522 |
| AlbertForMaskedLM | 4 | 0.9993 | 0.9996 | 0.0 | 0.0 | 1.25 | 1.2519 |
| LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9892 | 0.7379 | 0.0 | 1.2473 | 1.2318 |
| T5Small | 1 | 1.0191 | 0.9543 | 0.0 | 0.0 | 1.2442 | 1.2308 |
| PLBartForConditionalGeneration | 16 | 1.0124 | 0.9613 | 0.0 | 0.0 | 1.1874 | 1.188 |
| OPTForCausalLM | 32 | 1.0037 | 0.932 | 0.0 | 0.0 | 1.1825 | 1.1983 |
| XGLMForCausalLM | 8 | 1.0128 | 0.9394 | 0.0 | 0.0 | 1.1706 | 1.1753 |
| LayoutLMForMaskedLM | 16 | 1.0002 | 0.971 | 0.0 | 0.0 | 1.1633 | 1.1716 |
| DistilBertForQuestionAnswering | 64 | 0.9997 | 0.985 | 0.7131 | 0.0 | 1.1444 | 1.1262 |
| RobertaForCausalLM | 64 | 1.0004 | 0.9637 | 0.7465 | 0.0 | 1.1133 | 1.1212 |
| Speech2Text2ForCausalLM | 128 | 0.9989 | 0.9259 | 0.6593 | 0.0 | 1.11 | 1.1484 |
| BigBird | 1 | 0.9894 | 0.937 | 0.991 | 0.0 | 1.1023 | 1.0034 |
| BartForCausalLM | 4 | 1.0007 | 0.9668 | 0.0 | 0.0 | 1.0962 | 1.1067 |
| BartForConditionalGeneration | 2 | 1.0009 | 0.9887 | 0.0 | 0.0 | 1.0962 | 1.0896 |
| MegatronBertForQuestionAnswering | 16 | 1.038 | 1.0104 | 0.7572 | 0.0 | 1.0947 | 1.0716 |
| MBartForConditionalGeneration | 16 | 1.0102 | 0.9766 | 0.0 | 0.0 | 1.0887 | 1.0775 |
| DebertaForMaskedLM | 4 | 0.9321 | 0.8111 | 0.7317 | 0.0 | 1.0885 | 1.0732 |
| MegatronBertForCausalLM | 16 | 1.0332 | 1.0027 | 0.7578 | 0.0 | 1.087 | 1.0785 |
| PegasusForConditionalGeneration | 16 | 1.0101 | 0.9819 | 0.7569 | 0.0 | 1.0857 | 1.0825 |
| BertForQuestionAnswering | 128 | 0.9997 | 0.9882 | 0.0 | 0.0 | 1.0722 | 1.0661 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9942 | 0.0 | 0.0 | 1.0696 | 1.0709 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0005 | 0.9265 | 0.0 | 0.0 | 1.0628 | 1.0696 |
| DebertaForQuestionAnswering | 8 | 0.9976 | 0.9917 | 0.6821 | 0.0 | 1.0623 | 1.2025 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.9519 | 0.7122 | 0.0 | 1.0362 | 1.0546 |
| BertForMaskedLM | 64 | 1.0003 | 0.9524 | 0.7302 | 0.0 | 1.0338 | 1.0381 |
| PLBartForCausalLM | 32 | 1.0055 | 0.9348 | 0.7321 | 0.0 | 1.0224 | 1.0494 |
| BlenderbotSmallForCausalLM | 64 | 1.0022 | 0.9105 | 0.6827 | 0.0 | 1.0131 | 1.0345 |
| TrOCRForCausalLM | 32 | 1.0017 | 0.9556 | 0.0 | 0.0 | 0.9981 | 1.0096 |
| MBartForCausalLM | 32 | 1.0013 | 0.9555 | 0.0 | 0.0 | 0.9967 | 1.0069 |
| PegasusForCausalLM | 32 | 0.9998 | 0.953 | 0.7325 | 0.0 | 0.9888 | 1.0008 |
| AllenaiLongformerBase | 1 | 0.953 | 0.7915 | 0.7884 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| GoogleFnet | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| XGLMForCausalLM | 8 | 2.2364 | 12.2125 | nan | nan | 203.4086 | 201.0863 |
| DebertaForMaskedLM | 4 | 4.684 | 11.0814 | 44.7781 | nan | 163.7151 | 106.9608 |
| DebertaForQuestionAnswering | 8 | 4.5483 | 11.6349 | 43.993 | nan | 152.0741 | 118.2059 |
| M2M100ForConditionalGeneration | 8 | 2.7543 | 15.4794 | 23.643 | nan | 128.0751 | 124.2115 |
| YituTechConvBert | 1 | 2.0946 | 9.5284 | nan | nan | 115.4649 | 119.3641 |
| MT5ForConditionalGeneration | 8 | 3.4744 | 13.6659 | nan | nan | 90.4534 | 91.1223 |
| MobileBertForMaskedLM | 32 | 7.7855 | 27.1609 | nan | nan | 88.9601 | 85.7795 |
| MobileBertForQuestionAnswering | 64 | 7.9327 | 27.5186 | nan | nan | 74.7874 | 71.876 |
| MegatronBertForCausalLM | 16 | 3.0219 | 12.5327 | 19.6699 | nan | 61.5191 | 59.8845 |
| MegatronBertForQuestionAnswering | 16 | 3.0691 | 13.2977 | 19.1034 | nan | 60.2609 | 58.2808 |
| LayoutLMForSequenceClassification | 16 | 1.6734 | 6.6917 | 10.1343 | nan | 59.7267 | 60.187 |
| T5ForConditionalGeneration | 4 | 2.1399 | 8.8895 | nan | nan | 58.3394 | 57.0848 |
| PegasusForConditionalGeneration | 16 | 2.6227 | 14.7158 | 24.2283 | nan | 58.1897 | 54.3056 |
| BartForConditionalGeneration | 2 | 2.8248 | 15.0065 | nan | nan | 57.0652 | 54.7753 |
| T5Small | 1 | 2.1902 | 8.9903 | nan | nan | 55.4364 | 53.2137 |
| MBartForConditionalGeneration | 16 | 2.7868 | 15.512 | nan | nan | 54.3119 | 53.1455 |
| PLBartForConditionalGeneration | 16 | 1.3887 | 8.298 | nan | nan | 47.5246 | 46.3964 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.7139 | 10.0168 | nan | nan | 43.6075 | 41.5748 |
| BigBird | 1 | 7.296 | 13.5333 | 29.6711 | nan | 40.7238 | 26.8699 |
| ElectraForCausalLM | 32 | 1.2891 | 6.2441 | nan | nan | 40.6712 | 39.969 |
| DistillGPT2 | 1 | 0.6422 | 3.1221 | 4.4918 | nan | 33.8479 | 32.6814 |
| LayoutLMForMaskedLM | 16 | 1.6131 | 6.6316 | nan | nan | 32.8126 | 32.5964 |
| BertForMaskedLM | 64 | 1.2973 | 6.3901 | 9.4361 | nan | 32.777 | 31.6779 |
| ElectraForQuestionAnswering | 64 | 1.3222 | 6.4111 | nan | nan | 32.5117 | 31.4854 |
| GPT2ForSequenceClassification | 4 | 1.2751 | 6.1953 | nan | nan | 32.0765 | 31.1399 |
| RobertaForCausalLM | 64 | 1.3104 | 6.1902 | 9.2915 | nan | 28.0396 | 27.4422 |
| BertForQuestionAnswering | 128 | 1.3166 | 6.2802 | nan | nan | 27.7294 | 27.1936 |
| PegasusForCausalLM | 32 | 1.0161 | 5.707 | 8.775 | nan | 27.1087 | 25.1376 |
| MBartForCausalLM | 32 | 0.9522 | 5.5767 | nan | nan | 25.4243 | 24.6154 |
| RobertaForQuestionAnswering | 128 | 1.3205 | 6.387 | nan | nan | 24.5494 | 23.8515 |
| TrOCRForCausalLM | 32 | 0.9241 | 5.5701 | nan | nan | 24.4333 | 24.1797 |
| BartForCausalLM | 4 | 1.0079 | 5.6176 | nan | nan | 24.3593 | 23.6588 |
| AlbertForMaskedLM | 4 | 1.1157 | 5.8703 | nan | nan | 23.8611 | 23.0601 |
| GoogleFnet | 1 | 0.7904 | 3.3495 | 10.4595 | 9.6049 | 23.8114 | 16.1369 |
| BlenderbotSmallForCausalLM | 64 | 0.6439 | 3.7467 | 5.6889 | nan | 23.625 | 22.6972 |
| DistilBertForMaskedLM | 64 | 0.4729 | 2.9552 | 5.8879 | nan | 23.0127 | 22.634 |
| AlbertForQuestionAnswering | 4 | 1.1461 | 5.9483 | nan | nan | 22.7287 | 21.5179 |
| OPTForCausalLM | 32 | 1.0353 | 5.881 | nan | nan | 21.8562 | 20.7457 |
| DistilBertForQuestionAnswering | 64 | 0.4816 | 3.0171 | 5.9235 | nan | 21.8186 | 22.1039 |
| CamemBert | 1 | 1.38 | 6.1479 | 8.5874 | nan | 21.7413 | 21.2151 |
| Speech2Text2ForCausalLM | 128 | 0.577 | 2.9045 | 4.6098 | nan | 19.6271 | 18.24 |
| PLBartForCausalLM | 32 | 0.4938 | 2.9552 | 4.3734 | nan | 18.8954 | 18.2071 |
| AllenaiLongformerBase | 1 | 5.9078 | 14.4262 | 80.0409 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | nan | 1.0596 | 1.1223 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | nan | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9155 | nan | nan | 0.8564 | 1.0758 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | nan | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | nan | 0.842 | 1.3737 |
| BigBird | 1 | 0.999 | 0.9542 | 0.4215 | nan | 0.8224 | 1.0108 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.8215 | 1.1049 |
| DistillGPT2 | 1 | 0.9984 | 0.8218 | 0.3795 | nan | 0.8173 | 0.9383 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | nan | nan | 0.8157 | 0.9642 |
| YituTechConvBert | 1 | 0.9858 | 0.8198 | nan | nan | 0.808 | 0.8738 |
| BartForConditionalGeneration | 2 | 1.0 | 0.893 | nan | nan | 0.7817 | 0.9515 |
| PegasusForCausalLM | 32 | 0.9593 | 0.9232 | 0.3909 | nan | 0.7774 | 0.9692 |
| M2M100ForConditionalGeneration | 8 | 1.007 | 0.9507 | 0.3799 | nan | 0.7712 | 1.016 |
| GoogleFnet | 1 | 0.9983 | 0.9453 | 0.3715 | 1.0813 | 0.7698 | 0.9373 |
| MT5ForConditionalGeneration | 8 | 1.0034 | 0.8861 | nan | nan | 0.7623 | 0.9396 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | nan | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3614 | nan | 0.7492 | 0.9186 |
| PLBartForConditionalGeneration | 16 | 1.0 | 0.8743 | nan | nan | 0.7397 | 0.9638 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | nan | 0.7381 | 0.9055 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | nan | nan | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | nan | 0.7189 | 1.0246 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | nan | 0.7161 | 0.9248 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | nan | nan | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | nan | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3178 | nan | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | nan | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | nan | nan | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | nan | nan | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | nan | 0.6775 | 0.8801 |
| OPTForCausalLM | 32 | 0.9982 | 0.8655 | nan | nan | 0.6761 | 0.8847 |
| ElectraForCausalLM | 32 | 0.9994 | 0.883 | nan | nan | 0.6731 | 0.905 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | nan | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | nan | 0.6385 | 0.8993 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3641 | nan | 0.6375 | 0.8975 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | nan | nan | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | nan | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | nan | 0.4267 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | nan | 0.3264 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9981 | 0.9515 | 0.3209 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
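
The executive summary reports a geometric mean speedup per suite. A rough sketch of how such a number could be derived from one column of a Performance speedup table is below. `geomean_speedup` is a hypothetical helper, and treating a `0.0` entry as a failed run that is dropped before averaging is an assumption drawn from how the zeros line up with `fail_to_run` entries in the Accuracy table, not the dashboard's confirmed method. The four sample values are copied from the `inductor` column of the huggingface table.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups.

    Assumption: entries that are 0.0 or nan mark models that did not
    produce a usable measurement, so they are excluded.
    """
    valid = [s for s in speedups if s > 0 and not math.isnan(s)]
    if not valid:
        return float("nan")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# Values copied from the huggingface Performance speedup table (inductor).
inductor = {
    "YituTechConvBert": 3.7345,
    "CamemBert": 2.3889,
    "TrOCRForCausalLM": 0.9981,
    "AllenaiLongformerBase": 0.0,  # dropped under the assumption above
}

print(f"geomean over valid entries: {geomean_speedup(inductor.values()):.4f}x")
```

A geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown then cancel to 1.0x, which an arithmetic mean would not give.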

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9992 | 0.9956 | 0.8421 | 1.2485 | 1.8144 | 1.7733 |
| lcnet_050 | 128 | 0.9568 | 0.9489 | 0.7675 | 1.4962 | 1.6425 | 1.6316 |
| coat_lite_mini | 128 | 1.0 | 1.0 | 0.8447 | 1.0566 | 1.6056 | 1.5895 |
| regnety_002 | 128 | 0.9778 | 0.9844 | 0.8615 | 1.3561 | 1.4813 | 1.3447 |
| dm_nfnet_f0 | 128 | 1.0 | 1.0003 | 0.0 | 1.2124 | 1.4725 | 1.422 |
| xcit_large_24_p8_224 | 5 | 1.003 | 1.0032 | 0.0 | 0.0 | 1.4529 | 1.4094 |
| hrnet_w18 | 128 | 0.9999 | 0.9985 | 0.0 | 1.3201 | 1.418 | 1.3775 |
| volo_d1_224 | 64 | 0.9999 | 0.9959 | 0.0 | 1.1295 | 1.3859 | 1.3634 |
| dla102 | 128 | 1.0002 | 1.0008 | 0.0 | 1.2853 | 1.3821 | 1.3693 |
| nfnet_l0 | 128 | 0.9997 | 0.7891 | 0.0 | 1.0518 | 1.3733 | 1.3288 |
| res2net50_14w_8s | 128 | 0.9999 | 1.0 | 0.0 | 1.2307 | 1.3564 | 1.3208 |
| mobilenetv2_100 | 128 | 0.9662 | 0.9648 | 0.7065 | 1.0145 | 1.3373 | 1.3526 |
| mobilenetv3_large_100 | 128 | 0.9664 | 0.9632 | 0.7654 | 1.1624 | 1.3356 | 1.3413 |
| crossvit_9_240 | 128 | 0.9999 | 0.9988 | 0.0 | 1.0243 | 1.3305 | 1.3051 |
| adv_inception_v3 | 128 | 1.0 | 0.999 | 0.0 | 1.1253 | 1.328 | 1.3083 |
| gluon_inception_v3 | 128 | 1.0 | 0.9988 | 0.0 | 1.1224 | 1.3249 | 1.3075 |
| inception_v3 | 128 | 1.0 | 0.999 | 0.0 | 1.1257 | 1.3244 | 1.3076 |
| res2next50 | 128 | 1.0 | 1.0009 | 0.0 | 1.166 | 1.3121 | 1.2748 |
| resnest101e | 64 | 1.0001 | 1.0035 | 0.0 | 1.1963 | 1.3115 | 1.2714 |
| gmixer_24_224 | 128 | 0.9999 | 0.8348 | 0.0 | 0.98 | 1.2974 | 1.2696 |
| fbnetv3_b | 128 | 0.9642 | 0.9614 | 0.7623 | 1.1326 | 1.283 | 1.2951 |
| botnet26t_256 | 128 | 0.9851 | 0.9857 | 0.7892 | 1.2271 | 1.2742 | 1.2801 |
| jx_nest_base | 32 | 0.9998 | 0.9926 | 0.0 | 1.217 | 1.2725 | 1.2481 |
| sebotnet33ts_256 | 64 | 0.9753 | 0.8072 | 0.0 | 1.0528 | 1.2706 | 1.2762 |
| eca_botnext26ts_256 | 128 | 0.9867 | 0.7721 | 0.0 | 1.0301 | 1.2706 | 1.2477 |
| selecsls42b | 128 | 0.9998 | 0.9991 | 0.8157 | 1.2083 | 1.2671 | 1.2514 |
| tf_efficientnet_b0 | 128 | 0.9776 | 0.7843 | 0.0 | 0.9848 | 1.2613 | 1.2686 |
| mnasnet_100 | 128 | 0.9663 | 0.9639 | 0.7855 | 1.1575 | 1.2598 | 1.2787 |
| eca_halonext26ts | 128 | 0.9877 | 0.7787 | 0.0 | 1.0289 | 1.2502 | 1.2494 |
| fbnetc_100 | 128 | 0.967 | 0.9622 | 0.7908 | 1.1879 | 1.2497 | 1.2635 |
| ese_vovnet19b_dw | 128 | 0.9795 | 0.9777 | 0.7445 | 1.1452 | 1.2404 | 1.2461 |
| spnasnet_100 | 128 | 0.9605 | 0.9573 | 0.7734 | 1.1366 | 1.2375 | 1.2543 |
| cspdarknet53 | 64 | 0.9581 | 0.9526 | 0.7322 | 1.1835 | 1.2287 | 1.2391 |
| res2net101_26w_4s | 64 | 0.9997 | 0.9972 | 0.7705 | 1.1739 | 1.2283 | 1.1885 |
| convit_base | 64 | 0.9998 | 0.9992 | 0.0 | 1.195 | 1.2216 | 1.2164 |
| pit_b_224 | 64 | 1.0001 | 0.9996 | 0.0 | 1.055 | 1.221 | 1.211 |
| gmlp_s16_224 | 128 | 1.0 | 0.9994 | 0.0 | 0.9989 | 1.2164 | 1.2053 |
| rexnet_100 | 128 | 0.9723 | 0.8169 | 0.0 | 0.9835 | 1.2142 | 1.2193 |
| pnasnet5large | 16 | 0.9998 | 0.9985 | 0.0 | 1.0838 | 1.2112 | 1.1932 |
| tinynet_a | 128 | 0.9659 | 0.7757 | 0.6205 | 0.9713 | 1.1925 | 1.1949 |
| cait_m36_384 | 4 | 0.9998 | 0.0 | 0.0 | 0.0 | 1.1826 | 1.158 |
| tf_mixnet_l | 128 | 0.9853 | 0.8897 | 0.0 | 1.0177 | 1.173 | 1.1697 |
| dpn107 | 32 | 0.958 | 0.9367 | 0.7817 | 1.0288 | 1.1726 | 1.202 |
| mobilevit_s | 64 | 0.9792 | 0.762 | 0.0 | 0.9468 | 1.1702 | 1.1666 |
| repvgg_a2 | 128 | 0.9641 | 0.9623 | 0.8288 | 1.1224 | 1.1692 | 1.1652 |
| poolformer_m36 | 64 | 0.9998 | 0.9993 | 0.0 | 0.0 | 1.1661 | 1.1475 |
| mixnet_l | 128 | 0.9849 | 0.8858 | 0.0 | 1.0185 | 1.1534 | 1.1505 |
| twins_pcpvt_base | 64 | 1.0001 | 0.9974 | 0.75 | 1.0624 | 1.148 | 1.1172 |
| swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9785 | 0.0 | 0.9932 | 1.1469 | 1.1322 |
| convnext_base | 64 | 0.9999 | 0.9988 | 0.0 | 1.0441 | 1.1157 | 1.1262 |
| beit_base_patch16_224 | 64 | 0.9998 | 0.9801 | 0.0 | 0.9504 | 1.1141 | 1.1053 |
| swsl_resnext101_32x16d | 32 | 1.0001 | 0.9988 | 0.0 | 1.1071 | 1.1068 | 1.0712 |
| deit_base_distilled_patch16_224 | 64 | 1.0 | 0.9995 | 0.7673 | 1.0156 | 1.0955 | 1.0834 |
| gluon_xception65 | 32 | 0.9998 | 0.9975 | 0.0 | 1.0403 | 1.0871 | 1.0759 |
| vit_base_patch16_224 | 64 | 1.0002 | 0.999 | 0.7662 | 0.9763 | 1.0855 | 1.0734 |
| mixer_b16_224 | 128 | 1.0006 | 1.0001 | 0.0 | 0.9771 | 1.0808 | 1.0736 |
| convmixer_768_32 | 32 | 0.9999 | 1.0002 | 0.0 | 1.0615 | 1.0783 | 1.0744 |
| gernet_l | 128 | 0.9744 | 0.9723 | 0.8239 | 1.0992 | 1.075 | 1.0704 |
| visformer_small | 128 | 1.0001 | 1.0022 | 0.797 | 1.0217 | 1.0495 | 1.0162 |
| resmlp_12_224 | 128 | 0.9999 | 1.001 | 0.6956 | 0.0 | 0.9499 | 0.9719 |
| tnt_s_patch16_224 | 128 | 1.0 | 0.9992 | 0.0 | 1.6263 | 0.0 | 1.5436 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| jx_nest_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| gluon_xception65 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| fbnetv3_b | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| twins_pcpvt_base | 64 | 2.064 | 13.0072 | 21.5012 | 42.855 | 431.1592 | 426.4103 |
| coat_lite_mini | 128 | 1.0194 | 5.4653 | 7.961 | 14.7686 | 362.4216 | 372.6703 |
| mobilevit_s | 64 | 1.5683 | 7.1641 | nan | 42.4621 | 233.8428 | 237.9062 |
| eca_halonext26ts | 128 | 1.4144 | 5.4751 | nan | 55.2357 | 204.8437 | 207.0974 |
| sebotnet33ts_256 | 64 | 1.7651 | 6.6709 | nan | 51.039 | 185.8238 | 191.2608 |
| eca_botnext26ts_256 | 128 | 1.3797 | 5.2911 | nan | 52.9221 | 179.8768 | 176.7545 |
| swin_base_patch4_window7_224 | 64 | 2.5123 | 12.7354 | nan | 58.0591 | 177.0112 | 174.7488 |
| xcit_large_24_p8_224 | 5 | 2.603 | 17.1709 | nan | nan | 172.3324 | 164.8544 |
| jx_nest_base | 32 | 1.6708 | 9.2321 | nan | 57.8786 | 155.4547 | 156.5451 |
| convnext_base | 64 | 1.2341 | 5.9929 | nan | 20.8438 | 133.0295 | 129.8216 |
| cait_m36_384 | 4 | 2.6486 | nan | nan | nan | 132.7509 | 130.12 |
| hrnet_w18 | 128 | 5.6217 | 31.9848 | nan | 251.7181 | 106.8258 | 100.7524 |
| botnet26t_256 | 128 | 1.3057 | 4.4635 | 10.0598 | 40.2751 | 106.2411 | 103.5341 |
| crossvit_9_240 | 128 | 1.3396 | 7.9862 | nan | 27.0701 | 97.9064 | 96.8689 |
| resnest101e | 64 | 2.998 | 16.9945 | nan | 78.2291 | 93.9541 | 89.7619 |
| pnasnet5large | 16 | 4.1626 | 22.9703 | nan | 123.7628 | 87.4338 | 84.1545 |
| volo_d1_224 | 64 | 1.1595 | 7.6273 | nan | 28.0879 | 85.2424 | 83.6849 |
| gmlp_s16_224 | 128 | 0.9511 | 6.2939 | nan | 13.365 | 71.7498 | 69.4367 |
| visformer_small | 128 | 0.9009 | 4.189 | 6.2793 | 24.3038 | 71.1462 | 69.6831 |
| pit_b_224 | 64 | 0.9339 | 4.8631 | nan | 12.5251 | 66.2774 | 65.1378 |
| res2net101_26w_4s | 64 | 2.9852 | 17.3432 | 28.4155 | 80.897 | 55.6027 | 52.0513 |
| gmixer_24_224 | 128 | 1.0133 | 7.3092 | nan | 16.5474 | 51.9895 | 50.5586 |
| convit_base | 64 | 0.9843 | 5.9421 | nan | 18.0525 | 50.9922 | 49.952 |
| res2net50_14w_8s | 128 | 2.5693 | 15.6494 | nan | 98.8662 | 50.8157 | 49.7271 |
| gluon_xception65 | 32 | 1.6885 | 11.1965 | nan | 41.7582 | 49.2318 | 45.5937 |
| poolformer_m36 | 64 | 1.8121 | 9.7062 | nan | nan | 47.0371 | 44.6651 |
| resmlp_12_224 | 128 | 0.6088 | 2.794 | 5.5064 | nan | 42.3381 | 38.0426 |
| swsl_resnext101_32x16d | 32 | 1.6289 | 10.0288 | nan | 39.6141 | 41.9677 | 41.3616 |
| dpn107 | 32 | 3.7727 | 14.7274 | 45.6394 | 76.1359 | 40.3245 | 37.6555 |
| mixer_b16_224 | 128 | 0.6548 | 3.2155 | nan | 10.7856 | 37.0102 | 35.4768 |
| deit_base_distilled_patch16_224 | 64 | 0.8289 | 4.303 | 6.6094 | 10.4203 | 36.0592 | 34.6956 |
| convmixer_768_32 | 32 | 1.0862 | 6.4498 | nan | 13.7196 | 35.8067 | 33.0945 |
| fbnetv3_b | 128 | 3.0734 | 11.1026 | 29.9803 | 76.0043 | 35.7771 | 33.8855 |
| vit_base_patch16_224 | 64 | 0.8583 | 4.1826 | 6.5315 | 9.6845 | 35.7583 | 35.0589 |
| gluon_inception_v3 | 128 | 1.4815 | 8.9849 | nan | 66.9443 | 35.0345 | 32.4497 |
| inception_v3 | 128 | 1.4787 | 9.0238 | nan | 67.1459 | 34.8548 | 32.5473 |
| adv_inception_v3 | 128 | 1.4876 | 8.9769 | nan | 66.9311 | 34.3905 | 32.5332 |
| tf_mixnet_l | 128 | 5.7484 | 13.3541 | nan | 68.7911 | 33.8729 | 32.1963 |
| ghostnet_100 | 128 | 2.6432 | 9.6507 | 13.7666 | 58.927 | 32.695 | 30.8681 |
| beit_base_patch16_224 | 64 | 1.0871 | 5.6134 | nan | 13.7621 | 32.6318 | 30.8008 |
| mixnet_l | 128 | 5.3204 | 12.7271 | nan | 67.9763 | 32.5983 | 31.893 |
| dm_nfnet_f0 | 128 | 2.0094 | 7.6042 | nan | 29.9754 | 32.3805 | 29.3454 |
| dla102 | 128 | 1.6603 | 10.0975 | nan | 63.1714 | 32.1124 | 30.2312 |
| res2next50 | 128 | 1.4989 | 8.7791 | nan | 66.7002 | 29.6202 | 27.9053 |
| rexnet_100 | 128 | 1.8062 | 7.4568 | nan | 102.1027 | 26.5523 | 25.3591 |
| tinynet_a | 128 | 1.9614 | 8.2078 | 20.2872 | 61.7507 | 25.7941 | 24.6542 |
| cspdarknet53 | 64 | 2.2264 | 7.7188 | 20.8213 | 48.0307 | 23.2515 | 22.0433 |
| nfnet_l0 | 128 | 1.7245 | 7.5828 | nan | 27.3095 | 23.1165 | 21.8966 |
| tf_efficientnet_b0 | 128 | 1.7202 | 6.9673 | nan | 61.9316 | 22.7574 | 21.5149 | | fbnetc_100 | 128 | 1.9567 | 6.9499 | 18.078 | 45.3002 | 21.9517 | 20.7368 | | spnasnet_100 | 128 | 1.9161 | 6.665 | 17.4815 | 43.4797 | 21.4795 | 20.4556 | | mobilenetv3_large_100 | 128 | 1.5899 | 5.5688 | 13.4352 | 64.4429 | 19.9372 | 19.5642 | | mnasnet_100 | 128 | 1.6356 | 5.5127 | 14.0767 | 37.4665 | 18.8558 | 18.0133 | | mobilenetv2_100 | 128 | 1.6442 | 5.4933 | 13.7945 | 37.5793 | 18.5669 | 17.7858 | | gernet_l | 128 | 1.8816 | 6.4469 | 16.2236 | 35.9904 | 18.4345 | 17.2115 | | repvgg_a2 | 128 | 1.8567 | 6.1905 | 15.7371 | 43.751 | 17.9569 | 16.9557 | | regnety_002 | 128 | 1.4855 | 5.8417 | 13.8786 | 46.2472 | 17.8219 | 17.3541 | | selecsls42b | 128 | 0.7717 | 4.0352 | 5.8995 | 39.8612 | 16.4046 | 15.3492 | | lcnet_050 | 128 | 0.9705 | 3.4278 | 7.1291 | 31.167 | 13.6937 | 12.51 | | ese_vovnet19b_dw | 128 | 0.9768 | 3.251 | 6.9304 | 30.8107 | 12.7375 | 11.8284 | | tnt_s_patch16_224 | 128 | 1.4723 | 10.2065 | nan | 22.8828 | nan | 50.0197 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 0.9859 | 1.5612 | 1.6333 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.7823 | 1.351 | 1.3692 | | nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.8084 | 1.2908 | 1.3392 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 0.8682 | 1.2619 | 1.2765 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.8401 | 1.1889 | 1.199 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.2062 | 
1.1876 | 1.3282 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | 0.7405 | 1.1793 | 1.2286 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | 0.7612 | 1.1378 | 1.2076 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | 0.7643 | 1.1375 | 1.2068 | | cait_m36_384 | 4 | 0.9994 | nan | nan | nan | 1.1185 | 1.1745 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.7635 | 1.1003 | 1.1104 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.069 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.9479 | 1.0218 | 1.0495 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.8606 | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.95 | 0.9994 | 1.0025 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 0.8229 | 0.997 | 1.0835 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 0.8242 | 0.9925 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | 0.8403 | 0.9888 | 1.0866 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9345 | 0.9853 | 1.0102 | | mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 0.8571 | 0.985 | 1.0538 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | 0.9793 | 0.9836 | 0.9853 | | volo_d1_224 | 64 | 0.996 | 0.9213 | nan | 0.7472 | 0.9799 | 0.9971 | | gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 0.9704 | 0.9766 | 0.9827 | | tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9711 | 1.0812 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.784 | 0.9696 | 0.977 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | nan | nan | 0.9611 | 1.0549 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.7604 | 0.9576 | 0.9855 | | dla102 | 128 | 0.9831 | 0.917 | nan | 0.9529 | 0.9496 | 0.9538 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8649 | 0.9376 | 0.9419 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8982 | 0.9351 | 0.9376 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9269 | 0.9548 | | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | 0.7112 | 0.9187 | 1.0509 | | 
ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9302 | 0.9095 | 0.9161 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | 0.83 | 0.9068 | 1.0518 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.8941 | 0.9058 | 0.956 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.8618 | 0.9051 | 0.9312 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9047 | 0.9157 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9014 | 1.0067 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8745 | 0.9007 | 0.9126 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | 0.9475 | 0.9006 | 0.951 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8954 | 0.899 | 0.9192 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.8961 | 0.9077 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | 0.9249 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7524 | 0.8921 | 0.923 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8762 | 0.8835 | 0.8875 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8611 | 0.881 | 0.9327 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7599 | 0.8617 | 0.8993 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | 0.745 | 0.8605 | 0.8702 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 0.6417 | 0.8417 | 1.0633 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.8416 | 0.8498 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | 0.6831 | 0.841 | 0.9711 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.7873 | 0.8404 | 1.0528 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | nan | 0.8169 | 0.8253 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.8234 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6573 | 0.7684 | 0.8011 | | convit_base | 64 | 
0.9977 | 0.8838 | nan | 0.9506 | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8657 | nan | 0.7297 | 0.6496 | 0.8704 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | 0.8539 | nan | 0.8623 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

../test-dynamo-runner-logs/huggingface_float32.png : ![](https://i.imgur.com/gV40GxJ.png)

../test-dynamo-runner-logs/timm_models_float32.png : ![](https://i.imgur.com/bOZnrbr.png)

../test-dynamo-runner-logs/torchbench_float32.png : ![](https://i.imgur.com/IahBxx3.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 71%, 54/76 | 51%, 43/84  | 76%, 61/80  |
|       aot_eager        | 70%, 53/76 | 51%, 43/84  | 75%, 60/80  |
|     aot_cudagraphs     | 53%, 40/76 | 24%, 20/84  | 30%, 24/80  |
|      aot_nvfuser       | 43%, 33/76 |  1%, 1/84   | 71%, 57/80  |
|        inductor        | 66%, 50/76 | 50%, 42/84  | 75%, 60/80  |
| inductor_no_cudagraphs | 68%, 52/76 | 50%, 42/84  | 76%, 61/80  |
+------------------------+------------+-------------+-------------+
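
For anyone reproducing the passrate cells, here is a minimal sketch of how a cell such as `71%, 54/76` can be formatted from per-model accuracy statuses. The helper name `pass_rate` and the choice to count `pass_due_to_skip` as passing are assumptions made for illustration; this is not the dashboard's own code.

```python
# Hypothetical helper, not taken from the benchmark runner.
# Assumption: a skipped accuracy check ("pass_due_to_skip") counts as a pass.
PASSING = {"pass", "pass_due_to_skip"}


def pass_rate(statuses, passing=PASSING):
    """Format a pass rate as '<percent>%, <passed>/<total>'."""
    total = len(statuses)
    if total == 0:
        raise ValueError("need at least one model status")
    passed = sum(1 for status in statuses if status in passing)
    return f"{passed / total:.0%}, {passed}/{total}"


# 54 passing models out of 76, matching the eager/torchbench cell above.
example = ["pass"] * 54 + ["fail_to_run"] * 22
print(pass_rate(example))  # prints "71%, 54/76"
```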

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.09x    |    1.02x    |    1.00x    |
|      aot_nvfuser       |   1.13x    |    1.12x    |    1.11x    |
|        inductor        |   1.47x    |    1.28x    |    1.25x    |
| inductor_no_cudagraphs |   1.23x    |    1.21x    |    1.24x    |
+------------------------+------------+-------------+-------------+
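
The geometric mean above is taken after dropping models that fail the accuracy check. A minimal sketch of that aggregation follows. The function name and the extra rule of skipping `0.0` speedups (which the per-model tables use for runs that produced no timing) are assumptions, and the four sample rows are copied from the torchbench inductor column of this report.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of speedups over models whose accuracy status is 'pass'.

    speedups: dict of model name -> speedup vs. eager
    accuracy: dict of model name -> accuracy status string
    Assumption: a speedup of 0.0 means "no measurement" and is skipped too.
    """
    kept = [
        value
        for name, value in speedups.items()
        if accuracy.get(name) == "pass" and value > 0.0
    ]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(value) for value in kept) / len(kept))


# Sample rows from the torchbench inductor column.
speedups = {
    "densenet121": 5.4438,
    "resnet18": 1.8428,
    "tacotron2": 0.0,
    "resnet50_quantized_qat": 1.401,
}
accuracy = {
    "densenet121": "pass",
    "resnet18": "pass",
    "tacotron2": "fail_to_run",
    "resnet50_quantized_qat": "fail_accuracy",
}
print(f"{geomean_speedup(speedups, accuracy):.2f}x")
```

Only `densenet121` and `resnet18` survive the filter here, so the result is the square root of their product.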

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.08    |    2.22     |    1.88     |
|       aot_eager        |    6.96    |    9.05     |    8.70     |
|     aot_cudagraphs     |    8.23    |    18.64    |    15.25    |
|      aot_nvfuser       |   21.02    |    9.60     |    49.80    |
|        inductor        |   61.02    |    52.88    |    73.59    |
| inductor_no_cudagraphs |   63.42    |    49.20    |    71.75    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.96x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.86x    |    0.91x    |    0.88x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.32x    |
|      aot_nvfuser       |   0.83x    |    1.08x    |    0.84x    |
|        inductor        |   0.85x    |    0.72x    |    0.97x    |
| inductor_no_cudagraphs |   0.96x    |    0.96x    |    1.02x    |
+------------------------+------------+-------------+-------------+
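
To make "higher is better" concrete: a compression ratio is read here as the eager peak memory divided by the backend's peak memory, so values below 1.00x mean the backend used more memory than eager. The sketch below assumes that definition; the function name and the byte counts in the example are hypothetical.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Peak memory compression ratio, assuming ratio = eager peak / backend peak.

    > 1.0: the backend's peak footprint is smaller than eager's.
    < 1.0: the backend's peak footprint is larger than eager's.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical measurements: eager peaks at 10 GiB, the backend at 12.5 GiB.
GIB = 1024 ** 3
ratio = compression_ratio(10 * GIB, 12.5 * GIB)
print(f"{ratio:.2f}x")  # prints "0.80x"
```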

Warnings

Performance speedup warnings

~~~
+-------------+------------------------+----------+------------------------+
| suite       | name                   | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench  | lennard_jones          | 1.818    | 0.9452                 |
| torchbench  | dlrm                   | 1.0006   | 0.0                    |
| torchbench  | nvidia_deeprecommender | 0.904    | 0.9643                 |
| torchbench  | hf_GPT2_large          | 0.0      | 1.3706                 |
| torchbench  | hf_T5                  | 0.0      | 1.5515                 |
| torchbench  | tacotron2              | 0.0      | 0.9362                 |
| torchbench  | hf_Longformer          | 0.0      | 0.0                    |
| torchbench  | moco                   | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase  | 0.0      | 0.0                    |
| timm_models | resmlp_12_224          | 0.9499   | 0.9719                 |
| timm_models | tnt_s_patch16_224      | 0.0      | 1.5436                 |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+-------------+-----------------------------------+----------+------------------------+
| suite       | name                              | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------+----------+------------------------+
| torchbench  | timm_efficientdet                 | 484.0577 | 488.767                |
| torchbench  | yolov3                            | 419.4861 | 419.8955               |
| torchbench  | hf_T5_large                       | 205.3317 | 202.2279               |
| torchbench  | timm_vision_transformer           | 153.43   | 160.5928               |
| torchbench  | speech_transformer                | 152.3735 | 147.9389               |
| torchbench  | timm_resnest                      | 150.1654 | 145.0659               |
| torchbench  | attention_is_all_you_need_pytorch | 137.7387 | 139.7203               |
| torchbench  | timm_vision_transformer_large     | 126.2802 | 123.9619               |
| torchbench  | dlrm                              | 3.4517   | nan                    |
| torchbench  | hf_GPT2_large                     | nan      | 143.1625               |
| torchbench  | tacotron2                         | nan      | 106.378                |
| torchbench  | hf_T5                             | nan      | 44.804                 |
| torchbench  | hf_Longformer                     | nan      | nan                    |
| torchbench  | moco                              | nan      | nan                    |
| huggingface | XGLMForCausalLM                   | 203.4086 | 201.0863               |
| huggingface | DebertaForMaskedLM                | 163.7151 | 106.9608               |
| huggingface | DebertaForQuestionAnswering       | 152.0741 | 118.2059               |
| huggingface | M2M100ForConditionalGeneration    | 128.0751 | 124.2115               |
| huggingface | AllenaiLongformerBase             | nan      | nan                    |
| timm_models | twins_pcpvt_base                  | 431.1592 | 426.4103               |
| timm_models | coat_lite_mini                    | 362.4216 | 372.6703               |
| timm_models | mobilevit_s                       | 233.8428 | 237.9062               |
| timm_models | eca_halonext26ts                  | 204.8437 | 207.0974               |
| timm_models | sebotnet33ts_256                  | 185.8238 | 191.2608               |
| timm_models | eca_botnext26ts_256               | 179.8768 | 176.7545               |
| timm_models | swin_base_patch4_window7_224      | 177.0112 | 174.7488               |
| timm_models | xcit_large_24_p8_224              | 172.3324 | 164.8544               |
| timm_models | jx_nest_base                      | 155.4547 | 156.5451               |
| timm_models | convnext_base                     | 133.0295 | 129.8216               |
| timm_models | cait_m36_384                      | 132.7509 | 130.12                 |
| timm_models | tnt_s_patch16_224                 | nan      | 50.0197                |
+-------------+-----------------------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 0.9697                 |
| torchbench  | speech_transformer                      | 0.896    | 0.8996                 |
| torchbench  | pytorch_CycleGAN_and_pix2pix            | 0.8848   | 0.9654                 |
| torchbench  | hf_Albert                               | 0.8836   | 1.2215                 |
| torchbench  | mobilenet_v3_large                      | 0.8829   | 0.8964                 |
| torchbench  | hf_T5_large                             | 0.8737   | 0.922                  |
| torchbench  | timm_vision_transformer_large           | 0.8616   | 1.0285                 |
| torchbench  | pytorch_unet                            | 0.859    | 0.8608                 |
| torchbench  | resnet50                                | 0.8564   | 0.8913                 |
| torchbench  | densenet121                             | 0.8562   | 0.9307                 |
| torchbench  | mnasnet1_0                              | 0.8531   | 0.8659                 |
| torchbench  | hf_Bart                                 | 0.8503   | 1.1284                 |
| torchbench  | fastNLP_Bert                            | 0.8354   | 1.0952                 |
| torchbench  | resnext50_32x4d                         | 0.8303   | 0.8352                 |
| torchbench  | BERT_pytorch                            | 0.825    | 1.0689                 |
| torchbench  | hf_BigBird                              | 0.8211   | 1.0393                 |
| torchbench  | dcgan                                   | 0.767    | 0.7903                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | timm_vision_transformer                 | 0.7478   | 0.8187                 |
| torchbench  | alexnet                                 | 0.743    | 0.8332                 |
| torchbench  | timm_vovnet                             | 0.7286   | 0.7339                 |
| torchbench  | LearningToPaint                         | 0.7133   | 0.7462                 |
| torchbench  | hf_Bert                                 | 0.7048   | 0.985                  |
| torchbench  | dlrm                                    | 0.7035   | nan                    |
| torchbench  | resnet18                                | 0.6902   | 0.7049                 |
| torchbench  | hf_DistilBert                           | 0.6596   | 0.9466                 |
| torchbench  | vgg16                                   | 0.6471   | 0.6497                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4682   | 0.6183                 |
| torchbench  | pytorch_struct                          | 0.4222   | 0.429                  |
| torchbench  | functorch_dp_cifar10                    | 0.4056   | 0.4212                 |
| torchbench  | hf_Reformer                             | 0.299    | 0.9882                 |
| torchbench  | hf_T5                                   | nan      | 1.1507                 |
| torchbench  | tacotron2                               | nan      | 1.1496                 |
| torchbench  | hf_GPT2_large                           | nan      | 1.1258                 |
| torchbench  | hf_Longformer                           | nan      | nan                    |
| torchbench  | moco                                    | nan      | nan                    |
| huggingface | AlbertForQuestionAnswering              | 0.8646   | 1.4039                 |
| huggingface | T5Small                                 | 0.8564   | 1.0758                 |
| huggingface | PegasusForConditionalGeneration         | 0.8436   | 1.0204                 |
| huggingface | AlbertForMaskedLM                       | 0.842    | 1.3737                 |
| huggingface | BigBird                                 | 0.8224   | 1.0108                 |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.1049                 |
| huggingface | DistillGPT2                             | 0.8173   | 0.9383                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.9642                 |
| huggingface | YituTechConvBert                        | 0.808    | 0.8738                 |
| huggingface | BartForConditionalGeneration            | 0.7817   | 0.9515                 |
| huggingface | PegasusForCausalLM                      | 0.7774   | 0.9692                 |
| huggingface | M2M100ForConditionalGeneration          | 0.7712   | 1.016                  |
| huggingface | GoogleFnet                              | 0.7698   | 0.9373                 |
| huggingface | MT5ForConditionalGeneration             | 0.7623   | 0.9396                 |
| huggingface | MegatronBertForQuestionAnswering        | 0.7528   | 0.9646                 |
| huggingface | CamemBert                               | 0.7492   | 0.9186                 |
| huggingface | PLBartForConditionalGeneration          | 0.7397   | 0.9638                 |
| huggingface | PLBartForCausalLM                       | 0.7381   | 0.9055                 |
| huggingface | MBartForConditionalGeneration           | 0.7209   | 0.9059                 |
| huggingface | LayoutLMForSequenceClassification       | 0.7189   | 1.0246                 |
| huggingface | MegatronBertForCausalLM                 | 0.7161   | 0.9248                 |
| huggingface | BartForCausalLM                         | 0.7149   | 0.9466                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.7147   | 0.8647                 |
| huggingface | ElectraForQuestionAnswering             | 0.7054   | 1.0298                 |
| huggingface | DistilBertForQuestionAnswering          | 0.6981   | 0.9303                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977   | 0.946                  |
| huggingface | LayoutLMForMaskedLM                     | 0.695    | 0.9772                 |
| huggingface | MBartForCausalLM                        | 0.6836   | 0.8978                 |
| huggingface | TrOCRForCausalLM                        | 0.6827   | 0.8876                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.6775   | 0.8801                 |
| huggingface | OPTForCausalLM                          | 0.6761   | 0.8847                 |
| huggingface | ElectraForCausalLM                      | 0.6731   | 0.905                  |
| huggingface | DistilBertForMaskedLM                   | 0.6531   | 0.9124                 |
| huggingface | BertForMaskedLM                         | 0.6385   | 0.8993                 |
| huggingface | RobertaForCausalLM                      | 0.6375   | 0.8975                 |
| huggingface | RobertaForQuestionAnswering             | 0.6329   | 0.8939                 |
| huggingface | BertForQuestionAnswering                | 0.6329   | 0.8939                 |
| huggingface | MobileBertForMaskedLM                   | 0.5256   | 0.7111                 |
| huggingface | MobileBertForQuestionAnswering          | 0.4536   | 0.5968                 |
| huggingface | DebertaForMaskedLM                      | 0.4267   | 1.0347                 |
| huggingface | DebertaForQuestionAnswering             | 0.3264   | 1.1588                 |
| huggingface | AllenaiLongformerBase                   | nan      | nan                    |
| timm_models | selecsls42b                             | 0.899    | 0.9192                 |
| timm_models | adv_inception_v3                        | 0.8983   | 0.9073                 |
| timm_models | gluon_inception_v3                      | 0.8983   | 0.9073                 |
| timm_models | inception_v3                            | 0.8983   | 0.9073                 |
| timm_models | mnasnet_100                             | 0.8961   | 0.9077                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8931   | 0.9249                 |
| timm_models | lcnet_050                               | 0.8921   | 0.923                  |
| timm_models | cspdarknet53                            | 0.8835   | 0.8875                 |
| timm_models | res2net50_14w_8s                        | 0.881    | 0.9327                 |
| timm_models | regnety_002                             | 0.8617   | 0.8993                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.8702                 |
| timm_models | pit_b_224                               | 0.8417   | 1.0633                 |
| timm_models | fbnetc_100                              | 0.8416   | 0.8498                 |
| timm_models | sebotnet33ts_256                        | 0.841    | 0.9711                 |
| timm_models | coat_lite_mini                          | 0.8404   | 1.0528                 |
| timm_models | resmlp_12_224                           | 0.8169   | 0.8253                 |
| timm_models | gernet_l                                | 0.7928   | 0.8234                 |
| timm_models | repvgg_a2                               | 0.7684   | 0.8011                 |
| timm_models | convit_base                             | 0.7463   | 0.9008                 |
| timm_models | crossvit_9_240                          | 0.6496   | 0.8704                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.8623                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
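
The report does not state the cutoffs behind the three warning tables. The rows that appear are consistent with flagging a model when either inductor column has speedup below 0.95, compilation latency above 120 seconds, peak memory compression ratio below 0.9, or a missing (`nan`) value. The sketch below encodes that reading; the thresholds and the `is_flagged` helper are inferred assumptions, not the dashboard's actual configuration.

```python
import math

# Thresholds inferred from which rows appear in the warning tables;
# they are assumptions, not values stated in the report.
SPEEDUP_MIN = 0.95
COMPILE_SECONDS_MAX = 120.0
MEMORY_RATIO_MIN = 0.9


def is_flagged(values, threshold, higher_is_better=True):
    """Return True if any value is NaN or falls on the wrong side of threshold."""
    for value in values:
        if math.isnan(value):
            return True
        if higher_is_better and value < threshold:
            return True
        if not higher_is_better and value > threshold:
            return True
    return False


# (inductor, inductor_no_cudagraphs) pairs copied from the torchbench tables.
print(is_flagged([1.818, 0.9452], SPEEDUP_MIN))              # lennard_jones speedup
print(is_flagged([1.4185, 0.9565], SPEEDUP_MIN))             # soft_actor_critic speedup
print(is_flagged([107.0355, 104.0851], COMPILE_SECONDS_MAX,
                 higher_is_better=False))                    # pytorch_stargan latency
print(is_flagged([0.8982, 0.9697], MEMORY_RATIO_MIN))        # timm_resnest memory
```

Under these assumed cutoffs, `lennard_jones` and `timm_resnest` are flagged while `soft_actor_critic` and `pytorch_stargan` are not, which matches their presence or absence in the warning tables.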

Accuracy Regressions

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ | densenet121 | 4 | 1.0028 | 0.9993 | 2.3219 | 1.443 | 5.4438 | 1.3058 | | timm_efficientdet | 1 | 0.9824 | 0.8845 | 0.0 | 0.0 | 4.2758 | 1.526 | | functorch_dp_cifar10 | 64 | 1.0024 | 0.9777 | 2.1532 | 1.1969 | 3.6923 | 1.2407 | | timm_vision_transformer | 8 | 1.0068 | 0.9447 | 1.5339 | 1.3578 | 2.5716 | 1.4121 | | drq | 1 | 1.0315 | 0.8503 | 1.3708 | 1.0638 | 2.4195 | 1.0737 | | resnext50_32x4d | 8 | 1.0007 | 1.079 | 1.2092 | 1.3669 | 2.0959 | 1.2162 | | mobilenet_v3_large | 32 | 1.0078 | 1.1087 | 1.0365 | 1.3781 | 1.9864 | 1.3795 | | BERT_pytorch | 16 | 1.0104 | 0.8854 | 0.0 | 0.0 | 1.9168 | 1.9012 | | resnet18 | 16 | 1.006 | 1.1021 | 1.168 | 1.3958 | 1.8428 | 1.2045 | | pytorch_struct | 200 | 0.9977 | 0.7381 | 0.8734 | 0.8906 | 1.827 | 1.1633 | | lennard_jones | 1000 | 0.976 | 0.8293 | 1.0524 | 1.0142 | 1.818 | 0.9452 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9968 | 0.9377 | 1.2471 | 1.1785 | 1.7636 | 1.3013 | | squeezenet1_1 | 32 | 0.9979 | 0.9923 | 1.0527 | 1.1557 | 1.7406 | 1.2709 | | hf_Albert | 8 | 1.0015 | 0.9976 | 0.752 | 0.0 | 1.6466 | 1.6414 | | dcgan | 32 | 0.9829 | 1.0102 | 1.2585 | 1.1788 | 1.6306 | 1.0725 | | hf_T5_large | 2 | 1.0248 | 0.9068 | 0.0 | 0.0 | 1.5833 | 1.5731 | | speech_transformer | 32 | 1.0038 | 0.9068 | 0.0 | 0.0 | 1.5684 | 1.544 | | shufflenet_v2_x1_0 | 128 | 1.0005 | 1.0532 | 0.8062 | 1.1931 | 1.53 | 1.3689 | | timm_resnest | 32 | 0.9996 | 1.0027 | 0.8044 | 1.1815 | 1.5191 | 1.4517 | | timm_nfnet | 128 | 0.9993 | 0.9999 | 0.0 | 1.2122 | 1.4726 | 1.4222 | | mnasnet1_0 | 32 | 0.9993 | 1.0945 | 0.8568 | 1.2932 | 1.4577 | 1.2734 | | 
mobilenet_v2_quantized_qat | 96 | 1.0016 | 0.978 | 0.0 | 0.0 | 1.4527 | 1.4479 | | mobilenet_v2 | 96 | 0.9998 | 1.0003 | 0.7313 | 1.0443 | 1.4287 | 1.4088 | | hf_GPT2 | 4 | 1.0046 | 0.9827 | 0.738 | 0.0 | 1.4239 | 1.4306 | | soft_actor_critic | 256 | 0.9921 | 0.7715 | 1.1241 | 0.9985 | 1.4185 | 0.9565 | | resnet50_quantized_qat | 32 | 1.0019 | 0.9619 | 0.0 | 0.0 | 1.401 | 1.3947 | | fastNLP_Bert | 6 | 0.9997 | 0.9761 | 0.7528 | 0.0 | 1.3686 | 1.3445 | | timm_efficientnet | 32 | 0.9551 | 0.8076 | 0.7031 | 1.0629 | 1.3353 | 1.2011 | | LearningToPaint | 96 | 1.0048 | 1.0586 | 0.8687 | 1.2057 | 1.2627 | 1.2074 | | pytorch_unet | 1 | 1.0001 | 0.9982 | 0.8464 | 1.0765 | 1.2042 | 1.1861 | | resnet50 | 32 | 0.9994 | 0.9937 | 0.7608 | 1.1612 | 1.204 | 1.1695 | | Super_SloMo | 6 | 1.0003 | 0.9974 | 0.8669 | 0.0 | 1.18 | 1.1645 | | hf_Bart | 4 | 1.0127 | 0.9757 | 0.0 | 0.0 | 1.1721 | 1.1653 | | vgg16 | 64 | 1.0 | 0.999 | 0.859 | 0.9973 | 1.1707 | 1.1652 | | alexnet | 128 | 0.9991 | 0.998 | 0.8031 | 1.0004 | 1.163 | 1.1651 | | hf_Bert | 4 | 1.0214 | 0.944 | 0.7306 | 0.0 | 1.1575 | 1.1396 | | hf_DistilBert | 8 | 0.9999 | 0.9569 | 0.6872 | 0.0 | 1.1481 | 1.1546 | | timm_regnet | 32 | 0.9653 | 0.9617 | 0.7795 | 1.096 | 1.1283 | 1.0941 | | pytorch_stargan | 16 | 0.9997 | 0.983 | 0.866 | 0.9896 | 1.1189 | 1.0913 | | Background_Matting | 4 | 1.0006 | 1.0218 | 0.866 | 1.0816 | 1.1153 | 1.1069 | | hf_Reformer | 4 | 0.9961 | 0.0 | 0.9267 | 0.0 | 1.1095 | 1.1343 | | hf_BigBird | 2 | 0.9915 | 0.939 | 0.9612 | 0.0 | 1.0921 | 1.0042 | | yolov3 | 16 | 1.0 | 0.9954 | 0.7893 | 1.1839 | 1.0795 | 1.0647 | | attention_is_all_you_need_pytorch | 256 | 0.9999 | 0.9726 | 0.0 | 0.0 | 1.047 | 1.033 | | timm_vision_transformer_large | 8 | 0.9982 | 0.9912 | 0.0 | 0.9805 | 1.044 | 1.0331 | | tts_angular | 64 | 0.9937 | 0.964 | 0.9933 | 1.0231 | 1.0136 | 1.0218 | | timm_vovnet | 32 | 0.9102 | 0.9045 | 0.7132 | 0.9774 | 1.0069 | 1.0176 | | dlrm | 2048 | 1.0064 | 1.0734 | 0.0 | 0.0 | 1.0006 | 0.0 | | demucs 
| 4 | 0.9997 | 0.9998 | 0.999 | 0.9999 | 1.0 | 1.0007 | | nvidia_deeprecommender | 256 | 0.9994 | 0.9628 | 0.585 | 0.942 | 0.904 | 0.9643 | | hf_GPT2_large | 4 | 1.0004 | 0.9805 | 0.0 | 0.0 | 0.0 | 1.3706 | | hf_T5 | 8 | 1.0002 | 0.9932 | 0.0 | 0.0 | 0.0 | 1.5515 | | tacotron2 | 64 | 0.981 | 0.8581 | 0.0 | 0.0 | 0.0 | 0.9362 | | hf_Longformer | 2 | 0.9701 | 0.9013 | 0.8196 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_BigBird | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | 
fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | yolov3 | 2 | pass | pass | pass | fail_to_run | pass | pass | | BERT_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | fail_to_run | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | 
pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_to_run | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run | fail_to_run | 0.0000 | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+ | timm_efficientdet | 1 | 19.5344 | 38.4011 | nan | nan | 484.0577 | 488.767 | | yolov3 | 16 | 2.7711 | 8.6894 | 11.9084 | 43.4046 | 419.4861 | 419.8955 | | hf_T5_large | 
2 | 13.2998 | 41.15 | nan | nan | 205.3317 | 202.2279 | | timm_vision_transformer | 8 | 0.7808 | 4.1474 | 5.8215 | 9.3655 | 153.43 | 160.5928 | | speech_transformer | 32 | 1.5424 | 8.2938 | nan | nan | 152.3735 | 147.9389 | | timm_resnest | 32 | 0.5383 | 2.6812 | 3.7424 | 35.1306 | 150.1654 | 145.0659 | | attention_is_all_you_need_pytorch | 256 | 1.0734 | 7.1292 | nan | nan | 137.7387 | 139.7203 | | timm_vision_transformer_large | 8 | 2.223 | 13.8751 | nan | 24.351 | 126.2802 | 123.9619 | | pytorch_stargan | 16 | 0.3789 | 2.3643 | 3.1326 | 3.9188 | 107.0355 | 104.0851 | | pytorch_struct | 200 | 0.2366 | 0.7827 | 1.3456 | 4.0715 | 99.505 | 98.1575 | | BERT_pytorch | 16 | 1.4194 | 7.614 | nan | nan | 92.0393 | 92.0811 | | fastNLP_Bert | 6 | 1.4306 | 6.6169 | 10.0451 | nan | 65.652 | 63.418 | | hf_GPT2 | 4 | 1.2488 | 6.1179 | 8.8738 | nan | 63.5447 | 63.521 | | hf_Bart | 4 | 1.3924 | 8.089 | nan | nan | 49.9676 | 49.9717 | | densenet121 | 4 | 1.9897 | 13.3477 | 20.1678 | 88.3763 | 45.0957 | 43.7205 | | mobilenet_v3_large | 32 | 0.8275 | 4.8204 | 6.7604 | 53.5764 | 44.9158 | 46.9735 | | hf_Albert | 8 | 1.0066 | 5.8746 | 8.5532 | nan | 41.987 | 41.132 | | hf_BigBird | 2 | 7.3861 | 13.5387 | 29.953 | nan | 41.2734 | 26.6352 | | resnet50_quantized_qat | 32 | 1.061 | 9.0448 | nan | nan | 39.8902 | 40.3176 | | hf_Bert | 4 | 1.312 | 6.2693 | 8.8293 | nan | 39.8395 | 38.7377 | | timm_regnet | 32 | 2.173 | 8.4238 | 20.7651 | 47.6157 | 37.2439 | 35.16 | | hf_Reformer | 4 | 2.3483 | nan | 9.1124 | nan | 36.065 | 30.7238 | | timm_efficientnet | 32 | 1.6787 | 6.665 | 16.1146 | 52.4346 | 34.2419 | 34.4653 | | mnasnet1_0 | 32 | 0.7461 | 4.4921 | 6.4014 | 30.714 | 31.0909 | 30.7546 | | resnet50 | 32 | 0.7937 | 4.9477 | 6.925 | 32.2699 | 31.0875 | 29.832 | | hf_DistilBert | 8 | 0.4278 | 3.0834 | 6.0696 | nan | 30.4362 | 29.5285 | | resnext50_32x4d | 8 | 0.8239 | 4.9203 | 6.8365 | 28.5464 | 30.2931 | 30.0266 | | timm_vovnet | 32 | 1.4222 | 4.5909 | 10.441 | 23.5649 | 30.0127 | 29.7463 
|
| timm_nfnet | 128 | 1.8844 | 7.7171 | nan | 29.8502 | 29.8712 | 28.8763 |
| mobilenet_v2_quantized_qat | 96 | 1.1759 | 8.8754 | nan | nan | 27.0997 | 27.2946 |
| functorch_dp_cifar10 | 64 | 0.3232 | 1.9699 | 2.8309 | 5.5366 | 26.1947 | 24.9937 |
| resnet18 | 16 | 0.3858 | 1.8912 | 2.6752 | 17.5591 | 23.2902 | 20.4971 |
| shufflenet_v2_x1_0 | 128 | 0.8656 | 5.4261 | 7.6883 | 26.8524 | 18.5748 | 17.9867 |
| Super_SloMo | 6 | 0.9695 | 5.0542 | 6.7627 | nan | 17.3419 | 16.4668 |
| Background_Matting | 4 | 0.6979 | 4.5367 | 6.7144 | 29.2894 | 16.7635 | 16.0163 |
| mobilenet_v2 | 96 | 0.7343 | 4.4782 | 6.6781 | 37.1045 | 16.669 | 16.3002 |
| pytorch_unet | 1 | 0.4223 | 2.1063 | 2.9975 | 19.6418 | 8.2272 | 7.7305 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3535 | 2.202 | 3.0539 | 3.8439 | 8.1719 | 8.0926 |
| LearningToPaint | 96 | 0.4124 | 1.9651 | 2.8324 | 23.8303 | 7.2019 | 6.8944 |
| squeezenet1_1 | 32 | 0.2563 | 0.9557 | 1.3863 | 4.5328 | 4.0598 | 3.8616 |
| nvidia_deeprecommender | 256 | 0.1895 | 0.4298 | 0.6854 | 2.4393 | 4.0142 | 3.7143 |
| drq | 1 | 0.1402 | 0.4424 | 0.8198 | 3.4662 | 3.7694 | 3.1945 |
| vgg16 | 64 | 0.1869 | 0.6441 | 1.0464 | 2.4609 | 3.6811 | 3.2422 |
| dlrm | 2048 | 0.4444 | 0.8198 | nan | nan | 3.4517 | nan |
| soft_actor_critic | 256 | 0.2031 | 0.3372 | 0.4948 | 1.5206 | 3.0611 | 2.6231 |
| alexnet | 128 | 0.1421 | 0.4161 | 0.6606 | 2.3558 | 2.9654 | 2.6911 |
| dcgan | 32 | 0.1641 | 0.4494 | 0.6683 | 3.7309 | 2.678 | 2.4053 |
| lennard_jones | 1000 | 0.1381 | 0.289 | 0.4429 | 1.0648 | 1.9631 | 1.736 |
| tts_angular | 64 | 0.2061 | 0.2786 | 0.3976 | 1.0162 | 1.8605 | 1.6749 |
| demucs | 4 | 0.2929 | 0.2934 | 0.2977 | 0.2969 | 0.2011 | 0.1967 |
| hf_GPT2_large | 4 | 4.9818 | 19.3363 | nan | nan | nan | 143.1625 |
| tacotron2 | 64 | 16.7009 | 28.6252 | nan | nan | nan | 106.378 |
| hf_T5 | 8 | 2.1787 | 9.4406 | nan | nan | nan | 44.804 |
| hf_Longformer | 2 | 5.7342 | 13.862 | 78.3703 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | nan | 1.4314 | 1.4314 |
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | nan | 1.4036 | 1.4036 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2637 | 0.7837 | 1.3107 | 1.3377 |
| Super_SloMo | 6 | 1.0024 | 0.9527 | 0.363 | nan | 1.1858 | 1.1912 |
| timm_efficientdet | 1 | 1.0111 | 0.823 | nan | nan | 1.1165 | 1.1428 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.7638 | 1.1005 | 1.1105 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3374 | 0.9742 | 1.0823 | 1.1267 |
| timm_nfnet | 128 | 0.9358 | 0.8936 | nan | 0.9478 | 1.0219 | 1.0495 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.35 | 0.8662 | 0.9791 | 1.0072 |
| hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3701 | nan | 0.9703 | 1.1094 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9284 | 0.9323 |
| Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | 0.9749 | 0.9212 | 0.9238 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8814 | 0.9151 | 0.919 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | 1.0085 | 0.9023 | 0.9928 |
| timm_resnest | 32 | 0.9935 | 0.8793 | 0.3235 | 0.8021 | 0.8982 | 0.9697 |
| speech_transformer | 32 | 0.9982 | 0.9159 | nan | nan | 0.896 | 0.8996 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9986 | 0.9173 | 0.3919 | 0.9169 | 0.8848 | 0.9654 |
| hf_Albert | 8 | 0.9333 | 0.9333 | 0.2846 | nan | 0.8836 | 1.2215 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8681 | 0.8829 | 0.8964 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8737 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9997 | 0.8415 | nan | 0.801 | 0.8616 | 1.0285 |
| pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8496 | 0.859 | 0.8608 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.797 | 0.8564 | 0.8913 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3435 | 0.8551 | 0.8562 | 0.9307 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.3331 | 0.8263 | 0.8531 | 0.8659 |
| hf_Bart | 4 | 0.9617 | 0.8598 | nan | nan | 0.8503 | 1.1284 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3385 | nan | 0.8354 | 1.0952 |
| resnext50_32x4d | 8 | 0.9954 | 0.8671 | 0.3596 | 0.8203 | 0.8303 | 0.8352 |
| BERT_pytorch | 16 | 1.0 | 0.8995 | nan | nan | 0.825 | 1.0689 |
| hf_BigBird | 2 | 0.9604 | 0.9604 | 0.4301 | nan | 0.8211 | 1.0393 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.7903 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8772 | 0.7632 | 0.8778 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 | 0.9991 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3305 | 0.8104 | 0.7478 | 0.8187 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7455 | 0.743 | 0.8332 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3201 | 0.7741 | 0.7286 | 0.7339 |
| LearningToPaint | 96 | 0.9442 | 0.6896 | 0.3385 | 0.6503 | 0.7133 | 0.7462 |
| hf_Bert | 4 | 0.9683 | 0.9011 | 0.3525 | nan | 0.7048 | 0.985 |
| dlrm | 2048 | 0.7302 | 0.7305 | nan | nan | 0.7035 | nan |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3593 | 0.6971 | 0.6902 | 0.7049 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3212 | nan | 0.6596 | 0.9466 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6639 | 0.6471 | 0.6497 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 1.0947 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | nan | nan | 0.4682 | 0.6183 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.429 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4456 | 0.8227 | 0.4056 | 0.4212 |
| hf_Reformer | 4 | 0.3011 | nan | 0.2397 | nan | 0.299 | 0.9882 |
| hf_T5 | 8 | 0.9527 | 0.9415 | nan | nan | nan | 1.1507 |
| tacotron2 | 64 | 0.9906 | 1.093 | nan | nan | nan | 1.1496 |
| hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 |
| hf_Longformer | 2 | 0.9603 | 0.9603 | 0.2945 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+------------------------+
~~~
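The executive summary reports mean compilation latency per backend. For anyone who wants to recompute a per-column mean from a table like the ones above, here is a minimal sketch. It is not the script that generated this dashboard: `column_means` is a hypothetical helper, and it only skips `nan` cells, whereas the dashboard also drops models that fail the accuracy check before averaging.

```python
import math
from statistics import fmean


def column_means(table_text):
    """Return {backend: mean of its non-NaN cells} for a pipe-delimited table."""
    header = None
    columns = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip '+---+' borders, '~~~' fences and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe row is the header
            columns = {name: [] for name in header[2:]}  # drop 'name' and 'bs'
            continue
        for name, cell in zip(header[2:], cells[2:]):
            value = float(cell)  # float("nan") parses to NaN
            if not math.isnan(value):
                columns[name].append(value)
    # Assumption: a backend with no usable cells is reported as NaN.
    return {n: (fmean(v) if v else float("nan")) for n, v in columns.items()}


# Three rows copied from the torchbench compilation latency table.
SAMPLE = """
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
| dlrm | 2048 | 0.4444 | 0.8198 | nan | nan | 3.4517 | nan |
| demucs | 4 | 0.2929 | 0.2934 | 0.2977 | 0.2969 | 0.2011 | 0.1967 |
| moco | 0 | nan | nan | nan | nan | nan | nan |
"""

means = column_means(SAMPLE)
for backend, value in means.items():
    print(f"{backend}: {value:.4f}")
```

Feeding a whole table (borders and fences included) should work the same way, since only lines starting with `|` are read.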

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0285 | 0.9414 | 0.0 | 0.0 | 3.7345 | 1.5254 |
| CamemBert | 1 | 1.0493 | 0.9732 | 1.3251 | 0.0 | 2.3889 | 1.5405 |
| MT5ForConditionalGeneration | 8 | 1.0272 | 0.9263 | 0.0 | 0.0 | 2.2531 | 1.9848 |
| DistillGPT2 | 1 | 1.0322 | 0.9458 | 1.0657 | 0.0 | 2.099 | 1.9009 |
| MobileBertForMaskedLM | 32 | 1.023 | 0.9232 | 0.0 | 0.0 | 1.9829 | 1.574 |
| GoogleFnet | 1 | 0.9985 | 0.8173 | 0.9815 | 1.1247 | 1.9188 | 1.1214 |
| GPT2ForSequenceClassification | 4 | 1.0002 | 0.9779 | 0.0 | 0.0 | 1.6662 | 1.6568 |
| T5ForConditionalGeneration | 4 | 1.0029 | 0.9667 | 0.0 | 0.0 | 1.4388 | 1.4275 |
| M2M100ForConditionalGeneration | 8 | 1.0412 | 0.8942 | 1.0013 | 0.0 | 1.4178 | 1.4085 |
| MobileBertForQuestionAnswering | 64 | 1.024 | 0.9187 | 0.0 | 0.0 | 1.4036 | 1.2789 |
| ElectraForCausalLM | 32 | 1.0004 | 0.9312 | 0.0 | 0.0 | 1.3702 | 1.4028 |
| ElectraForQuestionAnswering | 64 | 1.0005 | 0.9844 | 0.0 | 0.0 | 1.3541 | 1.3368 |
| AlbertForQuestionAnswering | 4 | 1.0002 | 1.0018 | 0.0 | 0.0 | 1.2567 | 1.2522 |
| AlbertForMaskedLM | 4 | 0.9993 | 0.9996 | 0.0 | 0.0 | 1.25 | 1.2519 |
| LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9892 | 0.7379 | 0.0 | 1.2473 | 1.2318 |
| T5Small | 1 | 1.0191 | 0.9543 | 0.0 | 0.0 | 1.2442 | 1.2308 |
| PLBartForConditionalGeneration | 16 | 1.0124 | 0.9613 | 0.0 | 0.0 | 1.1874 | 1.188 |
| OPTForCausalLM | 32 | 1.0037 | 0.932 | 0.0 | 0.0 | 1.1825 | 1.1983 |
| XGLMForCausalLM | 8 | 1.0128 | 0.9394 | 0.0 | 0.0 | 1.1706 | 1.1753 |
| LayoutLMForMaskedLM | 16 | 1.0002 | 0.971 | 0.0 | 0.0 | 1.1633 | 1.1716 |
| DistilBertForQuestionAnswering | 64 | 0.9997 | 0.985 | 0.7131 | 0.0 | 1.1444 | 1.1262 |
| RobertaForCausalLM | 64 | 1.0004 | 0.9637 | 0.7465 | 0.0 | 1.1133 | 1.1212 |
| Speech2Text2ForCausalLM | 128 | 0.9989 | 0.9259 | 0.6593 | 0.0 | 1.11 | 1.1484 |
| BigBird | 1 | 0.9894 | 0.937 | 0.991 | 0.0 | 1.1023 | 1.0034 |
| BartForCausalLM | 4 | 1.0007 | 0.9668 | 0.0 | 0.0 | 1.0962 | 1.1067 |
| BartForConditionalGeneration | 2 | 1.0009 | 0.9887 | 0.0 | 0.0 | 1.0962 | 1.0896 |
| MegatronBertForQuestionAnswering | 16 | 1.038 | 1.0104 | 0.7572 | 0.0 | 1.0947 | 1.0716 |
| MBartForConditionalGeneration | 16 | 1.0102 | 0.9766 | 0.0 | 0.0 | 1.0887 | 1.0775 |
| DebertaForMaskedLM | 4 | 0.9321 | 0.8111 | 0.7317 | 0.0 | 1.0885 | 1.0732 |
| MegatronBertForCausalLM | 16 | 1.0332 | 1.0027 | 0.7578 | 0.0 | 1.087 | 1.0785 |
| PegasusForConditionalGeneration | 16 | 1.0101 | 0.9819 | 0.7569 | 0.0 | 1.0857 | 1.0825 |
| BertForQuestionAnswering | 128 | 0.9997 | 0.9882 | 0.0 | 0.0 | 1.0722 | 1.0661 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9942 | 0.0 | 0.0 | 1.0696 | 1.0709 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0005 | 0.9265 | 0.0 | 0.0 | 1.0628 | 1.0696 |
| DebertaForQuestionAnswering | 8 | 0.9976 | 0.9917 | 0.6821 | 0.0 | 1.0623 | 1.2025 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.9519 | 0.7122 | 0.0 | 1.0362 | 1.0546 |
| BertForMaskedLM | 64 | 1.0003 | 0.9524 | 0.7302 | 0.0 | 1.0338 | 1.0381 |
| PLBartForCausalLM | 32 | 1.0055 | 0.9348 | 0.7321 | 0.0 | 1.0224 | 1.0494 |
| BlenderbotSmallForCausalLM | 64 | 1.0022 | 0.9105 | 0.6827 | 0.0 | 1.0131 | 1.0345 |
| TrOCRForCausalLM | 32 | 1.0017 | 0.9556 | 0.0 | 0.0 | 0.9981 | 1.0096 |
| MBartForCausalLM | 32 | 1.0013 | 0.9555 | 0.0 | 0.0 | 0.9967 | 1.0069 |
| PegasusForCausalLM | 32 | 0.9998 | 0.953 | 0.7325 | 0.0 | 0.9888 | 1.0008 |
| AllenaiLongformerBase | 1 | 0.953 | 0.7915 | 0.7884 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
| GoogleFnet | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| T5Small | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BigBird | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| XGLMForCausalLM | 8 | 2.2364 | 12.2125 | nan | nan | 203.4086 | 201.0863 |
| DebertaForMaskedLM | 4 | 4.684 | 11.0814 | 44.7781 | nan | 163.7151 | 106.9608 |
| DebertaForQuestionAnswering | 8 | 4.5483 | 11.6349 | 43.993 | nan | 152.0741 | 118.2059 |
| M2M100ForConditionalGeneration | 8 | 2.7543 | 15.4794 | 23.643 | nan | 128.0751 | 124.2115 |
| YituTechConvBert | 1 | 2.0946 | 9.5284 | nan | nan | 115.4649 | 119.3641 |
| MT5ForConditionalGeneration | 8 | 3.4744 | 13.6659 | nan | nan | 90.4534 | 91.1223 |
| MobileBertForMaskedLM | 32 | 7.7855 | 27.1609 | nan | nan | 88.9601 | 85.7795 |
| MobileBertForQuestionAnswering | 64 | 7.9327 | 27.5186 | nan | nan | 74.7874 | 71.876 |
| MegatronBertForCausalLM | 16 | 3.0219 | 12.5327 | 19.6699 | nan | 61.5191 | 59.8845 |
| MegatronBertForQuestionAnswering | 16 | 3.0691 | 13.2977 | 19.1034 | nan | 60.2609 | 58.2808 |
| LayoutLMForSequenceClassification | 16 | 1.6734 | 6.6917 | 10.1343 | nan | 59.7267 | 60.187 |
| T5ForConditionalGeneration | 4 | 2.1399 | 8.8895 | nan | nan | 58.3394 | 57.0848 |
| PegasusForConditionalGeneration | 16 | 2.6227 | 14.7158 | 24.2283 | nan | 58.1897 | 54.3056 |
| BartForConditionalGeneration | 2 | 2.8248 | 15.0065 | nan | nan | 57.0652 | 54.7753 |
| T5Small | 1 | 2.1902 | 8.9903 | nan | nan | 55.4364 | 53.2137 |
| MBartForConditionalGeneration | 16 | 2.7868 | 15.512 | nan | nan | 54.3119 | 53.1455 |
| PLBartForConditionalGeneration | 16 | 1.3887 | 8.298 | nan | nan | 47.5246 | 46.3964 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.7139 | 10.0168 | nan | nan | 43.6075 | 41.5748 |
| BigBird | 1 | 7.296 | 13.5333 | 29.6711 | nan | 40.7238 | 26.8699 |
| ElectraForCausalLM | 32 | 1.2891 | 6.2441 | nan | nan | 40.6712 | 39.969 |
| DistillGPT2 | 1 | 0.6422 | 3.1221 | 4.4918 | nan | 33.8479 | 32.6814 |
| LayoutLMForMaskedLM | 16 | 1.6131 | 6.6316 | nan | nan | 32.8126 | 32.5964 |
| BertForMaskedLM | 64 | 1.2973 | 6.3901 | 9.4361 | nan | 32.777 | 31.6779 |
| ElectraForQuestionAnswering | 64 | 1.3222 | 6.4111 | nan | nan | 32.5117 | 31.4854 |
| GPT2ForSequenceClassification | 4 | 1.2751 | 6.1953 | nan | nan | 32.0765 | 31.1399 |
| RobertaForCausalLM | 64 | 1.3104 | 6.1902 | 9.2915 | nan | 28.0396 | 27.4422 |
| BertForQuestionAnswering | 128 | 1.3166 | 6.2802 | nan | nan | 27.7294 | 27.1936 |
| PegasusForCausalLM | 32 | 1.0161 | 5.707 | 8.775 | nan | 27.1087 | 25.1376 |
| MBartForCausalLM | 32 | 0.9522 | 5.5767 | nan | nan | 25.4243 | 24.6154 |
| RobertaForQuestionAnswering | 128 | 1.3205 | 6.387 | nan | nan | 24.5494 | 23.8515 |
| TrOCRForCausalLM | 32 | 0.9241 | 5.5701 | nan | nan | 24.4333 | 24.1797 |
| BartForCausalLM | 4 | 1.0079 | 5.6176 | nan | nan | 24.3593 | 23.6588 |
| AlbertForMaskedLM | 4 | 1.1157 | 5.8703 | nan | nan | 23.8611 | 23.0601 |
| GoogleFnet | 1 | 0.7904 | 3.3495 | 10.4595 | 9.6049 | 23.8114 | 16.1369 |
| BlenderbotSmallForCausalLM | 64 | 0.6439 | 3.7467 | 5.6889 | nan | 23.625 | 22.6972 |
| DistilBertForMaskedLM | 64 | 0.4729 | 2.9552 | 5.8879 | nan | 23.0127 | 22.634 |
| AlbertForQuestionAnswering | 4 | 1.1461 | 5.9483 | nan | nan | 22.7287 | 21.5179 |
| OPTForCausalLM | 32 | 1.0353 | 5.881 | nan | nan | 21.8562 | 20.7457 |
| DistilBertForQuestionAnswering | 64 | 0.4816 | 3.0171 | 5.9235 | nan | 21.8186 | 22.1039 |
| CamemBert | 1 | 1.38 | 6.1479 | 8.5874 | nan | 21.7413 | 21.2151 |
| Speech2Text2ForCausalLM | 128 | 0.577 | 2.9045 | 4.6098 | nan | 19.6271 | 18.24 |
| PLBartForCausalLM | 32 | 0.4938 | 2.9552 | 4.3734 | nan | 18.8954 | 18.2071 |
| AllenaiLongformerBase | 1 | 5.9078 | 14.4262 | 80.0409 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | nan | 1.0596 | 1.1223 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | nan | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9155 | nan | nan | 0.8564 | 1.0758 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | nan | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | nan | 0.842 | 1.3737 |
| BigBird | 1 | 0.999 | 0.9542 | 0.4215 | nan | 0.8224 | 1.0108 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | nan | nan | 0.8215 | 1.1049 |
| DistillGPT2 | 1 | 0.9984 | 0.8218 | 0.3795 | nan | 0.8173 | 0.9383 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | nan | nan | 0.8157 | 0.9642 |
| YituTechConvBert | 1 | 0.9858 | 0.8198 | nan | nan | 0.808 | 0.8738 |
| BartForConditionalGeneration | 2 | 1.0 | 0.893 | nan | nan | 0.7817 | 0.9515 |
| PegasusForCausalLM | 32 | 0.9593 | 0.9232 | 0.3909 | nan | 0.7774 | 0.9692 |
| M2M100ForConditionalGeneration | 8 | 1.007 | 0.9507 | 0.3799 | nan | 0.7712 | 1.016 |
| GoogleFnet | 1 | 0.9983 | 0.9453 | 0.3715 | 1.0813 | 0.7698 | 0.9373 |
| MT5ForConditionalGeneration | 8 | 1.0034 | 0.8861 | nan | nan | 0.7623 | 0.9396 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | nan | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3614 | nan | 0.7492 | 0.9186 |
| PLBartForConditionalGeneration | 16 | 1.0 | 0.8743 | nan | nan | 0.7397 | 0.9638 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | nan | 0.7381 | 0.9055 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | nan | nan | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | nan | 0.7189 | 1.0246 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | nan | 0.7161 | 0.9248 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | nan | nan | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | nan | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | nan | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3178 | nan | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | nan | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | nan | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | nan | nan | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | nan | nan | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | nan | 0.6775 | 0.8801 |
| OPTForCausalLM | 32 | 0.9982 | 0.8655 | nan | nan | 0.6761 | 0.8847 |
| ElectraForCausalLM | 32 | 0.9994 | 0.883 | nan | nan | 0.6731 | 0.905 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | nan | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | nan | 0.6385 | 0.8993 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3641 | nan | 0.6375 | 0.8975 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | nan | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | nan | nan | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | nan | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3553 | nan | 0.4267 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | nan | 0.3264 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9981 | 0.9515 | 0.3209 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~
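To re-derive a suite-level geometric mean speedup from a per-model speedup table like the huggingface one above, something along these lines should do. This is a rough sketch, not the dashboard's own aggregation code: `parse_rows` and `geomean` are hypothetical helpers, and treating `0.0` (and `nan`) cells as failed runs to be skipped is an assumption based on how those cells line up with `fail_to_run` in the accuracy table.

```python
import math


def parse_rows(table_text):
    """Map model name -> list of floats (the columns after 'bs')."""
    rows = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip '+---+' borders, '~~~' fences and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if cells[0] == "name":
            continue  # header row
        rows[cells[0]] = [float(c) for c in cells[2:]]
    return rows


def geomean(values):
    """Geometric mean of the positive, non-NaN entries; NaN if none remain."""
    # Assumption: 0.0 and nan mark models that did not run on that backend.
    kept = [v for v in values if v > 0 and not math.isnan(v)]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(v) for v in kept) / len(kept))


# Three rows copied from the huggingface speedup table.
SAMPLE = """
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
| YituTechConvBert | 1 | 1.0285 | 0.9414 | 0.0 | 0.0 | 3.7345 | 1.5254 |
| CamemBert | 1 | 1.0493 | 0.9732 | 1.3251 | 0.0 | 2.3889 | 1.5405 |
| AllenaiLongformerBase | 1 | 0.953 | 0.7915 | 0.7884 | 0.0 | 0.0 | 0.0 |
"""

rows = parse_rows(SAMPLE)
INDUCTOR = 4  # position of 'inductor' once 'name' and 'bs' are dropped
inductor_geomean = geomean([vals[INDUCTOR] for vals in rows.values()])
print(f"inductor geomean over models that ran: {inductor_geomean:.4f}")
```

The geometric mean is computed in log space so that a long column of ratios does not overflow or underflow when multiplied together.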

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9992 | 0.9956 | 0.8421 | 1.2485 | 1.8144 | 1.7733 |
| lcnet_050 | 128 | 0.9568 | 0.9489 | 0.7675 | 1.4962 | 1.6425 | 1.6316 |
| coat_lite_mini | 128 | 1.0 | 1.0 | 0.8447 | 1.0566 | 1.6056 | 1.5895 |
| regnety_002 | 128 | 0.9778 | 0.9844 | 0.8615 | 1.3561 | 1.4813 | 1.3447 |
| dm_nfnet_f0 | 128 | 1.0 | 1.0003 | 0.0 | 1.2124 | 1.4725 | 1.422 |
| xcit_large_24_p8_224 | 5 | 1.003 | 1.0032 | 0.0 | 0.0 | 1.4529 | 1.4094 |
| hrnet_w18 | 128 | 0.9999 | 0.9985 | 0.0 | 1.3201 | 1.418 | 1.3775 |
| volo_d1_224 | 64 | 0.9999 | 0.9959 | 0.0 | 1.1295 | 1.3859 | 1.3634 |
| dla102 | 128 | 1.0002 | 1.0008 | 0.0 | 1.2853 | 1.3821 | 1.3693 |
| nfnet_l0 | 128 | 0.9997 | 0.7891 | 0.0 | 1.0518 | 1.3733 | 1.3288 |
| res2net50_14w_8s | 128 | 0.9999 | 1.0 | 0.0 | 1.2307 | 1.3564 | 1.3208 |
| mobilenetv2_100 | 128 | 0.9662 | 0.9648 | 0.7065 | 1.0145 | 1.3373 | 1.3526 |
| mobilenetv3_large_100 | 128 | 0.9664 | 0.9632 | 0.7654 | 1.1624 | 1.3356 | 1.3413 |
| crossvit_9_240 | 128 | 0.9999 | 0.9988 | 0.0 | 1.0243 | 1.3305 | 1.3051 |
| adv_inception_v3 | 128 | 1.0 | 0.999 | 0.0 | 1.1253 | 1.328 | 1.3083 |
| gluon_inception_v3 | 128 | 1.0 | 0.9988 | 0.0 | 1.1224 | 1.3249 | 1.3075 |
| inception_v3 | 128 | 1.0 | 0.999 | 0.0 | 1.1257 | 1.3244 | 1.3076 |
| res2next50 | 128 | 1.0 | 1.0009 | 0.0 | 1.166 | 1.3121 | 1.2748 |
| resnest101e | 64 | 1.0001 | 1.0035 | 0.0 | 1.1963 | 1.3115 | 1.2714 |
| gmixer_24_224 | 128 | 0.9999 | 0.8348 | 0.0 | 0.98 | 1.2974 | 1.2696 |
| fbnetv3_b | 128 | 0.9642 | 0.9614 | 0.7623 | 1.1326 | 1.283 | 1.2951 |
| botnet26t_256 | 128 | 0.9851 | 0.9857 | 0.7892 | 1.2271 | 1.2742 | 1.2801 |
| jx_nest_base | 32 | 0.9998 | 0.9926 | 0.0 | 1.217 | 1.2725 | 1.2481 |
| sebotnet33ts_256 | 64 | 0.9753 | 0.8072 | 0.0 | 1.0528 | 1.2706 | 1.2762 |
| eca_botnext26ts_256 | 128 | 0.9867 | 0.7721 | 0.0 | 1.0301 | 1.2706 | 1.2477 |
| selecsls42b | 128 | 0.9998 | 0.9991 | 0.8157 | 1.2083 | 1.2671 | 1.2514 |
| tf_efficientnet_b0 | 128 | 0.9776 | 0.7843 | 0.0 | 0.9848 | 1.2613 | 1.2686 |
| mnasnet_100 | 128 | 0.9663 | 0.9639 | 0.7855 | 1.1575 | 1.2598 | 1.2787 |
| eca_halonext26ts | 128 | 0.9877 | 0.7787 | 0.0 | 1.0289 | 1.2502 | 1.2494 |
| fbnetc_100 | 128 | 0.967 | 0.9622 | 0.7908 | 1.1879 | 1.2497 | 1.2635 |
| ese_vovnet19b_dw | 128 | 0.9795 | 0.9777 | 0.7445 | 1.1452 | 1.2404 | 1.2461 |
| spnasnet_100 | 128 | 0.9605 | 0.9573 | 0.7734 | 1.1366 | 1.2375 | 1.2543 |
| cspdarknet53 | 64 | 0.9581 | 0.9526 | 0.7322 | 1.1835 | 1.2287 | 1.2391 |
| res2net101_26w_4s | 64 | 0.9997 | 0.9972 | 0.7705 | 1.1739 | 1.2283 | 1.1885 |
| convit_base | 64 | 0.9998 | 0.9992 | 0.0 | 1.195 | 1.2216 | 1.2164 |
| pit_b_224 | 64 | 1.0001 | 0.9996 | 0.0 | 1.055 | 1.221 | 1.211 |
| gmlp_s16_224 | 128 | 1.0 | 0.9994 | 0.0 | 0.9989 | 1.2164 | 1.2053 |
| rexnet_100 | 128 | 0.9723 | 0.8169 | 0.0 | 0.9835 | 1.2142 | 1.2193 |
| pnasnet5large | 16 | 0.9998 | 0.9985 | 0.0 | 1.0838 | 1.2112 | 1.1932 |
| tinynet_a | 128 | 0.9659 | 0.7757 | 0.6205 | 0.9713 | 1.1925 | 1.1949 |
| cait_m36_384 | 4 | 0.9998 | 0.0 | 0.0 | 0.0 | 1.1826 | 1.158 |
| tf_mixnet_l | 128 | 0.9853 | 0.8897 | 0.0 | 1.0177 | 1.173 | 1.1697 |
| dpn107 | 32 | 0.958 | 0.9367 | 0.7817 | 1.0288 | 1.1726 | 1.202 |
| mobilevit_s | 64 | 0.9792 | 0.762 | 0.0 | 0.9468 | 1.1702 | 1.1666 |
| repvgg_a2 | 128 | 0.9641 | 0.9623 | 0.8288 | 1.1224 | 1.1692 | 1.1652 |
| poolformer_m36 | 64 | 0.9998 | 0.9993 | 0.0 | 0.0 | 1.1661 | 1.1475 |
| mixnet_l | 128 | 0.9849 | 0.8858 | 0.0 | 1.0185 | 1.1534 | 1.1505 |
| twins_pcpvt_base | 64 | 1.0001 | 0.9974 | 0.75 | 1.0624 | 1.148 | 1.1172 |
| swin_base_patch4_window7_224 | 64 | 0.9999 | 0.9785 | 0.0 | 0.9932 | 1.1469 | 1.1322 |
| convnext_base | 64 | 0.9999 | 0.9988 | 0.0 | 1.0441 | 1.1157 | 1.1262 |
| beit_base_patch16_224 | 64 | 0.9998 | 0.9801 | 0.0 | 0.9504 | 1.1141 | 1.1053 |
| swsl_resnext101_32x16d | 32 | 1.0001 | 0.9988 | 0.0 | 1.1071 | 1.1068 | 1.0712 |
| deit_base_distilled_patch16_224 | 64 | 1.0 | 0.9995 | 0.7673 | 1.0156 | 1.0955 | 1.0834 |
| gluon_xception65 | 32 | 0.9998 | 0.9975 | 0.0 | 1.0403 | 1.0871 | 1.0759 |
| vit_base_patch16_224 | 64 | 1.0002 | 0.999 | 0.7662 | 0.9763 | 1.0855 | 1.0734 |
| mixer_b16_224 | 128 | 1.0006 | 1.0001 | 0.0 | 0.9771 | 1.0808 | 1.0736 |
| convmixer_768_32 | 32 | 0.9999 | 1.0002 | 0.0 | 1.0615 | 1.0783 | 1.0744 |
| gernet_l | 128 | 0.9744 | 0.9723 | 0.8239 | 1.0992 | 1.075 | 1.0704 |
| visformer_small | 128 | 1.0001 | 1.0022 | 0.797 | 1.0217 | 1.0495 | 1.0162 |
| resmlp_12_224 | 128 | 0.9999 | 1.001 | 0.6956 | 0.0 | 0.9499 | 0.9719 |
| tnt_s_patch16_224 | 128 | 1.0 | 0.9992 | 0.0 | 1.6263 | 0.0 | 1.5436 |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| convnext_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| jx_nest_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| volo_d1_224 | 2 | pass | pass | fail_to_run | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| convit_base | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| gluon_xception65 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| poolformer_m36 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | pass | pass | pass | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| dpn107 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| fbnetv3_b | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+-------------+----------------+---------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+
| twins_pcpvt_base | 64 | 2.064 | 13.0072 | 21.5012 | 42.855 | 431.1592 | 426.4103 |
| coat_lite_mini | 128 | 1.0194 | 5.4653 | 7.961 | 14.7686 | 362.4216 | 372.6703 |
| mobilevit_s | 64 | 1.5683 | 7.1641 | nan | 42.4621 | 233.8428 | 237.9062 |
| eca_halonext26ts | 128 | 1.4144 | 5.4751 | nan | 55.2357 | 204.8437 | 207.0974 |
| sebotnet33ts_256 | 64 | 1.7651 | 6.6709 | nan | 51.039 | 185.8238 | 191.2608 |
| eca_botnext26ts_256 | 128 | 1.3797 | 5.2911 | nan | 52.9221 | 179.8768 | 176.7545 |
| swin_base_patch4_window7_224 | 64 | 2.5123 | 12.7354 | nan | 58.0591 | 177.0112 | 174.7488 |
| xcit_large_24_p8_224 | 5 | 2.603 | 17.1709 | nan | nan | 172.3324 | 164.8544 |
| jx_nest_base | 32 | 1.6708 | 9.2321 | nan | 57.8786 | 155.4547 | 156.5451 |
| convnext_base | 64 | 1.2341 | 5.9929 | nan | 20.8438 | 133.0295 | 129.8216 |
| cait_m36_384 | 4 | 2.6486 | nan | nan | nan | 132.7509 | 130.12 |
| hrnet_w18 | 128 | 5.6217 | 31.9848 | nan | 251.7181 | 106.8258 | 100.7524 |
| botnet26t_256 | 128 | 1.3057 | 4.4635 | 10.0598 | 40.2751 | 106.2411 | 103.5341 |
| crossvit_9_240 | 128 | 1.3396 | 7.9862 | nan | 27.0701 | 97.9064 | 96.8689 |
| resnest101e | 64 | 2.998 | 16.9945 | nan | 78.2291 | 93.9541 | 89.7619 |
| pnasnet5large | 16 | 4.1626 | 22.9703 | nan | 123.7628 | 87.4338 | 84.1545 |
| volo_d1_224 | 64 | 1.1595 | 7.6273 | nan | 28.0879 | 85.2424 | 83.6849 |
| gmlp_s16_224 | 128 | 0.9511 | 6.2939 | nan | 13.365 | 71.7498 | 69.4367 |
| visformer_small | 128 | 0.9009 | 4.189 | 6.2793 | 24.3038 | 71.1462 | 69.6831 |
| pit_b_224 | 64 | 0.9339 | 4.8631 | nan | 12.5251 | 66.2774 | 65.1378 |
| res2net101_26w_4s | 64 | 2.9852 | 17.3432 | 28.4155 | 80.897 | 55.6027 | 52.0513 |
| gmixer_24_224 | 128 | 1.0133 | 7.3092 | nan | 16.5474 | 51.9895 | 50.5586 |
| convit_base | 64 | 0.9843 | 5.9421 | nan | 18.0525 | 50.9922 | 49.952 |
| res2net50_14w_8s | 128 | 2.5693 | 15.6494 | nan | 98.8662 | 50.8157 | 49.7271 |
| gluon_xception65 | 32 | 1.6885 | 11.1965 | nan | 41.7582 | 49.2318 | 45.5937 |
| poolformer_m36 | 64 | 1.8121 | 9.7062 | nan | nan | 47.0371 | 44.6651 |
| resmlp_12_224 | 128 | 0.6088 | 2.794 | 5.5064 | nan | 42.3381 | 38.0426 |
| swsl_resnext101_32x16d | 32 | 1.6289 | 10.0288 | nan | 39.6141 | 41.9677 | 41.3616 |
| dpn107 | 32 | 3.7727 | 14.7274 | 45.6394 | 76.1359 | 40.3245 | 37.6555 |
| mixer_b16_224 | 128 | 0.6548 | 3.2155 | nan | 10.7856 | 37.0102 | 35.4768 |
| deit_base_distilled_patch16_224 | 64 | 0.8289 | 4.303 | 6.6094 | 10.4203 | 36.0592 | 34.6956 |
| convmixer_768_32 | 32 | 1.0862 | 6.4498 | nan | 13.7196 | 35.8067 | 33.0945 |
| fbnetv3_b | 128 | 3.0734 | 11.1026 | 29.9803 | 76.0043 | 35.7771 | 33.8855 |
| vit_base_patch16_224 | 64 | 0.8583 | 4.1826 | 6.5315 | 9.6845 | 35.7583 | 35.0589 |
| gluon_inception_v3 | 128 | 1.4815 | 8.9849 | nan | 66.9443 | 35.0345 | 32.4497 |
| inception_v3 | 128 | 1.4787 | 9.0238 | nan | 67.1459 | 34.8548 | 32.5473 |
| adv_inception_v3 | 128 | 1.4876 | 8.9769 | nan | 66.9311 | 34.3905 | 32.5332 |
| tf_mixnet_l | 128 | 5.7484 | 13.3541 | nan | 68.7911 | 33.8729 | 32.1963 |
| ghostnet_100 | 128 | 2.6432 | 9.6507 | 13.7666 | 58.927 | 32.695 | 30.8681 |
| beit_base_patch16_224 | 64 | 1.0871 | 5.6134 | nan | 13.7621 | 32.6318 | 30.8008 |
| mixnet_l | 128 | 5.3204 | 12.7271 | nan | 67.9763 | 32.5983 | 31.893 |
| dm_nfnet_f0 | 128 | 2.0094 | 7.6042 | nan | 29.9754 | 32.3805 | 29.3454 |
| dla102 | 128 | 1.6603 | 10.0975 | nan | 63.1714 | 32.1124 | 30.2312 |
| res2next50 | 128 | 1.4989 | 8.7791 | nan | 66.7002 | 29.6202 | 27.9053 |
| rexnet_100 | 128 | 1.8062 | 7.4568 | nan | 102.1027 | 26.5523 | 25.3591 |
| tinynet_a | 128 | 1.9614 | 8.2078 | 20.2872 | 61.7507 | 25.7941 | 24.6542 |
| cspdarknet53 | 64 | 2.2264 | 7.7188 | 20.8213 | 48.0307 | 23.2515 | 22.0433 |
| nfnet_l0 | 128 | 1.7245 | 7.5828 | nan | 27.3095 | 23.1165 | 21.8966 |
| tf_efficientnet_b0 | 128 | 1.7202 | 6.9673 | nan | 61.9316 | 22.7574 | 21.5149 | | fbnetc_100 | 128 | 1.9567 | 6.9499 | 18.078 | 45.3002 | 21.9517 | 20.7368 | | spnasnet_100 | 128 | 1.9161 | 6.665 | 17.4815 | 43.4797 | 21.4795 | 20.4556 | | mobilenetv3_large_100 | 128 | 1.5899 | 5.5688 | 13.4352 | 64.4429 | 19.9372 | 19.5642 | | mnasnet_100 | 128 | 1.6356 | 5.5127 | 14.0767 | 37.4665 | 18.8558 | 18.0133 | | mobilenetv2_100 | 128 | 1.6442 | 5.4933 | 13.7945 | 37.5793 | 18.5669 | 17.7858 | | gernet_l | 128 | 1.8816 | 6.4469 | 16.2236 | 35.9904 | 18.4345 | 17.2115 | | repvgg_a2 | 128 | 1.8567 | 6.1905 | 15.7371 | 43.751 | 17.9569 | 16.9557 | | regnety_002 | 128 | 1.4855 | 5.8417 | 13.8786 | 46.2472 | 17.8219 | 17.3541 | | selecsls42b | 128 | 0.7717 | 4.0352 | 5.8995 | 39.8612 | 16.4046 | 15.3492 | | lcnet_050 | 128 | 0.9705 | 3.4278 | 7.1291 | 31.167 | 13.6937 | 12.51 | | ese_vovnet19b_dw | 128 | 0.9768 | 3.251 | 6.9304 | 30.8107 | 12.7375 | 11.8284 | | tnt_s_patch16_224 | 128 | 1.4723 | 10.2065 | nan | 22.8828 | nan | 50.0197 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9716 | nan | 0.9859 | 1.5612 | 1.6333 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.7823 | 1.351 | 1.3692 | | nfnet_l0 | 128 | 0.993 | 0.8272 | nan | 0.8084 | 1.2908 | 1.3392 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 0.8682 | 1.2619 | 1.2765 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.8401 | 1.1889 | 1.199 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.2062 | 
1.1876 | 1.3282 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | 0.7405 | 1.1793 | 1.2286 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | 0.7612 | 1.1378 | 1.2076 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | 0.7643 | 1.1375 | 1.2068 | | cait_m36_384 | 4 | 0.9994 | nan | nan | nan | 1.1185 | 1.1745 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.7635 | 1.1003 | 1.1104 | | poolformer_m36 | 64 | 0.998 | 0.9512 | nan | nan | 1.0527 | 1.069 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8936 | nan | 0.9479 | 1.0218 | 1.0495 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.8606 | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.95 | 0.9994 | 1.0025 | | vit_base_patch16_224 | 64 | 0.9963 | 0.9434 | 0.3153 | 0.8229 | 0.997 | 1.0835 | | deit_base_distilled_patch16_224 | 64 | 0.9964 | 0.9442 | 0.3138 | 0.8242 | 0.9925 | 1.0805 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | 0.8403 | 0.9888 | 1.0866 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9345 | 0.9853 | 1.0102 | | mixer_b16_224 | 128 | 0.9952 | 0.9661 | nan | 0.8571 | 0.985 | 1.0538 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | 0.9793 | 0.9836 | 0.9853 | | volo_d1_224 | 64 | 0.996 | 0.9213 | nan | 0.7472 | 0.9799 | 0.9971 | | gmlp_s16_224 | 128 | 0.9959 | 0.9783 | nan | 0.9704 | 0.9766 | 0.9827 | | tf_mixnet_l | 128 | 0.9953 | 0.857 | nan | 0.8574 | 0.9711 | 1.0812 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.784 | 0.9696 | 0.977 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.9194 | nan | nan | 0.9611 | 1.0549 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.7604 | 0.9576 | 0.9855 | | dla102 | 128 | 0.9831 | 0.917 | nan | 0.9529 | 0.9496 | 0.9538 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | 0.8649 | 0.9376 | 0.9419 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8982 | 0.9351 | 0.9376 | | res2net101_26w_4s | 64 | 0.9968 | 0.9278 | 0.3243 | 0.8932 | 0.9269 | 0.9548 | | jx_nest_base | 32 | 1.0002 | 0.8966 | nan | 0.7112 | 0.9187 | 1.0509 | | 
ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9302 | 0.9095 | 0.9161 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | 0.83 | 0.9068 | 1.0518 | | dpn107 | 32 | 0.9985 | 0.9271 | 0.3392 | 0.8941 | 0.9058 | 0.956 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.8618 | 0.9051 | 0.9312 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.9047 | 0.9157 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.9014 | 1.0067 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8745 | 0.9007 | 0.9126 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | 0.9475 | 0.9006 | 0.951 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8954 | 0.899 | 0.9192 | | adv_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | gluon_inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | inception_v3 | 128 | 0.9901 | 0.8617 | nan | 0.8724 | 0.8983 | 0.9073 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.8961 | 0.9077 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8972 | nan | 0.8675 | 0.8931 | 0.9249 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7524 | 0.8921 | 0.923 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8762 | 0.8835 | 0.8875 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8611 | 0.881 | 0.9327 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7599 | 0.8617 | 0.8993 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | 0.745 | 0.8605 | 0.8702 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 0.6417 | 0.8417 | 1.0633 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.8416 | 0.8498 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7084 | nan | 0.6831 | 0.841 | 0.9711 | | coat_lite_mini | 128 | 1.0049 | 0.8777 | 0.3262 | 0.7873 | 0.8404 | 1.0528 | | resmlp_12_224 | 128 | 0.9893 | 0.943 | 0.2472 | nan | 0.8169 | 0.8253 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.8234 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.6573 | 0.7684 | 0.8011 | | convit_base | 64 | 
0.9977 | 0.8838 | nan | 0.9506 | 0.7463 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8657 | nan | 0.7297 | 0.6496 | 0.8704 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | 0.8539 | nan | 0.8623 | +---------------------------------+-----+--------+-----------+----------------+-------------+----------+------------------------+ ~~~

Performance graphs

../test-dynamo-runner-logs/huggingface_float32.png : ![](https://i.imgur.com/Wtf7hQN.png)

../test-dynamo-runner-logs/timm_models_float32.png : ![](https://i.imgur.com/GMzSgFV.png)

../test-dynamo-runner-logs/torchbench_float32.png : ![](https://i.imgur.com/2EbmzXj.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency numbers and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
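
As a rough illustration of the two steps described above (drop models that fail accuracy checks, then aggregate the rest), here is a minimal sketch. The `geomean_speedup` helper and the tuple layout are assumptions made for this example and are not the dashboard's actual code. Skipping non-positive speedups is likewise an assumption, made only because the logarithm is undefined for them.

```python
import math


def geomean_speedup(records):
    """Geometric mean of speedups over models whose accuracy check passed.

    `records` is a list of (name, speedup, accuracy_status) tuples. This
    layout is hypothetical, chosen only to keep the example small.
    """
    passing = [
        speedup
        for _, speedup, status in records
        if status == "pass" and speedup > 0  # assumption: skip non-positive values
    ]
    if not passing:
        return float("nan")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in passing) / len(passing))


records = [
    ("mobilenet_v2", 1.4236, "pass"),             # value from this report's table
    ("hypothetical_model", 0.5, "fail_accuracy"),  # made-up row to show filtering
]
print(f"{geomean_speedup(records):.2f}x")  # prints 1.42x
```

With a single passing model the geometric mean is just that model's speedup, which is why the summary row and the per-model row agree in this report.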

Passrate

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor | 100%, 1/1  |
+----------+------------+

Geometric mean speedup

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |   1.42x    |
+----------+------------+

Mean compilation time (seconds)

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |   18.23    |
+----------+------------+

Peak memory footprint compression ratio (higher is better)

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |   1.17x    |
+----------+------------+

Warnings

Metrics over time

../test-dynamo-runner-logs-3/passrate_over_time.png : ![](https://i.imgur.com/Hq69IeS.png)

../test-dynamo-runner-logs-3/geomean_over_time.png : ![](https://i.imgur.com/dFmly15.png)

Accuracy Regressions

torchbench suite with float32 precision

Performance speedup
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 |  1.4236  |
+--------------+----+----------+
~~~

Accuracy
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 2  |   pass   |
+--------------+----+----------+
~~~

Compilation latency (sec)
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 |  18.231  |
+--------------+----+----------+
~~~

Peak Memory Compression Ratio
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 |  1.1741  |
+--------------+----+----------+
~~~

Performance graphs

../test-dynamo-runner-logs-3/torchbench_float32.png : ![](https://i.imgur.com/9K03i5M.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency numbers and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor | 100%, 1/1  |
+----------+------------+

Geometric mean speedup

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |   1.42x    |
+----------+------------+

Mean compilation time (seconds)

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |   17.48    |
+----------+------------+

Peak memory footprint compression ratio (higher is better)

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |   1.17x    |
+----------+------------+

Mean absolute latency (seconds)

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |    0.05    |
+----------+------------+

Warnings

Metrics over time

../test-dynamo-runner-logs-7/passrate_over_time.png : ![](https://i.imgur.com/26OrbJE.png)

../test-dynamo-runner-logs-7/geomean_over_time.png : ![](https://i.imgur.com/M7CZ5ca.png)

Accuracy Regressions

torchbench suite with float32 precision

Performance speedup
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 |  1.4188  |
+--------------+----+----------+
~~~

Accuracy
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 2  |   pass   |
+--------------+----+----------+
~~~

Compilation latency (sec)
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 | 17.4844  |
+--------------+----+----------+
~~~

Peak Memory Compression Ratio
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 |  1.1743  |
+--------------+----+----------+
~~~

Absolute latency (sec)
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 |  0.0504  |
+--------------+----+----------+
~~~

Performance graphs

../test-dynamo-runner-logs-7/torchbench_float32.png : ![](https://i.imgur.com/Q0RwZF3.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency numbers and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor | 100%, 1/1  |
+----------+------------+

Geometric mean speedup

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |   1.42x    |
+----------+------------+

Mean compilation time (seconds)

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |   17.28    |
+----------+------------+

Peak memory footprint compression ratio (higher is better)

+----------+------------+
| Compiler | torchbench |
+----------+------------+
| inductor |   1.17x    |
+----------+------------+

Warnings

Metrics over time

../test-dynamo-runner-logs-7/passrate_over_time.png : ![](https://i.imgur.com/HoZCe3x.png)

../test-dynamo-runner-logs-7/geomean_over_time.png : ![](https://i.imgur.com/mowmZgy.png)

Accuracy Regressions

torchbench suite with float32 precision

Performance speedup
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 |  1.4213  |
+--------------+----+----------+
~~~

Accuracy
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 2  |   pass   |
+--------------+----+----------+
~~~

Compilation latency (sec)
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 | 17.2806  |
+--------------+----+----------+
~~~

Peak Memory Compression Ratio
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 |  1.1743  |
+--------------+----+----------+
~~~

Absolute latency (ms)
~~~
+--------------+----+----------+
|     name     | bs | inductor |
+--------------+----+----------+
| mobilenet_v2 | 96 | 50.3118  |
+--------------+----+----------+
~~~

Performance graphs

../test-dynamo-runner-logs-7/torchbench_float32.png : ![](https://i.imgur.com/STLmMio.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency numbers and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 82%, 53/65 | 84%, 43/51  | 82%, 61/74  |
|       aot_eager        | 83%, 54/65 | 84%, 43/51  | 82%, 61/74  |
|     aot_cudagraphs     | 69%, 45/65 | 65%, 33/51  | 38%, 28/74  |
|    nvprims_nvfuser     | 48%, 31/65 | 78%, 40/51  | 26%, 19/74  |
|        inductor        | 75%, 49/65 | 82%, 42/51  | 81%, 60/74  |
| inductor_no_cudagraphs | 82%, 53/65 | 82%, 42/51  | 82%, 61/74  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.11x    |    1.04x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.03x    |    1.11x    |
|        inductor        |   1.50x    |    1.29x    |    1.25x    |
| inductor_no_cudagraphs |   1.24x    |    1.22x    |    1.23x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.16    |    2.43     |    1.91     |
|       aot_eager        |    5.77    |    7.84     |    7.05     |
|     aot_cudagraphs     |    8.60    |    16.10    |    13.16    |
|    nvprims_nvfuser     |   73.63    |   109.11    |   124.35    |
|        inductor        |   29.31    |    29.54    |    34.71    |
| inductor_no_cudagraphs |   28.61    |    25.45    |    33.28    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.85x    |    0.87x    |    0.84x    |
|        inductor        |   0.87x    |    0.72x    |    0.98x    |
| inductor_no_cudagraphs |   1.01x    |    0.96x    |    1.09x    |
+------------------------+------------+-------------+-------------+

Warnings

We flag models where:
- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings
~~~
+-------------+------------------------+----------+------------------------+
|    suite    |          name          | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench  |     lennard_jones      |  1.7378  |         0.9441         |
| torchbench  |   soft_actor_critic    |  1.4286  |         0.9322         |
| torchbench  | nvidia_deeprecommender |  0.9036  |         0.9642         |
| torchbench  |          dlrm          |   0.0    |         1.0444         |
| torchbench  |     hf_GPT2_large      |   0.0    |         1.4742         |
| torchbench  |         hf_T5          |   0.0    |         1.5685         |
| torchbench  |       tacotron2        |   0.0    |         0.9028         |
| torchbench  |     hf_Longformer      |   0.0    |          0.0           |
| torchbench  |          moco          |   0.0    |          0.0           |
| huggingface | AllenaiLongformerBase  |   0.0    |          0.0           |
| timm_models |     resmlp_12_224      |  0.7921  |         0.8299         |
| timm_models |   tnt_s_patch16_224    |   0.0    |         1.5428         |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+------------+-------------------+----------+------------------------+
|   suite    |       name        | inductor | inductor_no_cudagraphs |
+------------+-------------------+----------+------------------------+
| torchbench |      yolov3       | 371.9531 |        363.8208        |
| torchbench | timm_efficientdet | 122.8743 |        119.0122        |
+------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
|    suite    |                  name                   | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  |              timm_resnest               |  0.8982  |         1.0018         |
| torchbench  |                hf_Albert                |  0.8836  |         1.2212         |
| torchbench  |           mobilenet_v3_large            |  0.8829  |         0.896          |
| torchbench  |               hf_T5_large               |  0.8737  |         0.922          |
| torchbench  |      timm_vision_transformer_large      |  0.8622  |         1.0312         |
| torchbench  |                resnet50                 |  0.8564  |         0.9343         |
| torchbench  |               densenet121               |  0.8562  |         1.0006         |
| torchbench  |               mnasnet1_0                |  0.8531  |         0.8659         |
| torchbench  |              fastNLP_Bert               |  0.8354  |         1.1229         |
| torchbench  |                 hf_Bart                 |  0.8318  |         1.1277         |
| torchbench  |             resnext50_32x4d             |  0.8302  |         0.8356         |
| torchbench  |              BERT_pytorch               |  0.826   |         1.0815         |
| torchbench  |               hf_BigBird                |  0.8211  |         1.0391         |
| torchbench  |                  dcgan                  |  0.767   |         0.8875         |
| torchbench  |                   drq                   |  0.7632  |         0.8778         |
| torchbench  |               timm_vovnet               |  0.7609  |         0.9526         |
| torchbench  |         timm_vision_transformer         |  0.7517  |         0.8216         |
| torchbench  |            soft_actor_critic            |   0.75   |         0.9991         |
| torchbench  |                 alexnet                 |  0.743   |         0.8335         |
| torchbench  |                 hf_Bert                 |  0.7062  |         1.0016         |
| torchbench  |                resnet18                 |  0.6902  |         0.7049         |
| torchbench  |             LearningToPaint             |  0.6889  |         0.916          |
| torchbench  |                  vgg16                  |  0.6637  |         0.9553         |
| torchbench  |              hf_DistilBert              |  0.6595  |         0.9466         |
| torchbench  |              lennard_jones              |  0.5646  |         0.9989         |
| torchbench  |         nvidia_deeprecommender          |  0.5598  |         0.5598         |
| torchbench  |               hf_Reformer               |  0.5232  |         0.9892         |
| torchbench  |    attention_is_all_you_need_pytorch    |  0.4867  |         0.6781         |
| torchbench  |             pytorch_struct              |  0.4222  |         0.4335         |
| torchbench  |          functorch_dp_cifar10           |  0.4056  |         0.4214         |
| torchbench  |                  dlrm                   |   nan    |         0.7306         |
| huggingface |       AlbertForQuestionAnswering        |  0.8646  |         1.4039         |
| huggingface |                 T5Small                 |  0.8453  |         1.0606         |
| huggingface |     PegasusForConditionalGeneration     |  0.8436  |         1.0204         |
| huggingface |            AlbertForMaskedLM            |  0.842   |         1.3737         |
| huggingface |       T5ForConditionalGeneration        |  0.8215  |         1.1049         |
| huggingface |                 BigBird                 |  0.821   |         1.0085         |
| huggingface |             XGLMForCausalLM             |  0.8157  |         0.9642         |
| huggingface |     M2M100ForConditionalGeneration      |  0.8138  |         1.0093         |
| huggingface |               DistillGPT2               |  0.8057  |         0.9257         |
| huggingface |           ElectraForCausalLM            |  0.7929  |         0.9036         |
| huggingface |            YituTechConvBert             |  0.7888  |         0.8725         |
| huggingface |           PegasusForCausalLM            |  0.7774  |         0.931          |
| huggingface |      BartForConditionalGeneration       |  0.7734  |         0.9515         |
| huggingface |               GoogleFnet                |  0.7698  |         0.9372         |
| huggingface |       MT5ForConditionalGeneration       |  0.763   |         0.9406         |
| huggingface |    MegatronBertForQuestionAnswering     |  0.7528  |         0.9646         |
| huggingface |                CamemBert                |  0.7487  |         0.9186         |
| huggingface |            PLBartForCausalLM            |  0.7381  |         0.9055         |
| huggingface |     PLBartForConditionalGeneration      |  0.7238  |         0.9373         |
| huggingface |      MBartForConditionalGeneration      |  0.7209  |         0.9059         |
| huggingface |    LayoutLMForSequenceClassification    |  0.7189  |         1.0294         |
| huggingface |         MegatronBertForCausalLM         |  0.7161  |         0.9247         |
| huggingface |             BartForCausalLM             |  0.7149  |         0.9466         |
| huggingface |       BlenderbotSmallForCausalLM        |  0.7147  |         0.8647         |
| huggingface |       ElectraForQuestionAnswering       |  0.7054  |         1.0298         |
| huggingface |     DistilBertForQuestionAnswering      |  0.6981  |         0.9303         |
| huggingface | BlenderbotSmallForConditionalGeneration |  0.6977  |         0.946          |
| huggingface |           LayoutLMForMaskedLM           |  0.695   |         0.9772         |
| huggingface |            MBartForCausalLM             |  0.6836  |         0.8978         |
| huggingface |            TrOCRForCausalLM             |  0.6827  |         0.8876         |
| huggingface |         Speech2Text2ForCausalLM         |  0.6775  |         0.9179         |
| huggingface |             OPTForCausalLM              |  0.6764  |         0.8848         |
| huggingface |          DistilBertForMaskedLM          |  0.6531  |         0.9124         |
| huggingface |             BertForMaskedLM             |  0.6385  |         0.8992         |
| huggingface |           RobertaForCausalLM            |  0.6375  |         0.8974         |
| huggingface |        BertForQuestionAnswering         |  0.6329  |         0.8939         |
| huggingface |       RobertaForQuestionAnswering       |  0.6329  |         0.8939         |
| huggingface |          MobileBertForMaskedLM          |  0.5256  |         0.7111         |
| huggingface |     MobileBertForQuestionAnswering      |  0.4536  |         0.5968         |
| huggingface |           DebertaForMaskedLM            |  0.386   |         1.0347         |
| huggingface |       DebertaForQuestionAnswering       |  0.2902  |         1.1588         |
| timm_models |               selecsls42b               |  0.899   |         1.0046         |
| timm_models |         swsl_resnext101_32x16d          |  0.8932  |         0.9946         |
| timm_models |            res2net50_14w_8s             |  0.8821  |         1.0206         |
| timm_models |               regnety_002               |  0.8617  |         1.0396         |
| timm_models |              botnet26t_256              |  0.8605  |         0.9622         |
| timm_models |                pit_b_224                |  0.8563  |         1.0752         |
| timm_models |            sebotnet33ts_256             |  0.841   |         0.9709         |
| timm_models |             coat_lite_mini              |  0.821   |         1.0246         |
| timm_models |                gernet_l                 |  0.7928  |         0.9926         |
| timm_models |              resmlp_12_224              |  0.7899  |         0.7979         |
| timm_models |                repvgg_a2                |  0.7684  |         0.9902         |
| timm_models |               convit_base               |  0.7462  |         0.9008         |
| timm_models |             crossvit_9_240              |  0.6584  |         0.8853         |
| timm_models |            tnt_s_patch16_224            |   nan    |         0.8622         |
+-------------+-----------------------------------------+----------+------------------------+
~~~
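
The three warning criteria can be expressed as a small check per model. The following is only a sketch. The thresholds come from the list above, while the `flag_model` function, its argument names and the label strings are hypothetical and not taken from the dashboard's code.

```python
# Thresholds quoted from the warning criteria above.
SPEEDUP_MIN = 0.95
COMPILE_LATENCY_MAX_SEC = 120.0
COMPRESSION_RATIO_MIN = 0.9


def flag_model(speedup=None, compile_latency_sec=None, compression_ratio=None):
    """Return the warning labels triggered by one model's metrics.

    Metrics left as None are not checked. Names and labels are hypothetical.
    """
    warnings = []
    if speedup is not None and speedup < SPEEDUP_MIN:
        warnings.append("speedup")
    if compile_latency_sec is not None and compile_latency_sec > COMPILE_LATENCY_MAX_SEC:
        warnings.append("compilation_latency")
    if compression_ratio is not None and compression_ratio < COMPRESSION_RATIO_MIN:
        warnings.append("compression_ratio")
    return warnings


# Values from the inductor_no_cudagraphs column of the warning tables above.
print(flag_model(speedup=0.9441, compression_ratio=0.9989))  # lennard_jones
print(flag_model(compile_latency_sec=363.8208))              # yolov3
print(flag_model(compile_latency_sec=119.0122))              # timm_efficientdet
```

Under these thresholds, timm_efficientdet is not flagged for inductor_no_cudagraphs (119.0122 sec), so its row in the compilation latency table is there because of the inductor column (122.8743 sec).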

Metrics over time

../test-dynamo-runner-logs-4/passrate_over_time.png : ![](https://i.imgur.com/DJ2AdK4.png)

../test-dynamo-runner-logs-4/geomean_over_time.png : ![](https://i.imgur.com/kSXuuVe.png)

Accuracy Regressions

For each relevant compiler, we compare the 2 most recent reports (that actually run the compiler) to find models where previously successful accuracy tests now fail.

No accuracy regressions found.
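
The comparison described above can be sketched as a lookup across two status maps. This is an illustration under stated assumptions, not the dashboard's implementation. The `find_accuracy_regressions` function, the dict layout and the model names are hypothetical, and treating `pass_due_to_skip` as a passing status is an assumption.

```python
# Assumption: both of these statuses count as a successful accuracy test.
PASSING = {"pass", "pass_due_to_skip"}


def find_accuracy_regressions(previous, current):
    """Return models that passed in `previous` but no longer pass in `current`.

    Both arguments map model name -> accuracy status string for one compiler.
    Models missing from either report are ignored.
    """
    return sorted(
        name
        for name, status in current.items()
        if previous.get(name) in PASSING and status not in PASSING
    )


# Hypothetical statuses for two consecutive reports of one compiler.
previous = {"model_a": "pass", "model_b": "pass", "model_c": "fail_to_run"}
current = {"model_a": "pass", "model_b": "fail_accuracy", "model_c": "fail_to_run"}
print(find_accuracy_regressions(previous, current))  # ['model_b']
```

A model that was already failing in the earlier report (`model_c` here) is not reported, since only pass-to-fail transitions count as regressions.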

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0008 | 1.0057 | 2.3434 | 0.0 | 5.2693 | 1.2666 | | timm_efficientdet | 1 | 0.9803 | 0.8926 | 1.8373 | 0.0 | 4.2948 | 1.5047 | | functorch_dp_cifar10 | 64 | 1.0098 | 1.0288 | 2.1432 | 0.0 | 3.7607 | 1.2459 | | timm_vision_transformer | 8 | 1.0061 | 0.9367 | 1.5235 | 0.6774 | 2.597 | 1.4078 | | drq | 1 | 1.0063 | 0.8655 | 1.66 | 0.701 | 2.4435 | 1.064 | | BERT_pytorch | 16 | 1.0128 | 0.888 | 1.11 | 0.9921 | 2.0945 | 2.1387 | | resnext50_32x4d | 8 | 1.0028 | 1.1006 | 1.2921 | 0.0 | 2.0234 | 1.192 | | mobilenet_v3_large | 32 | 1.0036 | 1.1076 | 1.0129 | 0.0 | 1.9873 | 1.3401 | | resnet18 | 16 | 1.0019 | 1.1088 | 1.148 | 0.0 | 1.8543 | 1.2494 | | pytorch_struct | 200 | 0.9969 | 0.7519 | 0.8876 | 0.8095 | 1.8197 | 1.1619 | | squeezenet1_1 | 32 | 0.9946 | 1.0094 | 1.0664 | 0.8555 | 1.7465 | 1.2652 | | lennard_jones | 1000 | 0.9615 | 0.8552 | 1.0328 | 0.6864 | 1.7378 | 0.9441 | | hf_T5_large | 2 | 1.0245 | 0.9081 | 0.0 | 0.9845 | 1.6753 | 1.9295 | | dcgan | 32 | 0.9805 | 1.0136 | 1.2702 | 0.7708 | 1.6664 | 1.0562 | | hf_Albert | 8 | 1.0012 | 0.9963 | 0.7507 | 1.4773 | 1.6427 | 1.6398 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9993 | 1.0074 | 1.3055 | 0.8421 | 1.6241 | 1.3441 | | speech_transformer | 32 | 1.0061 | 0.9316 | 1.5091 | 0.8117 | 1.5487 | 1.5451 | | shufflenet_v2_x1_0 | 128 | 1.0027 | 1.0438 | 0.8067 | 0.0 | 1.5411 | 1.3854 | | timm_resnest | 32 | 0.9992 | 1.0022 | 0.8044 | 0.0 | 1.5171 | 1.4537 | | hf_GPT2 | 4 | 1.0075 | 0.9813 | 0.7396 | 0.4168 | 1.4972 | 1.4989 | | timm_nfnet | 128 | 0.9995 | 1.0001 | 0.0 | 1.2476 | 1.4723 | 1.4237 | | 
mnasnet1_0 | 32 | 1.001 | 1.0946 | 0.8619 | 0.0 | 1.4645 | 1.2723 |
| mobilenet_v2_quantized_qat | 96 | 1.0015 | 0.9797 | 0.0 | 0.0 | 1.4301 | 1.4311 |
| mobilenet_v2 | 96 | 0.9996 | 0.9989 | 0.7294 | 0.0 | 1.4289 | 1.4017 |
| soft_actor_critic | 256 | 0.9774 | 0.8054 | 1.0894 | 0.6863 | 1.4286 | 0.9322 |
| fastNLP_Bert | 6 | 0.999 | 0.9764 | 0.7511 | 1.1759 | 1.4211 | 1.3917 |
| resnet50_quantized_qat | 32 | 1.0004 | 0.973 | 0.0 | 0.0 | 1.3795 | 1.3803 |
| timm_efficientnet | 32 | 0.9541 | 0.8118 | 0.6972 | 0.0 | 1.3538 | 1.195 |
| LearningToPaint | 96 | 1.0012 | 1.049 | 0.8596 | 0.0 | 1.2663 | 1.1859 |
| pytorch_stargan | 16 | 0.9991 | 1.0766 | 0.933 | 0.0 | 1.2614 | 1.2286 |
| resnet50 | 32 | 0.999 | 0.9921 | 0.7608 | 0.0 | 1.2048 | 1.1686 |
| hf_Bart | 4 | 1.0124 | 0.973 | 0.7858 | 0.7878 | 1.2029 | 1.1957 |
| pytorch_unet | 1 | 0.9997 | 0.9975 | 0.8467 | 0.0 | 1.202 | 1.186 |
| hf_Bert | 4 | 1.0216 | 0.9963 | 0.7315 | 0.9151 | 1.2011 | 1.1818 |
| Super_SloMo | 6 | 0.9999 | 0.9982 | 0.8674 | 1.0023 | 1.1813 | 1.1645 |
| hf_DistilBert | 8 | 1.0008 | 0.9567 | 0.6866 | 0.5228 | 1.1729 | 1.1789 |
| vgg16 | 64 | 0.9998 | 0.999 | 0.8595 | 0.9977 | 1.1722 | 1.1668 |
| alexnet | 128 | 0.999 | 0.9971 | 0.8025 | 1.0043 | 1.1602 | 1.1631 |
| hf_Reformer | 4 | 0.9984 | 1.0012 | 0.9881 | 0.0 | 1.1311 | 1.14 |
| timm_regnet | 32 | 0.9637 | 0.9603 | 0.7797 | 0.0 | 1.126 | 1.0908 |
| Background_Matting | 4 | 1.0001 | 1.0212 | 0.8682 | 0.0 | 1.1155 | 1.1072 |
| yolov3 | 16 | 1.0 | 0.9945 | 0.7916 | 1.2029 | 1.0913 | 1.0786 |
| hf_BigBird | 2 | 0.9873 | 0.9345 | 0.9709 | 0.9006 | 1.0887 | 0.9962 |
| attention_is_all_you_need_pytorch | 256 | 1.0003 | 0.968 | 0.756 | 0.9804 | 1.0642 | 1.0483 |
| timm_vision_transformer_large | 8 | 0.9993 | 0.9953 | 0.0 | 0.976 | 1.0492 | 1.0361 |
| timm_vovnet | 32 | 0.9089 | 0.9042 | 0.7153 | 0.0 | 1.007 | 1.0165 |
| tts_angular | 64 | 0.9884 | 0.9598 | 0.9853 | 0.9695 | 1.0069 | 1.0177 |
| demucs | 4 | 0.9995 | 0.9998 | 0.9996 | 1.0002 | 1.0002 | 1.0002 |
| nvidia_deeprecommender | 256 | 0.9987 | 0.963 | 0.5847 | 0.976 | 0.9036 | 0.9642 |
| dlrm | 2048 | 0.0 | 1.0515 | 0.0 | 0.9973 | 0.0 | 1.0444 |
| hf_GPT2_large | 4 | 0.9991 | 0.9798 | 0.0 | 0.5989 | 0.0 | 1.4742 |
| hf_T5 | 8 | 0.9993 | 0.953 | 0.0 | 1.247 | 0.0 | 1.5685 |
| tacotron2 | 64 | 0.9754 | 0.8418 | 0.0 | 0.0 | 0.0 | 0.9028 |
| hf_Longformer | 2 | 0.9473 | 0.8798 | 0.8034 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientdet | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| yolov3 | 2 | pass | pass | pass | pass | pass | pass |
| dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Albert | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Bart | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass |
| hf_BigBird | 2 | pass | pass | pass | pass | pass | pass |
| hf_Bert | 2 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass |
| hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 |
| resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy |
| mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3 | 16 | 2.8614 | 7.0158 | 10.0377 | 109.6599 | 371.9531 | 363.8208 |
| timm_efficientdet | 1 | 19.224 | 33.2178 | 66.224 | nan | 122.8743 | 119.0122 |
| hf_T5_large | 2 | 13.8547 | 35.3758 | nan | 426.7214 | 102.4023 | 100.2926 |
| timm_vision_transformer_large | 8 | 2.2387 | 11.1578 | nan | 253.043 | 50.511 | 49.0287 |
| attention_is_all_you_need_pytorch | 256 | 1.1049 | 5.4952 | 8.92 | 108.9814 | 45.5129 | 44.4213 |
| densenet121 | 4 | 2.0417 | 9.6272 | 15.6198 | nan | 41.6513 | 40.5756 |
| timm_resnest | 32 | 0.5392 | 2.0095 | 3.0833 | nan | 39.8511 | 38.54 |
| hf_BigBird | 2 | 7.4753 | 12.9008 | 25.7752 | 84.9528 | 37.7178 | 25.416 |
| timm_vision_transformer | 8 | 0.7547 | 3.4535 | 4.9656 | 61.655 | 32.2756 | 29.7435 |
| hf_Bart | 4 | 1.573 | 6.4352 | 10.845 | 118.5196 | 28.5612 | 27.4618 |
| timm_nfnet | 128 | 1.914 | 6.2307 | nan | 131.6158 | 27.2858 | 27.0435 |
| BERT_pytorch | 16 | 1.4301 | 5.9278 | 8.9954 | 83.4438 | 26.7428 | 26.3124 |
| pytorch_stargan | 16 | 0.3876 | 1.7235 | 2.509 | nan | 26.573 | 26.3066 |
| resnet50_quantized_qat | 32 | 1.1032 | 7.0465 | nan | nan | 26.3433 | 26.4722 |
| mobilenet_v2_quantized_qat | 96 | 1.2571 | 7.2017 | nan | nan | 25.989 | 25.9592 |
| fastNLP_Bert | 6 | 1.4423 | 5.23 | 9.1513 | 88.2481 | 25.6569 | 24.2144 |
| speech_transformer | 32 | 1.607 | 6.8204 | 25.7941 | 117.8391 | 25.4129 | 25.0411 |
| timm_regnet | 32 | 2.2012 | 6.5009 | 17.8336 | nan | 23.0356 | 22.88 |
| mobilenet_v3_large | 32 | 0.8264 | 3.8889 | 5.7435 | nan | 22.7694 | 22.1405 |
| timm_efficientnet | 32 | 1.6793 | 5.6688 | 13.8038 | nan | 22.1784 | 21.7219 |
| pytorch_struct | 200 | 0.2413 | 0.6161 | 1.1654 | 4.0189 | 19.5008 | 18.2188 |
| hf_Reformer | 4 | 1.6925 | 2.885 | 5.6044 | nan | 19.2174 | 15.9965 |
| hf_Bert | 4 | 1.5142 | 5.2937 | 7.9301 | 89.0286 | 18.2225 | 17.5742 |
| mnasnet1_0 | 32 | 0.763 | 3.4271 | 5.2587 | nan | 18.0671 | 17.6162 |
| shufflenet_v2_x1_0 | 128 | 0.9168 | 4.0663 | 6.2239 | nan | 17.7175 | 16.8712 |
| timm_vovnet | 32 | 1.4409 | 3.7788 | 8.8736 | nan | 17.5028 | 17.2754 |
| resnet50 | 32 | 0.8201 | 3.7567 | 5.5967 | nan | 17.4673 | 16.9844 |
| hf_Albert | 8 | 1.1841 | 4.5928 | 7.5068 | 103.8845 | 17.215 | 16.4293 |
| resnext50_32x4d | 8 | 0.8406 | 3.7221 | 5.762 | nan | 16.9006 | 16.3333 |
| hf_GPT2 | 4 | 1.4463 | 5.1416 | 7.63 | 69.0378 | 16.7157 | 16.1243 |
| Super_SloMo | 6 | 0.9714 | 3.9762 | 5.5713 | 32.2723 | 16.4381 | 15.5588 |
| Background_Matting | 4 | 0.6921 | 3.5676 | 5.501 | nan | 15.9924 | 15.0031 |
| mobilenet_v2 | 96 | 0.7311 | 3.7079 | 5.8611 | nan | 15.8456 | 16.0991 |
| functorch_dp_cifar10 | 64 | 0.3423 | 1.3407 | 2.0217 | nan | 12.2127 | 12.3331 |
| hf_DistilBert | 8 | 0.6109 | 2.5533 | 4.5332 | 40.5139 | 11.6684 | 11.4337 |
| resnet18 | 16 | 0.3851 | 1.4827 | 2.1284 | nan | 10.6175 | 10.2896 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3667 | 1.5555 | 2.2852 | 30.6837 | 7.8733 | 7.6621 |
| pytorch_unet | 1 | 0.4249 | 1.6126 | 2.4816 | nan | 7.7689 | 7.4441 |
| LearningToPaint | 96 | 0.4226 | 1.5308 | 2.3427 | nan | 6.8033 | 6.6982 |
| squeezenet1_1 | 32 | 0.1909 | 0.659 | 1.0197 | 4.2894 | 3.9135 | 3.5092 |
| drq | 1 | 0.2866 | 0.5031 | 0.8449 | 4.0736 | 3.653 | 3.3213 |
| soft_actor_critic | 256 | 0.2006 | 0.2947 | 0.5216 | 1.515 | 3.364 | 2.8142 |
| vgg16 | 64 | 0.186 | 0.4632 | 0.8377 | 2.7182 | 3.3332 | 3.2707 |
| nvidia_deeprecommender | 256 | 0.1909 | 0.3714 | 0.6361 | 4.5277 | 3.213 | 2.9493 |
| alexnet | 128 | 0.1474 | 0.3139 | 0.5577 | 2.9115 | 2.8864 | 2.6028 |
| dcgan | 32 | 0.1651 | 0.3577 | 0.5697 | 4.2487 | 2.5997 | 2.3809 |
| lennard_jones | 1000 | 0.1361 | 0.2436 | 0.3939 | 1.2155 | 2.0081 | 1.7488 |
| tts_angular | 64 | 0.2053 | 0.2465 | 0.3741 | 1.0179 | 1.8876 | 1.7878 |
| demucs | 4 | 0.2968 | 0.2938 | 0.3021 | 0.2903 | 0.204 | 0.2033 |
| tacotron2 | 64 | 17.3452 | 29.1381 | nan | nan | nan | 63.1371 |
| hf_GPT2_large | 4 | 5.1006 | 15.8775 | nan | 231.6096 | nan | 41.0449 |
| hf_T5 | 8 | 2.4009 | 7.6274 | nan | 67.3711 | nan | 26.4199 |
| dlrm | 2048 | nan | 0.7163 | nan | 2.7078 | nan | 2.9103 |
| hf_Longformer | 2 | 6.1844 | 12.9431 | 57.4587 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | nan | 1.5819 | 1.5819 |
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | nan | 1.4874 | 1.4867 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2634 | nan | 1.3107 | 1.3923 |
| Super_SloMo | 6 | 1.0024 | 0.9527 | 0.3631 | 0.9528 | 1.2027 | 1.4002 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | nan | 1.1743 | 1.2832 |
| timm_efficientdet | 1 | 1.011 | 0.823 | 0.289 | nan | 1.1162 | 1.1442 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.8136 | 1.0823 | 1.1864 |
| speech_transformer | 32 | 0.9977 | 0.9148 | 0.2708 | 0.8942 | 1.0389 | 1.0454 |
| timm_nfnet | 128 | 0.936 | 0.8937 | nan | 0.8898 | 1.0219 | 1.0963 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | nan | 0.9832 | 1.0394 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | nan | 0.9814 | 1.0418 |
| hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3702 | 0.8845 | 0.9703 | 1.1374 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | nan | 0.9406 | 1.0831 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.3341 | 0.8182 | 0.9237 | 1.1052 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9981 | 0.9166 | 0.3915 | 0.8952 | 0.9169 | 0.9991 |
| pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | nan | 0.9118 | 1.105 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 |
| timm_resnest | 32 | 0.9931 | 0.8807 | 0.3236 | nan | 0.8982 | 1.0018 |
| hf_Albert | 8 | 0.9332 | 0.9332 | 0.2846 | 0.7425 | 0.8836 | 1.2212 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3278 | nan | 0.8829 | 0.896 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | 0.8425 | 0.8737 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9998 | 0.8416 | nan | 0.8374 | 0.8622 | 1.0312 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | nan | 0.8564 | 0.9343 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | nan | 0.8562 | 1.0006 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | nan | 0.8531 | 0.8659 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 0.906 | 0.8354 | 1.1229 |
| hf_Bart | 4 | 0.9617 | 0.8772 | 0.3385 | 0.8568 | 0.8318 | 1.1277 |
| resnext50_32x4d | 8 | 0.9952 | 0.8668 | 0.3592 | nan | 0.8302 | 0.8356 |
| BERT_pytorch | 16 | 1.0 | 0.898 | 0.3505 | 0.8837 | 0.826 | 1.0815 |
| hf_BigBird | 2 | 0.9608 | 0.9608 | 0.4299 | 0.9608 | 0.8211 | 1.0391 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.8875 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | nan | 0.7609 | 0.9526 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3313 | 0.8772 | 0.7517 | 0.8216 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 | 0.9991 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7449 | 0.743 | 0.8335 |
| hf_Bert | 4 | 0.9683 | 0.9018 | 0.3526 | 0.8929 | 0.7062 | 1.0016 |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | nan | 0.6902 | 0.7049 |
| LearningToPaint | 96 | 0.9471 | 0.7168 | 0.3387 | nan | 0.6889 | 0.916 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6638 | 0.6637 | 0.9553 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3213 | 0.887 | 0.6595 | 0.9466 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| hf_Reformer | 4 | 0.9872 | 0.9865 | 0.5793 | nan | 0.5232 | 0.9892 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9139 | 0.4867 | 0.6781 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.4335 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4056 | 0.4214 |
| tacotron2 | 64 | 0.9906 | 1.0301 | nan | nan | nan | 1.1623 |
| hf_T5 | 8 | 0.9527 | 0.9415 | nan | 0.8724 | nan | 1.1507 |
| hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | 0.876 | nan | 1.1258 |
| dlrm | 2048 | nan | 0.7306 | nan | 0.7305 | nan | 0.7306 |
| hf_Longformer | 2 | 0.9603 | 0.9604 | 0.2944 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
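
The per-model speedups above can be aggregated into a geometric mean per backend. The following is a minimal sketch, not the dashboard's own script: `parse_rows` and `geomean_speedup` are hypothetical helper names, and treating `0.0` and `nan` as "no measurement" is an assumption inferred from the tables (for example, `moco` is `0.0` in every speedup column and `fail_to_run` in every accuracy column). The three sample rows are copied from the torchbench speedup table.

```python
import math

COLUMNS = ["name", "bs", "eager", "aot_eager", "aot_cudagraphs",
           "nvprims_nvfuser", "inductor", "inductor_no_cudagraphs"]

# Three rows copied from the torchbench "Performance speedup" table.
SAMPLE = """
| vgg16 | 64 | 0.9998 | 0.999 | 0.8595 | 0.9977 | 1.1722 | 1.1668 |
| alexnet | 128 | 0.999 | 0.9971 | 0.8025 | 1.0043 | 1.1602 | 1.1631 |
| dlrm | 2048 | 0.0 | 1.0515 | 0.0 | 0.9973 | 0.0 | 1.0444 |
"""


def parse_rows(text):
    """Turn pipe-delimited table rows into dicts; skip borders and the header."""
    rows = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != len(COLUMNS) or cells[0] == "name":
            continue  # border line (+----+) or header row
        row = {"name": cells[0], "bs": int(cells[1])}
        for col, cell in zip(COLUMNS[2:], cells[2:]):
            row[col] = float(cell)  # float("nan") parses as NaN
        rows.append(row)
    return rows


def geomean_speedup(rows, backend):
    """Geometric mean of one backend column.

    Assumption: entries that are 0.0 or NaN mean "no measurement" and are skipped.
    """
    values = [r[backend] for r in rows if r[backend] > 0.0]  # NaN > 0.0 is False
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


rows = parse_rows(SAMPLE)
print(f"inductor geomean over sample: {geomean_speedup(rows, 'inductor'):.4f}")
```

A geometric mean is the usual choice for averaging ratios, because a 2x speedup and a 0.5x slowdown cancel to 1.0x instead of averaging to 1.25x.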

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0344 | 0.8988 | 1.7609 | 0.7669 | 3.2462 | 1.4282 |
| CamemBert | 1 | 1.0489 | 0.9111 | 1.3153 | 0.7487 | 2.3839 | 1.4892 |
| MT5ForConditionalGeneration | 8 | 1.0249 | 0.9058 | 1.197 | 1.0478 | 2.2642 | 1.9968 |
| DistillGPT2 | 1 | 1.0362 | 0.9281 | 1.0569 | 0.2843 | 2.1735 | 1.7704 |
| MobileBertForMaskedLM | 32 | 1.0219 | 0.9277 | 1.1471 | 0.0 | 2.1432 | 1.5437 |
| GoogleFnet | 1 | 0.9781 | 0.7916 | 0.9608 | 0.6787 | 1.8333 | 1.1417 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9777 | 0.0 | 0.7332 | 1.796 | 1.7868 |
| M2M100ForConditionalGeneration | 8 | 1.1668 | 0.8916 | 0.8688 | 0.8792 | 1.4677 | 1.3152 |
| T5ForConditionalGeneration | 4 | 1.0045 | 0.9328 | 0.7238 | 1.1659 | 1.4575 | 1.4377 |
| ElectraForQuestionAnswering | 64 | 1.0001 | 0.984 | 0.0 | 1.2717 | 1.4259 | 1.4061 |
| ElectraForCausalLM | 32 | 1.0002 | 0.9308 | 0.0 | 1.0449 | 1.4126 | 1.447 |
| MobileBertForQuestionAnswering | 64 | 1.0269 | 0.899 | 0.8661 | 0.0 | 1.4009 | 1.3149 |
| LayoutLMForSequenceClassification | 16 | 0.9999 | 0.9888 | 0.7371 | 1.1677 | 1.3004 | 1.2892 |
| T5Small | 1 | 1.0283 | 0.898 | 1.0214 | 1.0075 | 1.2743 | 1.1416 |
| AlbertForQuestionAnswering | 4 | 1.0013 | 1.0016 | 0.0 | 1.2136 | 1.2615 | 1.259 |
| AlbertForMaskedLM | 4 | 1.0002 | 0.9995 | 0.0 | 1.2086 | 1.2555 | 1.2542 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9694 | 0.0 | 1.0981 | 1.2117 | 1.2128 |
| PLBartForConditionalGeneration | 16 | 1.0171 | 0.9677 | 0.82 | 0.8295 | 1.2074 | 1.2039 |
| OPTForCausalLM | 32 | 1.001 | 0.9321 | 0.7133 | 0.4583 | 1.1814 | 1.2322 |
| XGLMForCausalLM | 8 | 1.0134 | 0.8793 | 0.7416 | 0.3262 | 1.1703 | 1.183 |
| DistilBertForQuestionAnswering | 64 | 0.9996 | 0.985 | 0.713 | 0.5283 | 1.1701 | 1.151 |
| RobertaForCausalLM | 64 | 1.0005 | 0.9613 | 0.7458 | 0.9897 | 1.1479 | 1.1508 |
| MegatronBertForQuestionAnswering | 16 | 1.0391 | 1.0134 | 0.7678 | 0.904 | 1.1423 | 1.1242 |
| Speech2Text2ForCausalLM | 128 | 0.9987 | 0.9247 | 0.6616 | 0.9473 | 1.1342 | 1.152 |
| MegatronBertForCausalLM | 16 | 1.0352 | 1.0109 | 0.7389 | 0.9715 | 1.1289 | 1.1169 |
| BertForQuestionAnswering | 128 | 1.0003 | 0.9934 | 0.0 | 1.0534 | 1.1144 | 1.1076 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9929 | 0.0 | 1.0538 | 1.1124 | 1.1142 |
| BartForConditionalGeneration | 2 | 1.0002 | 0.9869 | 0.0 | 0.4455 | 1.1005 | 1.0887 |
| BartForCausalLM | 4 | 1.0008 | 0.9659 | 0.7558 | 1.0034 | 1.0903 | 1.1102 |
| BigBird | 1 | 0.9842 | 0.9253 | 0.9888 | 0.8937 | 1.0902 | 0.9951 |
| PegasusForConditionalGeneration | 16 | 1.01 | 0.9642 | 0.7552 | 0.9091 | 1.0885 | 1.0682 |
| MBartForConditionalGeneration | 16 | 1.0101 | 0.9844 | 0.7644 | 0.9354 | 1.0882 | 1.1586 |
| DebertaForMaskedLM | 4 | 0.9045 | 0.7846 | 0.723 | 0.6431 | 1.0785 | 1.0406 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0007 | 0.9255 | 0.0 | 0.9561 | 1.0642 | 1.0726 |
| BertForMaskedLM | 64 | 1.0001 | 0.9609 | 0.7301 | 0.9877 | 1.0587 | 1.0605 |
| DistilBertForMaskedLM | 64 | 0.9998 | 0.9507 | 0.7124 | 0.618 | 1.0496 | 1.0677 |
| DebertaForQuestionAnswering | 8 | 0.996 | 0.966 | 0.6825 | 0.8678 | 1.0489 | 1.2207 |
| PLBartForCausalLM | 32 | 1.0063 | 0.9333 | 0.718 | 0.9233 | 1.0279 | 1.0546 |
| BlenderbotSmallForCausalLM | 64 | 1.0012 | 0.9104 | 0.6832 | 0.9228 | 1.0063 | 1.043 |
| TrOCRForCausalLM | 32 | 1.0008 | 0.9558 | 0.7333 | 0.9509 | 1.0037 | 1.014 |
| MBartForCausalLM | 32 | 1.0004 | 0.9539 | 0.7319 | 0.956 | 0.9984 | 1.0098 |
| PegasusForCausalLM | 32 | 0.9994 | 0.9522 | 0.7318 | 0.9518 | 0.991 | 1.0027 |
| AllenaiLongformerBase | 1 | 0.9248 | 0.8421 | 0.7665 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | pass | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BigBird | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 4.8696 | 10.4793 | 34.4488 | 80.102 | 95.1799 | 33.9856 |
| DebertaForMaskedLM | 4 | 4.8903 | 10.1684 | 39.0136 | 82.2904 | 89.5165 | 32.8486 |
| XGLMForCausalLM | 8 | 2.4373 | 10.1151 | 22.2057 | 184.0347 | 67.472 | 64.5787 |
| M2M100ForConditionalGeneration | 8 | 2.6255 | 12.8551 | 20.3877 | 240.1962 | 50.6846 | 53.7745 |
| MobileBertForMaskedLM | 32 | 8.2909 | 24.9714 | 41.3933 | nan | 48.7079 | 47.9116 |
| MobileBertForQuestionAnswering | 64 | 8.3929 | 23.8292 | 41.2316 | nan | 48.0873 | 48.1985 |
| BartForConditionalGeneration | 2 | 3.0164 | 12.3913 | nan | 261.2649 | 43.1544 | 40.7754 |
| PegasusForConditionalGeneration | 16 | 2.8098 | 12.217 | 20.3838 | 266.7147 | 42.5357 | 39.3584 |
| MBartForConditionalGeneration | 16 | 3.0064 | 12.8568 | 22.1754 | 271.3029 | 41.265 | 39.9633 |
| YituTechConvBert | 1 | 2.2847 | 8.4615 | 12.8935 | 128.5077 | 39.1252 | 36.8927 |
| BigBird | 1 | 7.4673 | 13.2271 | 25.8571 | 97.2564 | 37.3978 | 24.4872 |
| MegatronBertForCausalLM | 16 | 3.25 | 10.8935 | 16.6483 | 190.2921 | 32.5107 | 31.4699 |
| MegatronBertForQuestionAnswering | 16 | 3.2629 | 10.8829 | 17.1363 | 188.8132 | 32.2158 | 30.6754 |
| MT5ForConditionalGeneration | 8 | 3.7736 | 11.2854 | 17.9664 | 104.4138 | 31.3498 | 30.5518 |
| T5ForConditionalGeneration | 4 | 2.4031 | 8.0927 | 12.7737 | 67.9725 | 29.6106 | 28.1192 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.9057 | 8.3398 | nan | 164.3311 | 28.9149 | 27.9222 |
| T5Small | 1 | 2.4009 | 7.7054 | 11.553 | 70.5699 | 28.2884 | 27.324 |
| LayoutLMForSequenceClassification | 16 | 1.8371 | 5.7627 | 9.2001 | 90.5694 | 27.2105 | 25.9046 |
| PLBartForConditionalGeneration | 16 | 1.6054 | 6.6586 | 10.115 | 117.193 | 25.7334 | 25.1247 |
| ElectraForCausalLM | 32 | 1.5128 | 5.4868 | nan | 88.7785 | 25.6426 | 23.597 |
| PegasusForCausalLM | 32 | 1.1507 | 4.9241 | 7.9631 | 86.0692 | 21.1082 | 19.9817 |
| MBartForCausalLM | 32 | 1.1314 | 4.719 | 7.6295 | 89.0267 | 20.6058 | 20.1791 |
| GoogleFnet | 1 | 0.9536 | 2.926 | 9.0179 | 70.125 | 20.3296 | 13.4172 |
| LayoutLMForMaskedLM | 16 | 1.9758 | 5.8564 | nan | 87.4187 | 20.3206 | 19.4557 |
| BertForMaskedLM | 64 | 1.5049 | 5.2893 | 7.9608 | 90.3134 | 19.7607 | 19.0687 |
| TrOCRForCausalLM | 32 | 1.1652 | 4.9065 | 7.5491 | 89.377 | 19.5229 | 18.252 |
| ElectraForQuestionAnswering | 64 | 1.495 | 5.3576 | nan | 87.376 | 19.2805 | 18.7191 |
| RobertaForCausalLM | 64 | 1.4981 | 5.9284 | 8.172 | 90.7714 | 19.2058 | 18.4299 |
| BertForQuestionAnswering | 128 | 1.4996 | 5.37 | nan | 86.7613 | 19.0338 | 18.2877 |
| BartForCausalLM | 4 | 1.2393 | 4.7412 | 7.3513 | 89.429 | 18.9341 | 18.4051 |
| RobertaForQuestionAnswering | 128 | 1.5276 | 5.5296 | nan | 89.6935 | 18.2219 | 17.4613 |
| CamemBert | 1 | 1.5741 | 5.4813 | 7.5863 | 97.4246 | 17.7886 | 18.1791 |
| OPTForCausalLM | 32 | 1.2069 | 4.8382 | 9.4313 | 85.7846 | 17.089 | 16.6391 |
| GPT2ForSequenceClassification | 4 | 1.4922 | 5.3664 | nan | 70.7582 | 16.288 | 15.8247 |
| AlbertForMaskedLM | 4 | 1.2941 | 4.7028 | nan | 103.089 | 16.2048 | 15.0298 |
| AlbertForQuestionAnswering | 4 | 1.2907 | 4.7446 | nan | 100.8213 | 15.793 | 14.9597 |
| Speech2Text2ForCausalLM | 128 | 0.7228 | 2.6601 | 4.147 | 36.6383 | 14.6927 | 13.351 |
| BlenderbotSmallForCausalLM | 64 | 0.7996 | 3.2077 | 4.9571 | 54.4288 | 14.2352 | 13.6993 |
| PLBartForCausalLM | 32 | 0.6579 | 2.7688 | 3.8771 | 42.7143 | 13.2429 | 13.0004 |
| DistillGPT2 | 1 | 0.8116 | 2.6374 | 3.9301 | 39.9989 | 12.4299 | 12.0662 |
| DistilBertForMaskedLM | 64 | 0.6267 | 2.6339 | 4.5363 | 42.5658 | 11.315 | 10.7331 |
| DistilBertForQuestionAnswering | 64 | 0.6283 | 2.6887 | 4.4822 | 39.0327 | 10.7323 | 10.169 |
| AllenaiLongformerBase | 1 | 6.2745 | 13.2036 | 57.3764 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 0.8955 | 1.0595 | 1.1224 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.5681 | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9029 | 0.3414 | 0.8577 | 0.8453 | 1.0606 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | 0.9642 | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.5667 | 0.842 | 1.3737 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9093 | 0.8215 | 1.1049 |
| BigBird | 1 | 0.9979 | 0.9536 | 0.4208 | 0.9117 | 0.821 | 1.0085 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | 0.9267 | 0.8157 | 0.9642 |
| M2M100ForConditionalGeneration | 8 | 1.0217 | 0.9507 | 0.3799 | 0.9742 | 0.8138 | 1.0093 |
| DistillGPT2 | 1 | 0.9984 | 0.8113 | 0.3769 | 0.76 | 0.8057 | 0.9257 |
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.7909 | 0.7929 | 0.9036 |
| YituTechConvBert | 1 | 0.9863 | 0.8573 | 0.3681 | 0.8286 | 0.7888 | 0.8725 |
| PegasusForCausalLM | 32 | 0.9594 | 0.8885 | 0.3909 | 0.9232 | 0.7774 | 0.931 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.8866 | 0.7734 | 0.9515 |
| GoogleFnet | 1 | 0.9979 | 0.9451 | 0.3715 | 0.9293 | 0.7698 | 0.9372 |
| MT5ForConditionalGeneration | 8 | 1.0037 | 0.8873 | 0.4151 | 0.8853 | 0.763 | 0.9406 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | 0.8549 | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3612 | 0.7949 | 0.7487 | 0.9186 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.861 | 0.7381 | 0.9055 |
| PLBartForConditionalGeneration | 16 | 0.9998 | 0.8959 | 0.3581 | 0.872 | 0.7238 | 0.9373 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | 0.3438 | 0.8566 | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 0.9204 | 0.7189 | 1.0294 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | 0.8713 | 0.7161 | 0.9247 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.8956 | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.8401 | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 0.9357 | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3178 | 0.8865 | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 0.8975 | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.8883 | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3743 | 0.89 | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | 0.3743 | 0.8898 | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.8765 | 0.6775 | 0.9179 |
| OPTForCausalLM | 32 | 0.9982 | 0.8657 | 0.3606 | 0.7895 | 0.6764 | 0.8848 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.8016 | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.855 | 0.6385 | 0.8992 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3641 | 0.8538 | 0.6375 | 0.8974 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.9303 | 0.6329 | 0.8939 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.9303 | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | 0.3242 | nan | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | 0.2587 | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9843 | 0.3552 | 0.9262 | 0.386 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.063 | 0.2902 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9982 | 0.9521 | 0.3208 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
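
The executive summary states that models failing accuracy checks are removed before performance is aggregated. The sketch below illustrates that filter on four `inductor` entries copied from the huggingface tables above. `keep_passing` is a hypothetical helper, and counting any status that starts with `pass` (including `pass_due_to_skip`) as passing is an assumption, not something the dashboard states.

```python
# Inductor column values copied from the huggingface speedup and accuracy tables.
inductor_speedup = {
    "T5Small": 1.2743,
    "MBartForConditionalGeneration": 1.0882,
    "PLBartForConditionalGeneration": 1.2074,
    "AllenaiLongformerBase": 0.0,
}
inductor_accuracy = {
    "T5Small": "pass",
    "MBartForConditionalGeneration": "fail_to_run",
    "PLBartForConditionalGeneration": "fail_to_run",
    "AllenaiLongformerBase": "fail_to_run",
}


def keep_passing(speedup, accuracy):
    """Keep only models whose accuracy status starts with "pass".

    Assumption: "pass_due_to_skip" counts as passing; a model missing from
    the accuracy table is treated as failing.
    """
    return {
        name: value
        for name, value in speedup.items()
        if accuracy.get(name, "").startswith("pass")
    }


kept = keep_passing(inductor_speedup, inductor_accuracy)
print(kept)
```

The two `*BartForConditionalGeneration` rows show why the filter uses the accuracy table instead of looking for `0.0` speedups: both report a non-zero `inductor` speedup even though their accuracy run is `fail_to_run`.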

timm_models suite with float32 precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9994 | 0.9731 | 0.8183 | 0.0 | 1.8718 | 1.8284 |
| lcnet_050 | 128 | 0.9558 | 0.9489 | 0.7699 | 1.3477 | 1.6601 | 1.6228 |
| regnety_002 | 128 | 0.9757 | 1.0017 | 0.8619 | 0.0 | 1.4928 | 1.3259 |
| dm_nfnet_f0 | 128 | 0.9999 | 0.9997 | 0.0 | 1.2524 | 1.4716 | 1.4239 |
| xcit_large_24_p8_224 | 5 | 1.0025 | 0.9839 | 0.7787 | 0.0 | 1.4359 | 1.3257 |
| hrnet_w18 | 128 | 0.9999 | 0.9983 | 0.0 | 0.0 | 1.4165 | 1.3777 |
| dla102 | 128 | 0.9999 | 1.0006 | 0.0 | 0.0 | 1.3836 | 1.3692 |
| volo_d1_224 | 64 | 1.0 | 0.9945 | 0.802 | 0.0 | 1.3817 | 1.36 |
| nfnet_l0 | 128 | 0.9996 | 0.789 | 0.0 | 1.2306 | 1.3724 | 1.3282 |
| res2net50_14w_8s | 128 | 0.9998 | 0.9992 | 0.0 | 0.0 | 1.3566 | 1.3244 |
| mobilenetv3_large_100 | 128 | 0.9658 | 0.9618 | 0.7658 | 0.0 | 1.3373 | 1.3431 |
| mobilenetv2_100 | 128 | 0.9647 | 0.9637 | 0.7075 | 0.0 | 1.3369 | 1.354 |
| coat_lite_mini | 128 | 0.9999 | 0.9834 | 0.8344 | 1.1056 | 1.333 | 1.3212 |
| inception_v3 | 128 | 0.9999 | 0.996 | 0.0 | 0.0 | 1.3299 | 1.3084 |
| gluon_inception_v3 | 128 | 0.9999 | 0.9984 | 0.0 | 0.0 | 1.3281 | 1.3084 |
| adv_inception_v3 | 128 | 1.0 | 0.9989 | 0.0 | 0.0 | 1.3237 | 1.3076 |
| crossvit_9_240 | 128 | 0.9997 | 0.9982 | 0.7599 | 1.0529 | 1.3213 | 1.3008 |
| resnest101e | 64 | 0.9996 | 1.003 | 0.0 | 0.0 | 1.3157 | 1.2707 |
| res2next50 | 128 | 0.9999 | 1.0007 | 0.0 | 0.0 | 1.3098 | 1.2736 |
| jx_nest_base | 32 | 1.0003 | 0.9955 | 0.7311 | 0.0 | 1.2777 | 1.2486 |
| fbnetv3_b | 128 | 0.9642 | 0.9607 | 0.7578 | 0.0 | 1.2759 | 1.2981 |
| sebotnet33ts_256 | 64 | 0.9758 | 0.803 | 0.0 | 0.0 | 1.2673 | 1.2692 |
| selecsls42b | 128 | 0.9999 | 0.9988 | 0.8164 | 0.0 | 1.2673 | 1.2531 |
| eca_botnext26ts_256 | 128 | 0.9867 | 0.7712 | 0.0 | 0.0 | 1.2659 | 1.2526 |
| gmixer_24_224 | 128 | 0.9999 | 0.8097 | 0.0 | 1.0484 | 1.2617 | 1.2341 |
| eca_halonext26ts | 128 | 0.9871 | 0.7786 | 0.0 | 0.0 | 1.2592 | 1.244 |
| botnet26t_256 | 128 | 0.9856 | 0.9814 | 0.7881 | 0.0 | 1.2575 | 1.2606 |
| mnasnet_100 | 128 | 0.966 | 0.9637 | 0.7877 | 0.0 | 1.2555 | 1.2822 |
| tf_efficientnet_b0 | 128 | 0.9767 | 0.7831 | 0.0 | 0.0 | 1.2551 | 1.2683 |
| fbnetc_100 | 128 | 0.9669 | 0.9628 | 0.7918 | 0.0 | 1.2497 | 1.2646 |
| ese_vovnet19b_dw | 128 | 0.9791 | 0.9776 | 0.7447 | 0.0 | 1.2409 | 1.2475 |
| spnasnet_100 | 128 | 0.961 | 0.9576 | 0.775 | 0.0 | 1.2373 | 1.253 |
| res2net101_26w_4s | 64 | 0.9999 | 0.9971 | 0.7756 | 0.0 | 1.2236 | 1.1884 |
| convit_base | 64 | 0.9997 | 0.9981 | 0.0 | 1.3105 | 1.2196 | 1.2094 |
| rexnet_100 | 128 | 0.9732 | 0.8157 | 0.0 | 0.0 | 1.212 | 1.2191 |
| cspdarknet53 | 64 | 0.9582 | 0.9523 | 0.737 | 1.2258 | 1.2104 | 1.2375 |
| pnasnet5large | 16 | 0.9996 | 0.9982 | 0.0 | 0.0 | 1.2101 | 1.1942 |
| twins_pcpvt_base | 64 | 1.0 | 0.9981 | 0.7489 | 1.0218 | 1.2084 | 1.1684 |
| gmlp_s16_224 | 128 | 1.0 | 0.9493 | 0.0 | 1.0772 | 1.2002 | 1.1894 |
| tinynet_a | 128 | 0.966 | 0.7753 | 0.6219 | 0.0 | 1.1899 | 1.194 |
| dpn107 | 32 | 0.9577 | 0.9506 | 0.7805 | 0.0 | 1.1877 | 1.1992 |
| pit_b_224 | 64 | 1.0003 | 0.9992 | 0.0 | 1.0508 | 1.1876 | 1.1775 |
| cait_m36_384 | 4 | 1.0001 | 1.0266 | 0.0 | 1.0929 | 1.1807 | 1.157 |
| repvgg_a2 | 128 | 0.964 | 0.9623 | 0.8285 | 1.1371 | 1.1713 | 1.1687 |
| tf_mixnet_l | 128 | 0.9856 | 0.8896 | 0.0 | 0.0 | 1.1693 | 1.167 |
| mobilevit_s | 64 | 0.9791 | 0.7621 | 0.0 | 0.0 | 1.1676 | 1.1689 |
| poolformer_m36 | 64 | 0.9998 | 0.9983 | 0.0 | 0.0 | 1.1668 | 1.1468 |
| mixnet_l | 128 | 0.9848 | 0.8855 | 0.0 | 0.0 | 1.1503 | 1.1485 |
| swin_base_patch4_window7_224 | 64 | 1.0002 | 0.9779 | 0.0 | 0.0 | 1.1363 | 1.1333 |
| beit_base_patch16_224 | 64 | 0.9997 | 0.9823 | 0.0 | 0.9404 | 1.1137 | 1.1025 |
| swsl_resnext101_32x16d | 32 | 0.9999 | 0.9995 | 0.0 | 0.0 | 1.1075 | 1.0713 |
| deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9984 | 0.7679 | 1.0025 | 1.0947 | 1.0821 |
| gluon_xception65 | 32 | 0.9999 | 0.997 | 0.0 | 0.0 | 1.0869 | 1.0755 |
| vit_base_patch16_224 | 64 | 1.0002 | 0.9981 | 0.7651 | 0.9715 | 1.0864 | 1.0709 |
| convmixer_768_32 | 32 | 0.9998 | 0.9998 | 0.0 | 0.0 | 1.0776 | 1.0742 |
| gernet_l | 128 | 0.9739 | 0.9725 | 0.8228 | 0.0 | 1.076 | 1.0708 |
| convnext_base | 64 | 0.9999 | 0.9984 | 0.0 | 1.2056 | 1.074 | 1.0694 |
| mixer_b16_224 | 128 | 1.0 | 0.9778 | 0.0 | 0.9032 | 1.0662 | 1.0611 |
| visformer_small | 128 | 0.9996 | 1.0017 | 0.798 | 0.0 | 1.0471 | 1.0124 |
| resmlp_12_224 | 128 | 0.9998 | 0.8547 | 0.612 | 1.0527 | 0.7921 | 0.8299 |
| tnt_s_patch16_224 | 128 | 1.0001 | 0.9993 | 0.0 | 0.0 | 0.0 | 1.5428 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 
| pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | pass | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass 
| | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | jx_nest_base | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 5.6758 | 24.2259 | nan | nan | 97.9129 | 94.4011 | | swin_base_patch4_window7_224 | 64 | 2.5127 | 11.1487 | nan | nan | 74.4331 | 73.0858 | | mobilevit_s | 64 | 1.6771 | 5.9554 | nan | nan | 72.5534 | 70.5904 | | xcit_large_24_p8_224 | 5 | 2.5972 | 13.7943 | 26.3818 | nan | 72.1378 | 68.2345 | | pnasnet5large | 16 | 4.4234 | 18.195 | nan | nan | 70.2334 | 66.2853 | | twins_pcpvt_base | 64 | 2.2111 | 10.3305 | 18.7337 | 305.7279 | 61.7391 | 61.8269 | | cait_m36_384 | 4 | 2.6499 | 14.2511 | nan | 341.4789 | 60.2508 | 58.4439 | | convnext_base | 64 | 1.1844 | 5.1544 | nan | 114.4446 | 59.2765 | 58.0018 | | resnest101e | 64 | 3.1624 | 12.8703 | nan | nan | 55.191 | 53.935 | | jx_nest_base | 32 | 1.7818 | 7.4274 | 13.427 | nan | 53.2076 | 50.6647 | | res2net101_26w_4s | 64 | 2.981 | 13.2332 | 22.8468 | nan | 52.9602 | 48.8257 | | res2net50_14w_8s | 128 | 2.5697 | 12.0241 | nan | nan | 47.7386 | 44.5669 | | coat_lite_mini | 128 | 1.1213 | 4.1486 | 6.5904 | 85.8414 | 47.2758 | 47.0832 | | sebotnet33ts_256 | 64 | 1.5707 | 5.4185 | nan | nan | 46.5574 | 45.5205 | | eca_halonext26ts | 128 | 1.4776 | 4.5311 | nan | nan | 46.4726 | 45.7912 | | poolformer_m36 | 64 | 1.8539 | 7.0827 | nan | nan | 43.8955 | 43.9633 | | gmlp_s16_224 | 128 | 0.9854 | 5.1687 | nan | 119.5521 | 39.3374 | 37.3765 | | eca_botnext26ts_256 | 128 | 1.3412 | 4.4601 | nan | nan | 38.723 | 37.7321 | | dpn107 | 32 | 3.7593 | 11.499 | 35.9943 | nan | 37.6314 | 35.6378 | | fbnetv3_b | 128 | 2.9909 | 9.2712 | 25.5834 | nan | 37.0176 | 32.9729 | | crossvit_9_240 | 128 | 1.3783 | 6.3515 | 10.4875 | 151.9715 | 36.5609 | 34.4028 | | botnet26t_256 | 128 | 1.3216 | 3.6816 | 8.3269 | nan | 35.2519 | 35.0131 | | volo_d1_224 | 64 | 1.3955 | 6.0708 | 9.9995 | nan | 35.1971 | 32.658 | | gluon_xception65 | 32 | 1.7601 | 8.7117 | nan | nan | 34.1982 | 32.1738 | | 
adv_inception_v3 | 128 | 1.5995 | 6.837 | nan | nan | 32.8611 | 30.2135 | | inception_v3 | 128 | 1.5111 | 7.0072 | nan | nan | 31.8237 | 30.8443 | | gluon_inception_v3 | 128 | 1.502 | 6.873 | nan | nan | 31.6109 | 30.9949 | | ghostnet_100 | 128 | 2.6525 | 7.9379 | 12.7268 | nan | 31.1712 | 30.1409 | | tf_mixnet_l | 128 | 5.5719 | 11.2945 | nan | nan | 30.7085 | 29.4208 | | dla102 | 128 | 1.7017 | 7.6465 | nan | nan | 29.7943 | 28.6222 | | mixnet_l | 128 | 5.2959 | 10.8869 | nan | nan | 29.5401 | 28.9401 | | gmixer_24_224 | 128 | 1.0432 | 5.7894 | nan | 119.8018 | 29.2157 | 28.4857 | | swsl_resnext101_32x16d | 32 | 1.6284 | 7.482 | nan | nan | 28.6665 | 27.2489 | | dm_nfnet_f0 | 128 | 2.0469 | 6.6058 | nan | 131.7866 | 28.4855 | 27.7133 | | convit_base | 64 | 1.0715 | 4.7688 | nan | 99.9877 | 27.4687 | 26.505 | | res2next50 | 128 | 1.5744 | 6.7033 | nan | nan | 27.3919 | 25.8268 | | tinynet_a | 128 | 1.9908 | 6.5318 | 17.5592 | nan | 25.4807 | 24.1305 | | rexnet_100 | 128 | 1.8109 | 6.1602 | nan | nan | 25.4385 | 24.8808 | | tf_efficientnet_b0 | 128 | 1.7427 | 5.6416 | nan | nan | 22.605 | 21.0662 | | cspdarknet53 | 64 | 2.1923 | 6.3219 | 16.6104 | 111.791 | 22.3753 | 21.0157 | | resmlp_12_224 | 128 | 0.6079 | 2.4407 | 3.9406 | 29.6501 | 22.1366 | 20.9371 | | mixer_b16_224 | 128 | 0.6668 | 2.6879 | nan | 60.4097 | 22.033 | 20.1957 | | visformer_small | 128 | 0.927 | 3.4075 | 5.4637 | nan | 21.353 | 20.4765 | | nfnet_l0 | 128 | 1.7629 | 6.1974 | nan | 119.5436 | 21.0405 | 19.9757 | | convmixer_768_32 | 32 | 1.0919 | 4.9011 | nan | nan | 21.0102 | 19.7444 | | spnasnet_100 | 128 | 1.8763 | 5.353 | 14.956 | nan | 20.6172 | 19.5156 | | fbnetc_100 | 128 | 1.9551 | 5.5139 | 15.2878 | nan | 20.5856 | 19.8788 | | mobilenetv3_large_100 | 128 | 1.4853 | 4.5798 | 11.7251 | nan | 19.5646 | 18.995 | | beit_base_patch16_224 | 64 | 1.0998 | 4.2197 | nan | 76.8368 | 19.3602 | 18.6469 | | deit_base_distilled_patch16_224 | 64 | 0.8309 | 3.4963 | 5.8135 | 64.2137 | 19.3532 | 18.1886 | 
| mnasnet_100 | 128 | 1.5433 | 4.4024 | 11.6242 | nan | 18.6775 | 16.7912 | | vit_base_patch16_224 | 64 | 0.8307 | 3.4866 | 6.1571 | 62.9057 | 18.5383 | 17.9197 | | mobilenetv2_100 | 128 | 1.6797 | 4.5132 | 11.6029 | nan | 18.3415 | 17.3047 | | repvgg_a2 | 128 | 1.8844 | 5.2869 | 13.9984 | 216.8296 | 17.6041 | 16.9046 | | pit_b_224 | 64 | 0.9745 | 3.9159 | nan | 82.2597 | 17.3771 | 16.6796 | | gernet_l | 128 | 1.8878 | 5.0447 | 13.713 | nan | 17.0365 | 16.3314 | | regnety_002 | 128 | 1.5071 | 4.5384 | 11.3701 | nan | 16.9243 | 16.4111 | | selecsls42b | 128 | 0.8012 | 2.9711 | 4.8528 | nan | 15.2159 | 14.5701 | | lcnet_050 | 128 | 0.9774 | 2.8222 | 6.6583 | 67.5505 | 13.0877 | 12.0854 | | ese_vovnet19b_dw | 128 | 0.9845 | 2.526 | 5.9736 | nan | 12.4852 | 11.6486 | | tnt_s_patch16_224 | 128 | 1.546 | 8.1097 | nan | nan | nan | 31.7087 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | 0.9166 | 1.5552 | 1.6267 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2616 | nan | 1.351 | 1.5843 | | nfnet_l0 | 128 | 0.9931 | 0.8274 | nan | 0.8322 | 1.2911 | 1.4945 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | nan | 1.2619 | 1.4738 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | nan | 1.2059 | 1.3819 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | nan | 1.1771 | 1.3424 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | nan | 1.1752 | 1.2828 | | eca_botnext26ts_256 | 128 | 
0.9938 | 0.7674 | nan | nan | 1.1378 | 1.3608 | | eca_halonext26ts | 128 | 0.9938 | 0.7687 | nan | nan | 1.1376 | 1.3403 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | 0.933 | 1.1184 | 1.1751 | | poolformer_m36 | 64 | 0.9979 | 0.9511 | nan | nan | 1.0526 | 1.0689 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8935 | nan | 0.8897 | 1.0218 | 1.0961 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.9286 | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | nan | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9962 | 0.9435 | 0.3153 | 0.9163 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | nan | 0.9926 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9963 | 0.9441 | 0.3137 | 0.9167 | 0.9926 | 1.0799 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | 0.8423 | 0.9924 | 1.0856 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | nan | 0.9853 | 1.1265 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9837 | 1.0658 | | mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | 0.8965 | 0.9827 | 1.0538 | | tf_mixnet_l | 128 | 0.9953 | 0.8572 | nan | nan | 0.9769 | 1.1451 | | gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | 0.9209 | 0.9766 | 0.9827 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | 0.3269 | nan | 0.9633 | 1.0572 | | dla102 | 128 | 0.9831 | 0.9169 | nan | nan | 0.9632 | 1.0419 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | nan | 0.952 | 1.0925 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | nan | 0.942 | 0.9938 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | nan | 0.9408 | 1.0412 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | nan | 0.9382 | 0.993 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | nan | 0.9379 | 1.0122 | | jx_nest_base | 32 | 1.0003 | 0.8968 | 0.2863 | nan | 0.9348 | 1.0604 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | nan | 0.9325 | 0.9919 | | res2net101_26w_4s | 64 | 0.9967 | 0.9277 | 0.3243 | nan | 0.9285 | 1.015 | | 
lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7725 | 0.9152 | 0.9655 | | gluon_inception_v3 | 128 | 0.9902 | 0.8617 | nan | nan | 0.9138 | 1.0634 | | adv_inception_v3 | 128 | 0.9902 | 0.8617 | nan | nan | 0.9138 | 1.0635 | | inception_v3 | 128 | 0.9902 | 0.8617 | nan | nan | 0.9137 | 1.0634 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.8692 | 0.9127 | 0.9981 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | nan | 0.9078 | 1.0156 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9069 | 1.0515 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | nan | 0.9069 | 1.0618 | | dpn107 | 32 | 0.9985 | 0.9272 | 0.3392 | nan | 0.9059 | 0.9905 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8297 | 0.9052 | 1.0666 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | nan | 0.9049 | 0.9968 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.994 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | nan | 0.899 | 1.0046 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8973 | nan | nan | 0.8932 | 0.9946 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | nan | 0.8821 | 1.0206 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | nan | 0.8617 | 1.0396 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 0.7501 | 0.8563 | 1.0752 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7085 | nan | nan | 0.841 | 0.9709 | | coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | 0.7284 | 0.821 | 1.0246 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | nan | 0.7928 | 0.9926 | | resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | 0.6275 | 0.7899 | 0.7979 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.7257 | 0.7684 | 0.9902 | | convit_base | 64 | 0.9977 | 0.8838 | nan | 0.8762 | 0.7462 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8656 | 0.282 | 0.8418 | 0.6584 | 0.8853 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8622 | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

../test-dynamo-runner-logs-4/huggingface_float32.png : ![](https://i.imgur.com/pcfkjwT.png)

../test-dynamo-runner-logs-4/timm_models_float32.png : ![](https://i.imgur.com/zhFkGPw.png)

../test-dynamo-runner-logs-4/torchbench_float32.png : ![](https://i.imgur.com/W4iUAqD.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 82%, 53/65 | 84%, 43/51  | 82%, 61/74  |
|       aot_eager        | 83%, 54/65 | 84%, 43/51  | 82%, 61/74  |
|     aot_cudagraphs     | 69%, 45/65 | 65%, 33/51  | 38%, 28/74  |
|    nvprims_nvfuser     | 48%, 31/65 | 78%, 40/51  | 26%, 19/74  |
|        inductor        | 75%, 49/65 | 82%, 42/51  | 81%, 60/74  |
| inductor_no_cudagraphs | 82%, 53/65 | 82%, 42/51  | 82%, 61/74  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.11x    |    1.04x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.03x    |    1.11x    |
|        inductor        |   1.50x    |    1.29x    |    1.25x    |
| inductor_no_cudagraphs |   1.24x    |    1.22x    |    1.23x    |
+------------------------+------------+-------------+-------------+
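As a rough illustration of how a per-suite geometric mean speedup could be derived from the per-model tables, here is a minimal Python sketch. It is not the dashboard's actual code: the function name `geomean_speedup`, the choice to count only an exact `pass` status, and the choice to drop `0.0` speedups (which appear for models that did not run) are all assumptions. The sample values are taken from the inductor column of the timm_models tables.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups for one compiler.

    Assumption: only models whose accuracy status is exactly "pass" are kept,
    and a speedup of 0.0 is treated as "did not run" and dropped.
    """
    kept = [
        value
        for name, value in speedups.items()
        if accuracy.get(name) == "pass" and value > 0.0
    ]
    if not kept:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow on long lists.
    return math.exp(sum(math.log(value) for value in kept) / len(kept))


# Inductor speedups and accuracy statuses for four timm models.
speedups = {
    "ghostnet_100": 1.8718,
    "lcnet_050": 1.6601,
    "resnest101e": 1.3157,
    "tnt_s_patch16_224": 0.0,
}
accuracy = {
    "ghostnet_100": "pass",
    "lcnet_050": "pass",
    "resnest101e": "fail_accuracy",
    "tnt_s_patch16_224": "pass",
}

print(f"{geomean_speedup(speedups, accuracy):.2f}x")
```

In this sample, `resnest101e` is excluded because it fails the accuracy check and `tnt_s_patch16_224` because its speedup is `0.0`, so only two models contribute to the mean.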

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.16    |    2.43     |    1.91     |
|       aot_eager        |    5.77    |    7.84     |    7.05     |
|     aot_cudagraphs     |    8.60    |    16.10    |    13.16    |
|    nvprims_nvfuser     |   73.63    |   109.11    |   124.35    |
|        inductor        |   29.31    |    29.54    |    34.71    |
| inductor_no_cudagraphs |   28.61    |    25.45    |    33.28    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.85x    |    0.87x    |    0.84x    |
|        inductor        |   0.87x    |    0.72x    |    0.98x    |
| inductor_no_cudagraphs |   1.01x    |    0.96x    |    1.09x    |
+------------------------+------------+-------------+-------------+

Warnings

We flag models where:

- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings

~~~
+-------------+------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench | lennard_jones | 1.7378 | 0.9441 |
| torchbench | soft_actor_critic | 1.4286 | 0.9322 |
| torchbench | nvidia_deeprecommender | 0.9036 | 0.9642 |
| torchbench | dlrm | 0.0 | 1.0444 |
| torchbench | hf_GPT2_large | 0.0 | 1.4742 |
| torchbench | hf_T5 | 0.0 | 1.5685 |
| torchbench | tacotron2 | 0.0 | 0.9028 |
| torchbench | hf_Longformer | 0.0 | 0.0 |
| torchbench | moco | 0.0 | 0.0 |
| huggingface | AllenaiLongformerBase | 0.0 | 0.0 |
| timm_models | resmlp_12_224 | 0.7921 | 0.8299 |
| timm_models | tnt_s_patch16_224 | 0.0 | 1.5428 |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+------------+-------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+------------+-------------------+----------+------------------------+
| torchbench | yolov3 | 371.9531 | 363.8208 |
| torchbench | timm_efficientdet | 122.8743 | 119.0122 |
+------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite | name | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench | timm_resnest | 0.8982 | 1.0018 |
| torchbench | hf_Albert | 0.8836 | 1.2212 |
| torchbench | mobilenet_v3_large | 0.8829 | 0.896 |
| torchbench | hf_T5_large | 0.8737 | 0.922 |
| torchbench | timm_vision_transformer_large | 0.8622 | 1.0312 |
| torchbench | resnet50 | 0.8564 | 0.9343 |
| torchbench | densenet121 | 0.8562 | 1.0006 |
| torchbench | mnasnet1_0 | 0.8531 | 0.8659 |
| torchbench | fastNLP_Bert | 0.8354 | 1.1229 |
| torchbench | hf_Bart | 0.8318 | 1.1277 |
| torchbench | resnext50_32x4d | 0.8302 | 0.8356 |
| torchbench | BERT_pytorch | 0.826 | 1.0815 |
| torchbench | hf_BigBird | 0.8211 | 1.0391 |
| torchbench | dcgan | 0.767 | 0.8875 |
| torchbench | drq | 0.7632 | 0.8778 |
| torchbench | timm_vovnet | 0.7609 | 0.9526 |
| torchbench | timm_vision_transformer | 0.7517 | 0.8216 |
| torchbench | soft_actor_critic | 0.75 | 0.9991 |
| torchbench | alexnet | 0.743 | 0.8335 |
| torchbench | hf_Bert | 0.7062 | 1.0016 |
| torchbench | resnet18 | 0.6902 | 0.7049 |
| torchbench | LearningToPaint | 0.6889 | 0.916 |
| torchbench | vgg16 | 0.6637 | 0.9553 |
| torchbench | hf_DistilBert | 0.6595 | 0.9466 |
| torchbench | lennard_jones | 0.5646 | 0.9989 |
| torchbench | nvidia_deeprecommender | 0.5598 | 0.5598 |
| torchbench | hf_Reformer | 0.5232 | 0.9892 |
| torchbench | attention_is_all_you_need_pytorch | 0.4867 | 0.6781 |
| torchbench | pytorch_struct | 0.4222 | 0.4335 |
| torchbench | functorch_dp_cifar10 | 0.4056 | 0.4214 |
| torchbench | dlrm | nan | 0.7306 |
| huggingface | AlbertForQuestionAnswering | 0.8646 | 1.4039 |
| huggingface | T5Small | 0.8453 | 1.0606 |
| huggingface | PegasusForConditionalGeneration | 0.8436 | 1.0204 |
| huggingface | AlbertForMaskedLM | 0.842 | 1.3737 |
| huggingface | T5ForConditionalGeneration | 0.8215 | 1.1049 |
| huggingface | BigBird | 0.821 | 1.0085 |
| huggingface | XGLMForCausalLM | 0.8157 | 0.9642 |
| huggingface | M2M100ForConditionalGeneration | 0.8138 | 1.0093 |
| huggingface | DistillGPT2 | 0.8057 | 0.9257 |
| huggingface | ElectraForCausalLM | 0.7929 | 0.9036 |
| huggingface | YituTechConvBert | 0.7888 | 0.8725 |
| huggingface | PegasusForCausalLM | 0.7774 | 0.931 |
| huggingface | BartForConditionalGeneration | 0.7734 | 0.9515 |
| huggingface | GoogleFnet | 0.7698 | 0.9372 |
| huggingface | MT5ForConditionalGeneration | 0.763 | 0.9406 |
| huggingface | MegatronBertForQuestionAnswering | 0.7528 | 0.9646 |
| huggingface | CamemBert | 0.7487 | 0.9186 |
| huggingface | PLBartForCausalLM | 0.7381 | 0.9055 |
| huggingface | PLBartForConditionalGeneration | 0.7238 | 0.9373 |
| huggingface | MBartForConditionalGeneration | 0.7209 | 0.9059 |
| huggingface | LayoutLMForSequenceClassification | 0.7189 | 1.0294 |
| huggingface | MegatronBertForCausalLM | 0.7161 | 0.9247 |
| huggingface | BartForCausalLM | 0.7149 | 0.9466 |
| huggingface | BlenderbotSmallForCausalLM | 0.7147 | 0.8647 |
| huggingface | ElectraForQuestionAnswering | 0.7054 | 1.0298 |
| huggingface | DistilBertForQuestionAnswering | 0.6981 | 0.9303 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6977 | 0.946 |
| huggingface | LayoutLMForMaskedLM | 0.695 | 0.9772 |
| huggingface | MBartForCausalLM | 0.6836 | 0.8978 |
| huggingface | TrOCRForCausalLM | 0.6827 | 0.8876 |
| huggingface | Speech2Text2ForCausalLM | 0.6775 | 0.9179 |
| huggingface | OPTForCausalLM | 0.6764 | 0.8848 |
| huggingface | DistilBertForMaskedLM | 0.6531 | 0.9124 |
| huggingface | BertForMaskedLM | 0.6385 | 0.8992 |
| huggingface | RobertaForCausalLM | 0.6375 | 0.8974 |
| huggingface | BertForQuestionAnswering | 0.6329 | 0.8939 |
| huggingface | RobertaForQuestionAnswering | 0.6329 | 0.8939 |
| huggingface | MobileBertForMaskedLM | 0.5256 | 0.7111 |
| huggingface | MobileBertForQuestionAnswering | 0.4536 | 0.5968 |
| huggingface | DebertaForMaskedLM | 0.386 | 1.0347 |
| huggingface | DebertaForQuestionAnswering | 0.2902 | 1.1588 |
| timm_models | selecsls42b | 0.899 | 1.0046 |
| timm_models | swsl_resnext101_32x16d | 0.8932 | 0.9946 |
| timm_models | res2net50_14w_8s | 0.8821 | 1.0206 |
| timm_models | regnety_002 | 0.8617 | 1.0396 |
| timm_models | botnet26t_256 | 0.8605 | 0.9622 |
| timm_models | pit_b_224 | 0.8563 | 1.0752 |
| timm_models | sebotnet33ts_256 | 0.841 | 0.9709 |
| timm_models | coat_lite_mini | 0.821 | 1.0246 |
| timm_models | gernet_l | 0.7928 | 0.9926 |
| timm_models | resmlp_12_224 | 0.7899 | 0.7979 |
| timm_models | repvgg_a2 | 0.7684 | 0.9902 |
| timm_models | convit_base | 0.7462 | 0.9008 |
| timm_models | crossvit_9_240 | 0.6584 | 0.8853 |
| timm_models | tnt_s_patch16_224 | nan | 0.8622 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
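The three warning thresholds can be expressed as a small check per model and compiler. The sketch below is illustrative only: `flag_model` and the constant names are hypothetical, and the handling of `nan` is an assumption chosen to match what the tables show (a `nan` compression ratio such as `dlrm` under inductor is listed, while a `nan` compilation latency is not).

```python
SPEEDUP_MIN = 0.95          # flag when speedup < 0.95x
LATENCY_MAX_SEC = 120.0     # flag when compilation latency > 120 sec
COMPRESSION_MIN = 0.9       # flag when compression ratio < 0.9


def flag_model(speedup, latency_sec, compression):
    """Return the list of warning categories triggered by one model/compiler pair."""
    flags = []
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    # Any comparison with nan is False, so a missing latency is not flagged.
    if latency_sec > LATENCY_MAX_SEC:
        flags.append("latency")
    # Written as "not >=" so that a nan compression ratio is flagged (assumption).
    if not compression >= COMPRESSION_MIN:
        flags.append("compression")
    return flags


# timm_models / resmlp_12_224 under inductor: slow and memory-hungry.
print(flag_model(0.7921, 22.1366, 0.7899))
# timm_models / ghostnet_100 under inductor: no warnings.
print(flag_model(1.8718, 31.1712, 0.9853))
```

Note that the warning tables list a model when either the inductor or the inductor_no_cudagraphs value crosses a threshold, so a full report would apply this check to both columns.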

Metrics over time

../test-dynamo-runner-logs-4/passrate_over_time.png : ![](https://i.imgur.com/zkYgLTU.png)

../test-dynamo-runner-logs-4/geomean_over_time.png : ![](https://i.imgur.com/mo6Zqrd.png)

Accuracy Regressions

For each relevant compiler, we compare the two most recent reports (that actually run the compiler) to find models where previously successful accuracy tests now fail.

No accuracy regressions found.
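The regression check described here amounts to a comparison of two status maps. Below is a minimal sketch, assuming each report is available as a dict from model name to accuracy status. The function name `accuracy_regressions` and the sample data are hypothetical, and treating only an exact `pass` as "previously successful" is an assumption (the real check may also count `pass_due_to_skip`).

```python
def accuracy_regressions(previous, current):
    """Models that passed accuracy in the previous report but not in the current one.

    Assumption: a model missing from the current report is not counted as a
    regression, since it cannot be compared.
    """
    return sorted(
        name
        for name, status in current.items()
        if previous.get(name) == "pass" and status != "pass"
    )


# Hypothetical statuses for one compiler in two consecutive reports.
previous_report = {"model_a": "pass", "model_b": "pass", "model_c": "fail_to_run"}
current_report = {"model_a": "pass", "model_b": "fail_accuracy", "model_c": "fail_to_run"}

regressions = accuracy_regressions(previous_report, current_report)
print(regressions if regressions else "No accuracy regressions found.")
```

Here `model_c` is not reported because it was already failing, which matches the "previously successful" wording above.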

torchbench suite with float32 precision

Performance speedup

~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| densenet121 | 4 | 1.0008 | 1.0057 | 2.3434 | 0.0 | 5.2693 | 1.2666 |
| timm_efficientdet | 1 | 0.9803 | 0.8926 | 1.8373 | 0.0 | 4.2948 | 1.5047 |
| functorch_dp_cifar10 | 64 | 1.0098 | 1.0288 | 2.1432 | 0.0 | 3.7607 | 1.2459 |
| timm_vision_transformer | 8 | 1.0061 | 0.9367 | 1.5235 | 0.6774 | 2.597 | 1.4078 |
| drq | 1 | 1.0063 | 0.8655 | 1.66 | 0.701 | 2.4435 | 1.064 |
| BERT_pytorch | 16 | 1.0128 | 0.888 | 1.11 | 0.9921 | 2.0945 | 2.1387 |
| resnext50_32x4d | 8 | 1.0028 | 1.1006 | 1.2921 | 0.0 | 2.0234 | 1.192 |
| mobilenet_v3_large | 32 | 1.0036 | 1.1076 | 1.0129 | 0.0 | 1.9873 | 1.3401 |
| resnet18 | 16 | 1.0019 | 1.1088 | 1.148 | 0.0 | 1.8543 | 1.2494 |
| pytorch_struct | 200 | 0.9969 | 0.7519 | 0.8876 | 0.8095 | 1.8197 | 1.1619 |
| squeezenet1_1 | 32 | 0.9946 | 1.0094 | 1.0664 | 0.8555 | 1.7465 | 1.2652 |
| lennard_jones | 1000 | 0.9615 | 0.8552 | 1.0328 | 0.6864 | 1.7378 | 0.9441 |
| hf_T5_large | 2 | 1.0245 | 0.9081 | 0.0 | 0.9845 | 1.6753 | 1.9295 |
| dcgan | 32 | 0.9805 | 1.0136 | 1.2702 | 0.7708 | 1.6664 | 1.0562 |
| hf_Albert | 8 | 1.0012 | 0.9963 | 0.7507 | 1.4773 | 1.6427 | 1.6398 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9993 | 1.0074 | 1.3055 | 0.8421 | 1.6241 | 1.3441 |
| speech_transformer | 32 | 1.0061 | 0.9316 | 1.5091 | 0.8117 | 1.5487 | 1.5451 |
| shufflenet_v2_x1_0 | 128 | 1.0027 | 1.0438 | 0.8067 | 0.0 | 1.5411 | 1.3854 |
| timm_resnest | 32 | 0.9992 | 1.0022 | 0.8044 | 0.0 | 1.5171 | 1.4537 |
| hf_GPT2 | 4 | 1.0075 | 0.9813 | 0.7396 | 0.4168 | 1.4972 | 1.4989 |
| timm_nfnet | 128 | 0.9995 | 1.0001 | 0.0 | 1.2476 | 1.4723 | 1.4237 |
| mnasnet1_0 | 32 | 1.001 | 1.0946 | 0.8619 | 0.0 | 1.4645 | 1.2723 |
| mobilenet_v2_quantized_qat | 96 | 1.0015 | 0.9797 | 0.0 | 0.0 | 1.4301 | 1.4311 |
| mobilenet_v2 | 96 | 0.9996 | 0.9989 | 0.7294 | 0.0 | 1.4289 | 1.4017 |
| soft_actor_critic | 256 | 0.9774 | 0.8054 | 1.0894 | 0.6863 | 1.4286 | 0.9322 |
| fastNLP_Bert | 6 | 0.999 | 0.9764 | 0.7511 | 1.1759 | 1.4211 | 1.3917 |
| resnet50_quantized_qat | 32 | 1.0004 | 0.973 | 0.0 | 0.0 | 1.3795 | 1.3803 |
| timm_efficientnet | 32 | 0.9541 | 0.8118 | 0.6972 | 0.0 | 1.3538 | 1.195 |
| LearningToPaint | 96 | 1.0012 | 1.049 | 0.8596 | 0.0 | 1.2663 | 1.1859 |
| pytorch_stargan | 16 | 0.9991 | 1.0766 | 0.933 | 0.0 | 1.2614 | 1.2286 |
| resnet50 | 32 | 0.999 | 0.9921 | 0.7608 | 0.0 | 1.2048 | 1.1686 |
| hf_Bart | 4 | 1.0124 | 0.973 | 0.7858 | 0.7878 | 1.2029 | 1.1957 |
| pytorch_unet | 1 | 0.9997 | 0.9975 | 0.8467 | 0.0 | 1.202 | 1.186 |
| hf_Bert | 4 | 1.0216 | 0.9963 | 0.7315 | 0.9151 | 1.2011 | 1.1818 |
| Super_SloMo | 6 | 0.9999 | 0.9982 | 0.8674 | 1.0023 | 1.1813 | 1.1645 |
| hf_DistilBert | 8 | 1.0008 | 0.9567 | 0.6866 | 0.5228 | 1.1729 | 1.1789 |
| vgg16 | 64 | 0.9998 | 0.999 | 0.8595 | 0.9977 | 1.1722 | 1.1668 |
| alexnet | 128 | 0.999 | 0.9971 | 0.8025 | 1.0043 | 1.1602 | 1.1631 |
| hf_Reformer | 4 | 0.9984 | 1.0012 | 0.9881 | 0.0 | 1.1311 | 1.14 |
| timm_regnet | 32 | 0.9637 | 0.9603 | 0.7797 | 0.0 | 1.126 | 1.0908 |
| Background_Matting | 4 | 1.0001 | 1.0212 | 0.8682 | 0.0 | 1.1155 | 1.1072 |
| yolov3 | 16 | 1.0 | 0.9945 | 0.7916 | 1.2029 | 1.0913 | 1.0786 |
| hf_BigBird | 2 | 0.9873 | 0.9345 | 0.9709 | 0.9006 | 1.0887 | 0.9962 |
| attention_is_all_you_need_pytorch | 256 | 1.0003 | 0.968 | 0.756 | 0.9804 | 1.0642 | 1.0483 |
| timm_vision_transformer_large | 8 | 0.9993 | 0.9953 | 0.0 | 0.976 | 1.0492 | 1.0361 |
| timm_vovnet | 32 | 0.9089 | 0.9042 | 0.7153 | 0.0 | 1.007 | 1.0165 |
| tts_angular | 64 | 0.9884 | 0.9598 | 0.9853 | 0.9695 | 1.0069 | 1.0177 |
| demucs | 4 | 0.9995 | 0.9998 | 0.9996 | 1.0002 | 1.0002 | 1.0002 |
| nvidia_deeprecommender | 256 | 0.9987 | 0.963 | 0.5847 | 0.976 | 0.9036 | 0.9642 |
| dlrm | 2048 | 0.0 | 1.0515 | 0.0 | 0.9973 | 0.0 | 1.0444 |
| hf_GPT2_large | 4 | 0.9991 | 0.9798 | 0.0 | 0.5989 | 0.0 | 1.4742 |
| hf_T5 | 8 | 0.9993 | 0.953 | 0.0 | 1.247 | 0.0 | 1.5685 |
| tacotron2 | 64 | 0.9754 | 0.8418 | 0.0 | 0.0 | 0.0 | 0.9028 |
| hf_Longformer | 2 | 0.9473 | 0.8798 | 0.8034 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientdet | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| yolov3 | 2 | pass | pass | pass | pass | pass | pass |
| dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Albert | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5 | 2 | pass | pass | pass |
pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.8614 | 7.0158 | 10.0377 | 109.6599 | 371.9531 | 363.8208 | | timm_efficientdet | 1 | 19.224 | 33.2178 | 66.224 | nan | 122.8743 | 119.0122 | | hf_T5_large | 2 | 13.8547 | 35.3758 | nan | 426.7214 | 102.4023 | 100.2926 | | timm_vision_transformer_large | 8 | 2.2387 | 11.1578 
| nan | 253.043 | 50.511 | 49.0287 | | attention_is_all_you_need_pytorch | 256 | 1.1049 | 5.4952 | 8.92 | 108.9814 | 45.5129 | 44.4213 | | densenet121 | 4 | 2.0417 | 9.6272 | 15.6198 | nan | 41.6513 | 40.5756 | | timm_resnest | 32 | 0.5392 | 2.0095 | 3.0833 | nan | 39.8511 | 38.54 | | hf_BigBird | 2 | 7.4753 | 12.9008 | 25.7752 | 84.9528 | 37.7178 | 25.416 | | timm_vision_transformer | 8 | 0.7547 | 3.4535 | 4.9656 | 61.655 | 32.2756 | 29.7435 | | hf_Bart | 4 | 1.573 | 6.4352 | 10.845 | 118.5196 | 28.5612 | 27.4618 | | timm_nfnet | 128 | 1.914 | 6.2307 | nan | 131.6158 | 27.2858 | 27.0435 | | BERT_pytorch | 16 | 1.4301 | 5.9278 | 8.9954 | 83.4438 | 26.7428 | 26.3124 | | pytorch_stargan | 16 | 0.3876 | 1.7235 | 2.509 | nan | 26.573 | 26.3066 | | resnet50_quantized_qat | 32 | 1.1032 | 7.0465 | nan | nan | 26.3433 | 26.4722 | | mobilenet_v2_quantized_qat | 96 | 1.2571 | 7.2017 | nan | nan | 25.989 | 25.9592 | | fastNLP_Bert | 6 | 1.4423 | 5.23 | 9.1513 | 88.2481 | 25.6569 | 24.2144 | | speech_transformer | 32 | 1.607 | 6.8204 | 25.7941 | 117.8391 | 25.4129 | 25.0411 | | timm_regnet | 32 | 2.2012 | 6.5009 | 17.8336 | nan | 23.0356 | 22.88 | | mobilenet_v3_large | 32 | 0.8264 | 3.8889 | 5.7435 | nan | 22.7694 | 22.1405 | | timm_efficientnet | 32 | 1.6793 | 5.6688 | 13.8038 | nan | 22.1784 | 21.7219 | | pytorch_struct | 200 | 0.2413 | 0.6161 | 1.1654 | 4.0189 | 19.5008 | 18.2188 | | hf_Reformer | 4 | 1.6925 | 2.885 | 5.6044 | nan | 19.2174 | 15.9965 | | hf_Bert | 4 | 1.5142 | 5.2937 | 7.9301 | 89.0286 | 18.2225 | 17.5742 | | mnasnet1_0 | 32 | 0.763 | 3.4271 | 5.2587 | nan | 18.0671 | 17.6162 | | shufflenet_v2_x1_0 | 128 | 0.9168 | 4.0663 | 6.2239 | nan | 17.7175 | 16.8712 | | timm_vovnet | 32 | 1.4409 | 3.7788 | 8.8736 | nan | 17.5028 | 17.2754 | | resnet50 | 32 | 0.8201 | 3.7567 | 5.5967 | nan | 17.4673 | 16.9844 | | hf_Albert | 8 | 1.1841 | 4.5928 | 7.5068 | 103.8845 | 17.215 | 16.4293 | | resnext50_32x4d | 8 | 0.8406 | 3.7221 | 5.762 | nan | 16.9006 | 16.3333 | | 
hf_GPT2 | 4 | 1.4463 | 5.1416 | 7.63 | 69.0378 | 16.7157 | 16.1243 | | Super_SloMo | 6 | 0.9714 | 3.9762 | 5.5713 | 32.2723 | 16.4381 | 15.5588 | | Background_Matting | 4 | 0.6921 | 3.5676 | 5.501 | nan | 15.9924 | 15.0031 | | mobilenet_v2 | 96 | 0.7311 | 3.7079 | 5.8611 | nan | 15.8456 | 16.0991 | | functorch_dp_cifar10 | 64 | 0.3423 | 1.3407 | 2.0217 | nan | 12.2127 | 12.3331 | | hf_DistilBert | 8 | 0.6109 | 2.5533 | 4.5332 | 40.5139 | 11.6684 | 11.4337 | | resnet18 | 16 | 0.3851 | 1.4827 | 2.1284 | nan | 10.6175 | 10.2896 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.3667 | 1.5555 | 2.2852 | 30.6837 | 7.8733 | 7.6621 | | pytorch_unet | 1 | 0.4249 | 1.6126 | 2.4816 | nan | 7.7689 | 7.4441 | | LearningToPaint | 96 | 0.4226 | 1.5308 | 2.3427 | nan | 6.8033 | 6.6982 | | squeezenet1_1 | 32 | 0.1909 | 0.659 | 1.0197 | 4.2894 | 3.9135 | 3.5092 | | drq | 1 | 0.2866 | 0.5031 | 0.8449 | 4.0736 | 3.653 | 3.3213 | | soft_actor_critic | 256 | 0.2006 | 0.2947 | 0.5216 | 1.515 | 3.364 | 2.8142 | | vgg16 | 64 | 0.186 | 0.4632 | 0.8377 | 2.7182 | 3.3332 | 3.2707 | | nvidia_deeprecommender | 256 | 0.1909 | 0.3714 | 0.6361 | 4.5277 | 3.213 | 2.9493 | | alexnet | 128 | 0.1474 | 0.3139 | 0.5577 | 2.9115 | 2.8864 | 2.6028 | | dcgan | 32 | 0.1651 | 0.3577 | 0.5697 | 4.2487 | 2.5997 | 2.3809 | | lennard_jones | 1000 | 0.1361 | 0.2436 | 0.3939 | 1.2155 | 2.0081 | 1.7488 | | tts_angular | 64 | 0.2053 | 0.2465 | 0.3741 | 1.0179 | 1.8876 | 1.7878 | | demucs | 4 | 0.2968 | 0.2938 | 0.3021 | 0.2903 | 0.204 | 0.2033 | | tacotron2 | 64 | 17.3452 | 29.1381 | nan | nan | nan | 63.1371 | | hf_GPT2_large | 4 | 5.1006 | 15.8775 | nan | 231.6096 | nan | 41.0449 | | hf_T5 | 8 | 2.4009 | 7.6274 | nan | 67.3711 | nan | 26.4199 | | dlrm | 2048 | nan | 0.7163 | nan | 2.7078 | nan | 2.9103 | | hf_Longformer | 2 | 6.1844 | 12.9431 | 57.4587 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | 
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | nan | 1.5819 | 1.5819 | | resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | nan | 1.4874 | 1.4867 | | timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2634 | nan | 1.3107 | 1.3923 | | Super_SloMo | 6 | 1.0024 | 0.9527 | 0.3631 | 0.9528 | 1.2027 | 1.4002 | | mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | nan | 1.1743 | 1.2832 | | timm_efficientdet | 1 | 1.011 | 0.823 | 0.289 | nan | 1.1162 | 1.1442 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.8136 | 1.0823 | 1.1864 | | speech_transformer | 32 | 0.9977 | 0.9148 | 0.2708 | 0.8942 | 1.0389 | 1.0454 | | timm_nfnet | 128 | 0.936 | 0.8937 | nan | 0.8898 | 1.0219 | 1.0963 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | nan | 0.9832 | 1.0394 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | nan | 0.9814 | 1.0418 | | hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3702 | 0.8845 | 0.9703 | 1.1374 | | timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | nan | 0.9406 | 1.0831 | | yolov3 | 16 | 0.9957 | 0.844 | 0.3341 | 0.8182 | 0.9237 | 1.1052 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9981 | 0.9166 | 0.3915 | 0.8952 | 0.9169 | 0.9991 | | pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | nan | 0.9118 | 1.105 | | pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 | | 
timm_resnest | 32 | 0.9931 | 0.8807 | 0.3236 | nan | 0.8982 | 1.0018 | | hf_Albert | 8 | 0.9332 | 0.9332 | 0.2846 | 0.7425 | 0.8836 | 1.2212 | | mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3278 | nan | 0.8829 | 0.896 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | 0.8425 | 0.8737 | 0.922 | | timm_vision_transformer_large | 8 | 0.9998 | 0.8416 | nan | 0.8374 | 0.8622 | 1.0312 | | resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | nan | 0.8564 | 0.9343 | | densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | nan | 0.8562 | 1.0006 | | mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | nan | 0.8531 | 0.8659 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 0.906 | 0.8354 | 1.1229 | | hf_Bart | 4 | 0.9617 | 0.8772 | 0.3385 | 0.8568 | 0.8318 | 1.1277 | | resnext50_32x4d | 8 | 0.9952 | 0.8668 | 0.3592 | nan | 0.8302 | 0.8356 | | BERT_pytorch | 16 | 1.0 | 0.898 | 0.3505 | 0.8837 | 0.826 | 1.0815 | | hf_BigBird | 2 | 0.9608 | 0.9608 | 0.4299 | 0.9608 | 0.8211 | 1.0391 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.8875 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 | | timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | nan | 0.7609 | 0.9526 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3313 | 0.8772 | 0.7517 | 0.8216 | | soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7449 | 0.743 | 0.8335 | | hf_Bert | 4 | 0.9683 | 0.9018 | 0.3526 | 0.8929 | 0.7062 | 1.0016 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | nan | 0.6902 | 0.7049 | | LearningToPaint | 96 | 0.9471 | 0.7168 | 0.3387 | nan | 0.6889 | 0.916 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6638 | 0.6637 | 0.9553 | | hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3213 | 0.887 | 0.6595 | 0.9466 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | hf_Reformer | 4 | 0.9872 | 0.9865 | 0.5793 | nan | 0.5232 | 0.9892 | | 
attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9139 | 0.4867 | 0.6781 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.4335 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4056 | 0.4214 | | tacotron2 | 64 | 0.9906 | 1.0301 | nan | nan | nan | 1.1623 | | hf_T5 | 8 | 0.9527 | 0.9415 | nan | 0.8724 | nan | 1.1507 | | hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | 0.876 | nan | 1.1258 | | dlrm | 2048 | nan | 0.7306 | nan | 0.7305 | nan | 0.7306 | | hf_Longformer | 2 | 0.9603 | 0.9604 | 0.2944 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~
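
A minimal sketch of how a per-suite geometric mean speedup could be derived from a speedup table like the one above. It assumes that a `0.0` entry marks a model that failed for that backend and is dropped before averaging; the function name `geomean_speedup` and the `inductor_sample` dictionary are illustrative and are not taken from the benchmark scripts.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups, ignoring failed runs.

    Assumption: 0.0 (or NaN) marks a model that failed for this backend,
    so it is excluded rather than averaged in.
    """
    # `s > 0.0` is False for NaN as well, so NaN entries are dropped too.
    valid = [s for s in speedups if s > 0.0]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# A few inductor speedups copied from the torchbench table above.
# dlrm and moco are 0.0 in that column and are skipped.
inductor_sample = {
    "mnasnet1_0": 1.4645,
    "resnet50": 1.2048,
    "vgg16": 1.1722,
    "alexnet": 1.1602,
    "dlrm": 0.0,
    "moco": 0.0,
}

print(f"{geomean_speedup(inductor_sample.values()):.4f}x")
```

The geometric mean is used rather than the arithmetic mean because speedups are ratios: a 2x gain on one model and a 0.5x slowdown on another should average to 1x.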

huggingface suite with float32 precision

Performance speedup

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0344 | 0.8988 | 1.7609 | 0.7669 | 3.2462 | 1.4282 |
| CamemBert | 1 | 1.0489 | 0.9111 | 1.3153 | 0.7487 | 2.3839 | 1.4892 |
| MT5ForConditionalGeneration | 8 | 1.0249 | 0.9058 | 1.197 | 1.0478 | 2.2642 | 1.9968 |
| DistillGPT2 | 1 | 1.0362 | 0.9281 | 1.0569 | 0.2843 | 2.1735 | 1.7704 |
| MobileBertForMaskedLM | 32 | 1.0219 | 0.9277 | 1.1471 | 0.0 | 2.1432 | 1.5437 |
| GoogleFnet | 1 | 0.9781 | 0.7916 | 0.9608 | 0.6787 | 1.8333 | 1.1417 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9777 | 0.0 | 0.7332 | 1.796 | 1.7868 |
| M2M100ForConditionalGeneration | 8 | 1.1668 | 0.8916 | 0.8688 | 0.8792 | 1.4677 | 1.3152 |
| T5ForConditionalGeneration | 4 | 1.0045 | 0.9328 | 0.7238 | 1.1659 | 1.4575 | 1.4377 |
| ElectraForQuestionAnswering | 64 | 1.0001 | 0.984 | 0.0 | 1.2717 | 1.4259 | 1.4061 |
| ElectraForCausalLM | 32 | 1.0002 | 0.9308 | 0.0 | 1.0449 | 1.4126 | 1.447 |
| MobileBertForQuestionAnswering | 64 | 1.0269 | 0.899 | 0.8661 | 0.0 | 1.4009 | 1.3149 |
| LayoutLMForSequenceClassification | 16 | 0.9999 | 0.9888 | 0.7371 | 1.1677 | 1.3004 | 1.2892 |
| T5Small | 1 | 1.0283 | 0.898 | 1.0214 | 1.0075 | 1.2743 | 1.1416 |
| AlbertForQuestionAnswering | 4 | 1.0013 | 1.0016 | 0.0 | 1.2136 | 1.2615 | 1.259 |
| AlbertForMaskedLM | 4 | 1.0002 | 0.9995 | 0.0 | 1.2086 | 1.2555 | 1.2542 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9694 | 0.0 | 1.0981 | 1.2117 | 1.2128 |
| PLBartForConditionalGeneration | 16 | 1.0171 | 0.9677 | 0.82 | 0.8295 | 1.2074 | 1.2039 |
| OPTForCausalLM | 32 | 1.001 | 0.9321 | 0.7133 | 0.4583 | 1.1814 | 1.2322 |
| XGLMForCausalLM | 8 | 1.0134 | 0.8793 | 0.7416 | 0.3262 | 1.1703 | 1.183 |
| DistilBertForQuestionAnswering | 64 | 0.9996 | 0.985 | 0.713 | 0.5283 | 1.1701 | 1.151 |
| RobertaForCausalLM | 64 | 1.0005 | 0.9613 | 0.7458 | 0.9897 | 1.1479 | 1.1508 |
| MegatronBertForQuestionAnswering | 16 | 1.0391 | 1.0134 | 0.7678 | 0.904 | 1.1423 | 1.1242 |
| Speech2Text2ForCausalLM | 128 | 0.9987 | 0.9247 | 0.6616 | 0.9473 | 1.1342 | 1.152 |
| MegatronBertForCausalLM | 16 | 1.0352 | 1.0109 | 0.7389 | 0.9715 | 1.1289 | 1.1169 |
| BertForQuestionAnswering | 128 | 1.0003 | 0.9934 | 0.0 | 1.0534 | 1.1144 | 1.1076 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9929 | 0.0 | 1.0538 | 1.1124 | 1.1142 |
| BartForConditionalGeneration | 2 | 1.0002 | 0.9869 | 0.0 | 0.4455 | 1.1005 | 1.0887 |
| BartForCausalLM | 4 | 1.0008 | 0.9659 | 0.7558 | 1.0034 | 1.0903 | 1.1102 |
| BigBird | 1 | 0.9842 | 0.9253 | 0.9888 | 0.8937 | 1.0902 | 0.9951 |
| PegasusForConditionalGeneration | 16 | 1.01 | 0.9642 | 0.7552 | 0.9091 | 1.0885 | 1.0682 |
| MBartForConditionalGeneration | 16 | 1.0101 | 0.9844 | 0.7644 | 0.9354 | 1.0882 | 1.1586 |
| DebertaForMaskedLM | 4 | 0.9045 | 0.7846 | 0.723 | 0.6431 | 1.0785 | 1.0406 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0007 | 0.9255 | 0.0 | 0.9561 | 1.0642 | 1.0726 |
| BertForMaskedLM | 64 | 1.0001 | 0.9609 | 0.7301 | 0.9877 | 1.0587 | 1.0605 |
| DistilBertForMaskedLM | 64 | 0.9998 | 0.9507 | 0.7124 | 0.618 | 1.0496 | 1.0677 |
| DebertaForQuestionAnswering | 8 | 0.996 | 0.966 | 0.6825 | 0.8678 | 1.0489 | 1.2207 |
| PLBartForCausalLM | 32 | 1.0063 | 0.9333 | 0.718 | 0.9233 | 1.0279 | 1.0546 |
| BlenderbotSmallForCausalLM | 64 | 1.0012 | 0.9104 | 0.6832 | 0.9228 | 1.0063 | 1.043 |
| TrOCRForCausalLM | 32 | 1.0008 | 0.9558 | 0.7333 | 0.9509 | 1.0037 | 1.014 |
| MBartForCausalLM | 32 | 1.0004 | 0.9539 | 0.7319 | 0.956 | 0.9984 | 1.0098 |
| PegasusForCausalLM | 32 | 0.9994 | 0.9522 | 0.7318 | 0.9518 | 0.991 | 1.0027 |
| AllenaiLongformerBase | 1 | 0.9248 | 0.8421 | 0.7665 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | pass | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BigBird | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 4.8696 | 10.4793 | 34.4488 | 80.102 | 95.1799 | 33.9856 |
| DebertaForMaskedLM | 4 | 4.8903 | 10.1684 | 39.0136 | 82.2904 | 89.5165 | 32.8486 |
| XGLMForCausalLM | 8 | 2.4373 | 10.1151 | 22.2057 | 184.0347 | 67.472 | 64.5787 |
| M2M100ForConditionalGeneration | 8 | 2.6255 | 12.8551 | 20.3877 | 240.1962 | 50.6846 | 53.7745 |
| MobileBertForMaskedLM | 32 | 8.2909 | 24.9714 | 41.3933 | nan | 48.7079 | 47.9116 |
| MobileBertForQuestionAnswering | 64 | 8.3929 | 23.8292 | 41.2316 | nan | 48.0873 | 48.1985 |
| BartForConditionalGeneration | 2 | 3.0164 | 12.3913 | nan | 261.2649 | 43.1544 | 40.7754 |
| PegasusForConditionalGeneration | 16 | 2.8098 | 12.217 | 20.3838 | 266.7147 | 42.5357 | 39.3584 |
| MBartForConditionalGeneration | 16 | 3.0064 | 12.8568 | 22.1754 | 271.3029 | 41.265 | 39.9633 |
| YituTechConvBert | 1 | 2.2847 | 8.4615 | 12.8935 | 128.5077 | 39.1252 | 36.8927 |
| BigBird | 1 | 7.4673 | 13.2271 | 25.8571 | 97.2564 | 37.3978 | 24.4872 |
| MegatronBertForCausalLM | 16 | 3.25 | 10.8935 | 16.6483 | 190.2921 | 32.5107 | 31.4699 |
| MegatronBertForQuestionAnswering | 16 | 3.2629 | 10.8829 | 17.1363 | 188.8132 | 32.2158 | 30.6754 |
| MT5ForConditionalGeneration | 8 | 3.7736 | 11.2854 | 17.9664 | 104.4138 | 31.3498 | 30.5518 |
| T5ForConditionalGeneration | 4 | 2.4031 | 8.0927 | 12.7737 | 67.9725 | 29.6106 | 28.1192 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.9057 | 8.3398 | nan | 164.3311 | 28.9149 | 27.9222 |
| T5Small | 1 | 2.4009 | 7.7054 | 11.553 | 70.5699 | 28.2884 | 27.324 |
| LayoutLMForSequenceClassification | 16 | 1.8371 | 5.7627 | 9.2001 | 90.5694 | 27.2105 | 25.9046 |
| PLBartForConditionalGeneration | 16 | 1.6054 | 6.6586 | 10.115 | 117.193 | 25.7334 | 25.1247 |
| ElectraForCausalLM | 32 | 1.5128 | 5.4868 | nan | 88.7785 | 25.6426 | 23.597 |
| PegasusForCausalLM | 32 | 1.1507 | 4.9241 | 7.9631 | 86.0692 | 21.1082 | 19.9817 |
| MBartForCausalLM | 32 | 1.1314 | 4.719 | 7.6295 | 89.0267 | 20.6058 | 20.1791 |
| GoogleFnet | 1 | 0.9536 | 2.926 | 9.0179 | 70.125 | 20.3296 | 13.4172 |
| LayoutLMForMaskedLM | 16 | 1.9758 | 5.8564 | nan | 87.4187 | 20.3206 | 19.4557 |
| BertForMaskedLM | 64 | 1.5049 | 5.2893 | 7.9608 | 90.3134 | 19.7607 | 19.0687 |
| TrOCRForCausalLM | 32 | 1.1652 | 4.9065 | 7.5491 | 89.377 | 19.5229 | 18.252 |
| ElectraForQuestionAnswering | 64 | 1.495 | 5.3576 | nan | 87.376 | 19.2805 | 18.7191 |
| RobertaForCausalLM | 64 | 1.4981 | 5.9284 | 8.172 | 90.7714 | 19.2058 | 18.4299 |
| BertForQuestionAnswering | 128 | 1.4996 | 5.37 | nan | 86.7613 | 19.0338 | 18.2877 |
| BartForCausalLM | 4 | 1.2393 | 4.7412 | 7.3513 | 89.429 | 18.9341 | 18.4051 |
| RobertaForQuestionAnswering | 128 | 1.5276 | 5.5296 | nan | 89.6935 | 18.2219 | 17.4613 |
| CamemBert | 1 | 1.5741 | 5.4813 | 7.5863 | 97.4246 | 17.7886 | 18.1791 |
| OPTForCausalLM | 32 | 1.2069 | 4.8382 | 9.4313 | 85.7846 | 17.089 | 16.6391 |
| GPT2ForSequenceClassification | 4 | 1.4922 | 5.3664 | nan | 70.7582 | 16.288 | 15.8247 |
| AlbertForMaskedLM | 4 | 1.2941 | 4.7028 | nan | 103.089 | 16.2048 | 15.0298 |
| AlbertForQuestionAnswering | 4 | 1.2907 | 4.7446 | nan | 100.8213 | 15.793 | 14.9597 |
| Speech2Text2ForCausalLM | 128 | 0.7228 | 2.6601 | 4.147 | 36.6383 | 14.6927 | 13.351 |
| BlenderbotSmallForCausalLM | 64 | 0.7996 | 3.2077 | 4.9571 | 54.4288 | 14.2352 | 13.6993 |
| PLBartForCausalLM | 32 | 0.6579 | 2.7688 | 3.8771 | 42.7143 | 13.2429 | 13.0004 |
| DistillGPT2 | 1 | 0.8116 | 2.6374 | 3.9301 | 39.9989 | 12.4299 | 12.0662 |
| DistilBertForMaskedLM | 64 | 0.6267 | 2.6339 | 4.5363 | 42.5658 | 11.315 | 10.7331 |
| DistilBertForQuestionAnswering | 64 | 0.6283 | 2.6887 | 4.4822 | 39.0327 | 10.7323 | 10.169 |
| AllenaiLongformerBase | 1 | 6.2745 | 13.2036 | 57.3764 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 0.8955 | 1.0595 | 1.1224 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.5681 | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9029 | 0.3414 | 0.8577 | 0.8453 | 1.0606 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | 0.9642 | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.5667 | 0.842 | 1.3737 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9093 | 0.8215 | 1.1049 |
| BigBird | 1 | 0.9979 | 0.9536 | 0.4208 | 0.9117 | 0.821 | 1.0085 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | 0.9267 | 0.8157 | 0.9642 |
| M2M100ForConditionalGeneration | 8 | 1.0217 | 0.9507 | 0.3799 | 0.9742 | 0.8138 | 1.0093 |
| DistillGPT2 | 1 | 0.9984 | 0.8113 | 0.3769 | 0.76 | 0.8057 | 0.9257 |
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.7909 | 0.7929 | 0.9036 |
| YituTechConvBert | 1 | 0.9863 | 0.8573 | 0.3681 | 0.8286 | 0.7888 | 0.8725 |
| PegasusForCausalLM | 32 | 0.9594 | 0.8885 | 0.3909 | 0.9232 | 0.7774 | 0.931 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.8866 | 0.7734 | 0.9515 |
| GoogleFnet | 1 | 0.9979 | 0.9451 | 0.3715 | 0.9293 | 0.7698 | 0.9372 |
| MT5ForConditionalGeneration | 8 | 1.0037 | 0.8873 | 0.4151 | 0.8853 | 0.763 | 0.9406 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | 0.8549 | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3612 | 0.7949 | 0.7487 | 0.9186 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.861 | 0.7381 | 0.9055 |
| PLBartForConditionalGeneration | 16 | 0.9998 | 0.8959 | 0.3581 | 0.872 | 0.7238 | 0.9373 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | 0.3438 | 0.8566 | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 0.9204 | 0.7189 | 1.0294 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | 0.8713 | 0.7161 | 0.9247 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.8956 | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.8401 | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 0.9357 | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3178 | 0.8865 | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 0.8975 | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.8883 | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3743 | 0.89 | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | 0.3743 | 0.8898 | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.8765 | 0.6775 | 0.9179 |
| OPTForCausalLM | 32 | 0.9982 | 0.8657 | 0.3606 | 0.7895 | 0.6764 | 0.8848 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.8016 | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.855 | 0.6385 | 0.8992 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3641 | 0.8538 | 0.6375 | 0.8974 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.9303 | 0.6329 | 0.8939 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.9303 | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | 0.3242 | nan | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | 0.2587 | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9843 | 0.3552 | 0.9262 | 0.386 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.063 | 0.2902 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9982 | 0.9521 | 0.3208 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
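
A rough sketch of how the `+---+` / `| a | b |` tables in this report could be parsed and turned into per-backend pass counts. The helper names `parse_ascii_table` and `pass_rate` are hypothetical, and which statuses count as passing (for example `pass_due_to_skip`) is an assumption exposed as a parameter, since the report does not state it. The sample is a trimmed excerpt of the huggingface Accuracy table, keeping only the `eager` and `inductor` columns.

```python
def parse_ascii_table(text):
    """Turn a '+---+' / '| a | b |' style table into a list of row dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, fences, and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def pass_rate(rows, backend, passing=("pass",)):
    """Return (passed, total) for one backend column.

    Assumption: only the statuses listed in `passing` count as a pass.
    """
    passed = sum(1 for r in rows if r[backend] in passing)
    return passed, len(rows)


# Excerpt of the huggingface Accuracy table, columns trimmed for brevity.
sample = """
+---------------------------------+----+-------+-------------+
| name | bs | eager | inductor |
+---------------------------------+----+-------+-------------+
| BartForCausalLM | 1 | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass |
| MBartForConditionalGeneration | 1 | pass | fail_to_run |
| AllenaiLongformerBase | 1 | pass | fail_to_run |
+---------------------------------+----+-------+-------------+
"""

rows = parse_ascii_table(sample)
for backend in ("eager", "inductor"):
    passed, total = pass_rate(rows, backend)
    print(f"{backend}: {100 * passed / total:.0f}%, {passed}/{total}")
```

The output format mirrors the `98%, 54/55` style of the Passrate summary; on the full tables the counts may differ from the summary if skipped models are treated differently there.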

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | ghostnet_100 | 128 | 0.9994 | 0.9731 | 0.8183 | 0.0 | 1.8718 | 1.8284 | | lcnet_050 | 128 | 0.9558 | 0.9489 | 0.7699 | 1.3477 | 1.6601 | 1.6228 | | regnety_002 | 128 | 0.9757 | 1.0017 | 0.8619 | 0.0 | 1.4928 | 1.3259 | | dm_nfnet_f0 | 128 | 0.9999 | 0.9997 | 0.0 | 1.2524 | 1.4716 | 1.4239 | | xcit_large_24_p8_224 | 5 | 1.0025 | 0.9839 | 0.7787 | 0.0 | 1.4359 | 1.3257 | | hrnet_w18 | 128 | 0.9999 | 0.9983 | 0.0 | 0.0 | 1.4165 | 1.3777 | | dla102 | 128 | 0.9999 | 1.0006 | 0.0 | 0.0 | 1.3836 | 1.3692 | | volo_d1_224 | 64 | 1.0 | 0.9945 | 0.802 | 0.0 | 1.3817 | 1.36 | | nfnet_l0 | 128 | 0.9996 | 0.789 | 0.0 | 1.2306 | 1.3724 | 1.3282 | | res2net50_14w_8s | 128 | 0.9998 | 0.9992 | 0.0 | 0.0 | 1.3566 | 1.3244 | | mobilenetv3_large_100 | 128 | 0.9658 | 0.9618 | 0.7658 | 0.0 | 1.3373 | 1.3431 | | mobilenetv2_100 | 128 | 0.9647 | 0.9637 | 0.7075 | 0.0 | 1.3369 | 1.354 | | coat_lite_mini | 128 | 0.9999 | 0.9834 | 0.8344 | 1.1056 | 1.333 | 1.3212 | | inception_v3 | 128 | 0.9999 | 0.996 | 0.0 | 0.0 | 1.3299 | 1.3084 | | gluon_inception_v3 | 128 | 0.9999 | 0.9984 | 0.0 | 0.0 | 1.3281 | 1.3084 | | adv_inception_v3 | 128 | 1.0 | 0.9989 | 0.0 | 0.0 | 1.3237 | 1.3076 | | crossvit_9_240 | 128 | 0.9997 | 0.9982 | 0.7599 | 1.0529 | 1.3213 | 1.3008 | | resnest101e | 64 | 0.9996 | 1.003 | 0.0 | 0.0 | 1.3157 | 1.2707 | | res2next50 | 128 | 0.9999 | 1.0007 | 0.0 | 0.0 | 1.3098 | 1.2736 | | jx_nest_base | 32 | 1.0003 | 0.9955 | 0.7311 | 0.0 | 1.2777 | 1.2486 | | fbnetv3_b | 128 | 0.9642 | 0.9607 | 0.7578 | 0.0 | 1.2759 | 1.2981 | | sebotnet33ts_256 | 64 | 0.9758 | 0.803 | 0.0 | 0.0 | 1.2673 | 
1.2692 | | selecsls42b | 128 | 0.9999 | 0.9988 | 0.8164 | 0.0 | 1.2673 | 1.2531 | | eca_botnext26ts_256 | 128 | 0.9867 | 0.7712 | 0.0 | 0.0 | 1.2659 | 1.2526 | | gmixer_24_224 | 128 | 0.9999 | 0.8097 | 0.0 | 1.0484 | 1.2617 | 1.2341 | | eca_halonext26ts | 128 | 0.9871 | 0.7786 | 0.0 | 0.0 | 1.2592 | 1.244 | | botnet26t_256 | 128 | 0.9856 | 0.9814 | 0.7881 | 0.0 | 1.2575 | 1.2606 | | mnasnet_100 | 128 | 0.966 | 0.9637 | 0.7877 | 0.0 | 1.2555 | 1.2822 | | tf_efficientnet_b0 | 128 | 0.9767 | 0.7831 | 0.0 | 0.0 | 1.2551 | 1.2683 | | fbnetc_100 | 128 | 0.9669 | 0.9628 | 0.7918 | 0.0 | 1.2497 | 1.2646 | | ese_vovnet19b_dw | 128 | 0.9791 | 0.9776 | 0.7447 | 0.0 | 1.2409 | 1.2475 | | spnasnet_100 | 128 | 0.961 | 0.9576 | 0.775 | 0.0 | 1.2373 | 1.253 | | res2net101_26w_4s | 64 | 0.9999 | 0.9971 | 0.7756 | 0.0 | 1.2236 | 1.1884 | | convit_base | 64 | 0.9997 | 0.9981 | 0.0 | 1.3105 | 1.2196 | 1.2094 | | rexnet_100 | 128 | 0.9732 | 0.8157 | 0.0 | 0.0 | 1.212 | 1.2191 | | cspdarknet53 | 64 | 0.9582 | 0.9523 | 0.737 | 1.2258 | 1.2104 | 1.2375 | | pnasnet5large | 16 | 0.9996 | 0.9982 | 0.0 | 0.0 | 1.2101 | 1.1942 | | twins_pcpvt_base | 64 | 1.0 | 0.9981 | 0.7489 | 1.0218 | 1.2084 | 1.1684 | | gmlp_s16_224 | 128 | 1.0 | 0.9493 | 0.0 | 1.0772 | 1.2002 | 1.1894 | | tinynet_a | 128 | 0.966 | 0.7753 | 0.6219 | 0.0 | 1.1899 | 1.194 | | dpn107 | 32 | 0.9577 | 0.9506 | 0.7805 | 0.0 | 1.1877 | 1.1992 | | pit_b_224 | 64 | 1.0003 | 0.9992 | 0.0 | 1.0508 | 1.1876 | 1.1775 | | cait_m36_384 | 4 | 1.0001 | 1.0266 | 0.0 | 1.0929 | 1.1807 | 1.157 | | repvgg_a2 | 128 | 0.964 | 0.9623 | 0.8285 | 1.1371 | 1.1713 | 1.1687 | | tf_mixnet_l | 128 | 0.9856 | 0.8896 | 0.0 | 0.0 | 1.1693 | 1.167 | | mobilevit_s | 64 | 0.9791 | 0.7621 | 0.0 | 0.0 | 1.1676 | 1.1689 | | poolformer_m36 | 64 | 0.9998 | 0.9983 | 0.0 | 0.0 | 1.1668 | 1.1468 | | mixnet_l | 128 | 0.9848 | 0.8855 | 0.0 | 0.0 | 1.1503 | 1.1485 | | swin_base_patch4_window7_224 | 64 | 1.0002 | 0.9779 | 0.0 | 0.0 | 1.1363 | 1.1333 | | 
beit_base_patch16_224 | 64 | 0.9997 | 0.9823 | 0.0 | 0.9404 | 1.1137 | 1.1025 | | swsl_resnext101_32x16d | 32 | 0.9999 | 0.9995 | 0.0 | 0.0 | 1.1075 | 1.0713 | | deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9984 | 0.7679 | 1.0025 | 1.0947 | 1.0821 | | gluon_xception65 | 32 | 0.9999 | 0.997 | 0.0 | 0.0 | 1.0869 | 1.0755 | | vit_base_patch16_224 | 64 | 1.0002 | 0.9981 | 0.7651 | 0.9715 | 1.0864 | 1.0709 | | convmixer_768_32 | 32 | 0.9998 | 0.9998 | 0.0 | 0.0 | 1.0776 | 1.0742 | | gernet_l | 128 | 0.9739 | 0.9725 | 0.8228 | 0.0 | 1.076 | 1.0708 | | convnext_base | 64 | 0.9999 | 0.9984 | 0.0 | 1.2056 | 1.074 | 1.0694 | | mixer_b16_224 | 128 | 1.0 | 0.9778 | 0.0 | 0.9032 | 1.0662 | 1.0611 | | visformer_small | 128 | 0.9996 | 1.0017 | 0.798 | 0.0 | 1.0471 | 1.0124 | | resmlp_12_224 | 128 | 0.9998 | 0.8547 | 0.612 | 1.0527 | 0.7921 | 0.8299 | | tnt_s_patch16_224 | 128 | 1.0001 | 0.9993 | 0.0 | 0.0 | 0.0 | 1.5428 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 
| pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | pass | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass 
| | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | jx_nest_base | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 5.6758 | 24.2259 | nan | nan | 97.9129 | 94.4011 | | swin_base_patch4_window7_224 | 64 | 2.5127 | 11.1487 | nan | nan | 74.4331 | 73.0858 | | mobilevit_s | 64 | 1.6771 | 5.9554 | nan | nan | 72.5534 | 70.5904 | | xcit_large_24_p8_224 | 5 | 2.5972 | 13.7943 | 26.3818 | nan | 72.1378 | 68.2345 | | pnasnet5large | 16 | 4.4234 | 18.195 | nan | nan | 70.2334 | 66.2853 | | twins_pcpvt_base | 64 | 2.2111 | 10.3305 | 18.7337 | 305.7279 | 61.7391 | 61.8269 | | cait_m36_384 | 4 | 2.6499 | 14.2511 | nan | 341.4789 | 60.2508 | 58.4439 | | convnext_base | 64 | 1.1844 | 5.1544 | nan | 114.4446 | 59.2765 | 58.0018 | | resnest101e | 64 | 3.1624 | 12.8703 | nan | nan | 55.191 | 53.935 | | jx_nest_base | 32 | 1.7818 | 7.4274 | 13.427 | nan | 53.2076 | 50.6647 | | res2net101_26w_4s | 64 | 2.981 | 13.2332 | 22.8468 | nan | 52.9602 | 48.8257 | | res2net50_14w_8s | 128 | 2.5697 | 12.0241 | nan | nan | 47.7386 | 44.5669 | | coat_lite_mini | 128 | 1.1213 | 4.1486 | 6.5904 | 85.8414 | 47.2758 | 47.0832 | | sebotnet33ts_256 | 64 | 1.5707 | 5.4185 | nan | nan | 46.5574 | 45.5205 | | eca_halonext26ts | 128 | 1.4776 | 4.5311 | nan | nan | 46.4726 | 45.7912 | | poolformer_m36 | 64 | 1.8539 | 7.0827 | nan | nan | 43.8955 | 43.9633 | | gmlp_s16_224 | 128 | 0.9854 | 5.1687 | nan | 119.5521 | 39.3374 | 37.3765 | | eca_botnext26ts_256 | 128 | 1.3412 | 4.4601 | nan | nan | 38.723 | 37.7321 | | dpn107 | 32 | 3.7593 | 11.499 | 35.9943 | nan | 37.6314 | 35.6378 | | fbnetv3_b | 128 | 2.9909 | 9.2712 | 25.5834 | nan | 37.0176 | 32.9729 | | crossvit_9_240 | 128 | 1.3783 | 6.3515 | 10.4875 | 151.9715 | 36.5609 | 34.4028 | | botnet26t_256 | 128 | 1.3216 | 3.6816 | 8.3269 | nan | 35.2519 | 35.0131 | | volo_d1_224 | 64 | 1.3955 | 6.0708 | 9.9995 | nan | 35.1971 | 32.658 | | gluon_xception65 | 32 | 1.7601 | 8.7117 | nan | nan | 34.1982 | 32.1738 | | 
adv_inception_v3 | 128 | 1.5995 | 6.837 | nan | nan | 32.8611 | 30.2135 | | inception_v3 | 128 | 1.5111 | 7.0072 | nan | nan | 31.8237 | 30.8443 | | gluon_inception_v3 | 128 | 1.502 | 6.873 | nan | nan | 31.6109 | 30.9949 | | ghostnet_100 | 128 | 2.6525 | 7.9379 | 12.7268 | nan | 31.1712 | 30.1409 | | tf_mixnet_l | 128 | 5.5719 | 11.2945 | nan | nan | 30.7085 | 29.4208 | | dla102 | 128 | 1.7017 | 7.6465 | nan | nan | 29.7943 | 28.6222 | | mixnet_l | 128 | 5.2959 | 10.8869 | nan | nan | 29.5401 | 28.9401 | | gmixer_24_224 | 128 | 1.0432 | 5.7894 | nan | 119.8018 | 29.2157 | 28.4857 | | swsl_resnext101_32x16d | 32 | 1.6284 | 7.482 | nan | nan | 28.6665 | 27.2489 | | dm_nfnet_f0 | 128 | 2.0469 | 6.6058 | nan | 131.7866 | 28.4855 | 27.7133 | | convit_base | 64 | 1.0715 | 4.7688 | nan | 99.9877 | 27.4687 | 26.505 | | res2next50 | 128 | 1.5744 | 6.7033 | nan | nan | 27.3919 | 25.8268 | | tinynet_a | 128 | 1.9908 | 6.5318 | 17.5592 | nan | 25.4807 | 24.1305 | | rexnet_100 | 128 | 1.8109 | 6.1602 | nan | nan | 25.4385 | 24.8808 | | tf_efficientnet_b0 | 128 | 1.7427 | 5.6416 | nan | nan | 22.605 | 21.0662 | | cspdarknet53 | 64 | 2.1923 | 6.3219 | 16.6104 | 111.791 | 22.3753 | 21.0157 | | resmlp_12_224 | 128 | 0.6079 | 2.4407 | 3.9406 | 29.6501 | 22.1366 | 20.9371 | | mixer_b16_224 | 128 | 0.6668 | 2.6879 | nan | 60.4097 | 22.033 | 20.1957 | | visformer_small | 128 | 0.927 | 3.4075 | 5.4637 | nan | 21.353 | 20.4765 | | nfnet_l0 | 128 | 1.7629 | 6.1974 | nan | 119.5436 | 21.0405 | 19.9757 | | convmixer_768_32 | 32 | 1.0919 | 4.9011 | nan | nan | 21.0102 | 19.7444 | | spnasnet_100 | 128 | 1.8763 | 5.353 | 14.956 | nan | 20.6172 | 19.5156 | | fbnetc_100 | 128 | 1.9551 | 5.5139 | 15.2878 | nan | 20.5856 | 19.8788 | | mobilenetv3_large_100 | 128 | 1.4853 | 4.5798 | 11.7251 | nan | 19.5646 | 18.995 | | beit_base_patch16_224 | 64 | 1.0998 | 4.2197 | nan | 76.8368 | 19.3602 | 18.6469 | | deit_base_distilled_patch16_224 | 64 | 0.8309 | 3.4963 | 5.8135 | 64.2137 | 19.3532 | 18.1886 | 
| mnasnet_100 | 128 | 1.5433 | 4.4024 | 11.6242 | nan | 18.6775 | 16.7912 | | vit_base_patch16_224 | 64 | 0.8307 | 3.4866 | 6.1571 | 62.9057 | 18.5383 | 17.9197 | | mobilenetv2_100 | 128 | 1.6797 | 4.5132 | 11.6029 | nan | 18.3415 | 17.3047 | | repvgg_a2 | 128 | 1.8844 | 5.2869 | 13.9984 | 216.8296 | 17.6041 | 16.9046 | | pit_b_224 | 64 | 0.9745 | 3.9159 | nan | 82.2597 | 17.3771 | 16.6796 | | gernet_l | 128 | 1.8878 | 5.0447 | 13.713 | nan | 17.0365 | 16.3314 | | regnety_002 | 128 | 1.5071 | 4.5384 | 11.3701 | nan | 16.9243 | 16.4111 | | selecsls42b | 128 | 0.8012 | 2.9711 | 4.8528 | nan | 15.2159 | 14.5701 | | lcnet_050 | 128 | 0.9774 | 2.8222 | 6.6583 | 67.5505 | 13.0877 | 12.0854 | | ese_vovnet19b_dw | 128 | 0.9845 | 2.526 | 5.9736 | nan | 12.4852 | 11.6486 | | tnt_s_patch16_224 | 128 | 1.546 | 8.1097 | nan | nan | nan | 31.7087 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | 0.9166 | 1.5552 | 1.6267 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2616 | nan | 1.351 | 1.5843 | | nfnet_l0 | 128 | 0.9931 | 0.8274 | nan | 0.8322 | 1.2911 | 1.4945 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | nan | 1.2619 | 1.4738 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | nan | 1.2059 | 1.3819 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | nan | 1.1771 | 1.3424 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | nan | 1.1752 | 1.2828 | | eca_botnext26ts_256 | 128 | 
0.9938 | 0.7674 | nan | nan | 1.1378 | 1.3608 | | eca_halonext26ts | 128 | 0.9938 | 0.7687 | nan | nan | 1.1376 | 1.3403 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | 0.933 | 1.1184 | 1.1751 | | poolformer_m36 | 64 | 0.9979 | 0.9511 | nan | nan | 1.0526 | 1.0689 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8935 | nan | 0.8897 | 1.0218 | 1.0961 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.9286 | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | nan | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9962 | 0.9435 | 0.3153 | 0.9163 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | nan | 0.9926 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9963 | 0.9441 | 0.3137 | 0.9167 | 0.9926 | 1.0799 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | 0.8423 | 0.9924 | 1.0856 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | nan | 0.9853 | 1.1265 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9837 | 1.0658 | | mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | 0.8965 | 0.9827 | 1.0538 | | tf_mixnet_l | 128 | 0.9953 | 0.8572 | nan | nan | 0.9769 | 1.1451 | | gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | 0.9209 | 0.9766 | 0.9827 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | 0.3269 | nan | 0.9633 | 1.0572 | | dla102 | 128 | 0.9831 | 0.9169 | nan | nan | 0.9632 | 1.0419 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | nan | 0.952 | 1.0925 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | nan | 0.942 | 0.9938 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | nan | 0.9408 | 1.0412 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | nan | 0.9382 | 0.993 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | nan | 0.9379 | 1.0122 | | jx_nest_base | 32 | 1.0003 | 0.8968 | 0.2863 | nan | 0.9348 | 1.0604 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | nan | 0.9325 | 0.9919 | | res2net101_26w_4s | 64 | 0.9967 | 0.9277 | 0.3243 | nan | 0.9285 | 1.015 | | 
lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7725 | 0.9152 | 0.9655 | | gluon_inception_v3 | 128 | 0.9902 | 0.8617 | nan | nan | 0.9138 | 1.0634 | | adv_inception_v3 | 128 | 0.9902 | 0.8617 | nan | nan | 0.9138 | 1.0635 | | inception_v3 | 128 | 0.9902 | 0.8617 | nan | nan | 0.9137 | 1.0634 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.8692 | 0.9127 | 0.9981 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | nan | 0.9078 | 1.0156 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9069 | 1.0515 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | nan | 0.9069 | 1.0618 | | dpn107 | 32 | 0.9985 | 0.9272 | 0.3392 | nan | 0.9059 | 0.9905 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8297 | 0.9052 | 1.0666 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | nan | 0.9049 | 0.9968 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.994 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | nan | 0.899 | 1.0046 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8973 | nan | nan | 0.8932 | 0.9946 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | nan | 0.8821 | 1.0206 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | nan | 0.8617 | 1.0396 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 0.7501 | 0.8563 | 1.0752 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7085 | nan | nan | 0.841 | 0.9709 | | coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | 0.7284 | 0.821 | 1.0246 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | nan | 0.7928 | 0.9926 | | resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | 0.6275 | 0.7899 | 0.7979 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.7257 | 0.7684 | 0.9902 | | convit_base | 64 | 0.9977 | 0.8838 | nan | 0.8762 | 0.7462 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8656 | 0.282 | 0.8418 | 0.6584 | 0.8853 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8622 | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

../test-dynamo-runner-logs-4/huggingface_float32.png : ![](https://i.imgur.com/KDAcSuC.png)

../test-dynamo-runner-logs-4/timm_models_float32.png : ![](https://i.imgur.com/b1ZoPsr.png)

../test-dynamo-runner-logs-4/torchbench_float32.png : ![](https://i.imgur.com/Tt3Kmbk.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 82%, 53/65 | 84%, 43/51  | 82%, 61/74  |
|       aot_eager        | 83%, 54/65 | 84%, 43/51  | 82%, 61/74  |
|     aot_cudagraphs     | 69%, 45/65 | 65%, 33/51  | 38%, 28/74  |
|    nvprims_nvfuser     | 48%, 31/65 | 78%, 40/51  | 26%, 19/74  |
|        inductor        | 75%, 49/65 | 82%, 42/51  | 81%, 60/74  |
| inductor_no_cudagraphs | 82%, 53/65 | 82%, 42/51  | 82%, 61/74  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.11x    |    1.04x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.03x    |    1.11x    |
|        inductor        |   1.50x    |    1.29x    |    1.25x    |
| inductor_no_cudagraphs |   1.24x    |    1.22x    |    1.23x    |
+------------------------+------------+-------------+-------------+
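
As a rough illustration of how a per-suite figure like those in the table above could be derived, the sketch below takes the geometric mean of per-model speedups. It assumes that entries reported as 0.0 (models that did not run or were removed after failing accuracy) are skipped. The function name `geomean_speedup` is hypothetical, and the actual dashboard script may aggregate differently.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups.

    Assumption: a speedup of 0.0 marks a model that produced no valid
    measurement, so it is excluded rather than dragging the mean to zero.
    """
    valid = [s for s in speedups if s > 0.0]
    if not valid:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow from a raw product.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# A few inductor speedups taken from the timm_models table in this thread.
sample = {
    "ghostnet_100": 1.8718,
    "lcnet_050": 1.6601,
    "resmlp_12_224": 0.7921,
    "tnt_s_patch16_224": 0.0,  # no measurement, skipped
}
print(f"{geomean_speedup(sample.values()):.2f}x")
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 0.5x slowdown cancel out to 1x, which an arithmetic mean would not do.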

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.16    |    2.43     |    1.91     |
|       aot_eager        |    5.77    |    7.84     |    7.05     |
|     aot_cudagraphs     |    8.60    |    16.10    |    13.16    |
|    nvprims_nvfuser     |   73.63    |   109.11    |   124.35    |
|        inductor        |   29.31    |    29.54    |    34.71    |
| inductor_no_cudagraphs |   28.61    |    25.45    |    33.28    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.85x    |    0.87x    |    0.84x    |
|        inductor        |   0.87x    |    0.72x    |    0.98x    |
| inductor_no_cudagraphs |   1.01x    |    0.96x    |    1.09x    |
+------------------------+------------+-------------+-------------+
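
The report does not spell out how the compression ratio is computed. A minimal sketch, assuming it is the eager peak memory divided by the compiled peak memory (which would be consistent with "higher is better"), looks like this. The function name and the byte counts are hypothetical.

```python
def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Peak memory compression ratio.

    Assumption: ratio = eager peak / compiled peak, so a value above 1.0
    means the compiled model used less peak memory than eager mode.
    """
    if compiled_peak_bytes <= 0:
        return float("nan")
    return eager_peak_bytes / compiled_peak_bytes


# Hypothetical measurements: eager peaks at 8 GiB, compiled at 10 GiB.
GIB = 1024 ** 3
ratio = compression_ratio(8 * GIB, 10 * GIB)
print(f"{ratio:.2f}x")  # below 1.0: the compiled run used more memory
```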

Warnings

We flag models where:
- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings
~~~
+-------------+------------------------+----------+------------------------+
|    suite    |          name          | inductor | inductor_no_cudagraphs |
+-------------+------------------------+----------+------------------------+
| torchbench  |     lennard_jones      |  1.7378  |         0.9441         |
| torchbench  |   soft_actor_critic    |  1.4286  |         0.9322         |
| torchbench  | nvidia_deeprecommender |  0.9036  |         0.9642         |
| torchbench  |          dlrm          |   0.0    |         1.0444         |
| torchbench  |     hf_GPT2_large      |   0.0    |         1.4742         |
| torchbench  |         hf_T5          |   0.0    |         1.5685         |
| torchbench  |       tacotron2        |   0.0    |         0.9028         |
| torchbench  |     hf_Longformer      |   0.0    |          0.0           |
| torchbench  |          moco          |   0.0    |          0.0           |
| huggingface | AllenaiLongformerBase  |   0.0    |          0.0           |
| timm_models |     resmlp_12_224      |  0.7921  |         0.8299         |
| timm_models |   tnt_s_patch16_224    |   0.0    |         1.5428         |
+-------------+------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+------------+-------------------+----------+------------------------+
|   suite    |       name        | inductor | inductor_no_cudagraphs |
+------------+-------------------+----------+------------------------+
| torchbench |      yolov3       | 371.9531 |        363.8208        |
| torchbench | timm_efficientdet | 122.8743 |        119.0122        |
+------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
|    suite    |                  name                   | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  |              timm_resnest               |  0.8982  |         1.0018         |
| torchbench  |                hf_Albert                |  0.8836  |         1.2212         |
| torchbench  |           mobilenet_v3_large            |  0.8829  |         0.896          |
| torchbench  |               hf_T5_large               |  0.8737  |         0.922          |
| torchbench  |      timm_vision_transformer_large      |  0.8622  |         1.0312         |
| torchbench  |                resnet50                 |  0.8564  |         0.9343         |
| torchbench  |               densenet121               |  0.8562  |         1.0006         |
| torchbench  |               mnasnet1_0                |  0.8531  |         0.8659         |
| torchbench  |              fastNLP_Bert               |  0.8354  |         1.1229         |
| torchbench  |                 hf_Bart                 |  0.8318  |         1.1277         |
| torchbench  |             resnext50_32x4d             |  0.8302  |         0.8356         |
| torchbench  |              BERT_pytorch               |  0.826   |         1.0815         |
| torchbench  |               hf_BigBird                |  0.8211  |         1.0391         |
| torchbench  |                  dcgan                  |  0.767   |         0.8875         |
| torchbench  |                   drq                   |  0.7632  |         0.8778         |
| torchbench  |               timm_vovnet               |  0.7609  |         0.9526         |
| torchbench  |         timm_vision_transformer         |  0.7517  |         0.8216         |
| torchbench  |            soft_actor_critic            |   0.75   |         0.9991         |
| torchbench  |                 alexnet                 |  0.743   |         0.8335         |
| torchbench  |                 hf_Bert                 |  0.7062  |         1.0016         |
| torchbench  |                resnet18                 |  0.6902  |         0.7049         |
| torchbench  |             LearningToPaint             |  0.6889  |         0.916          |
| torchbench  |                  vgg16                  |  0.6637  |         0.9553         |
| torchbench  |              hf_DistilBert              |  0.6595  |         0.9466         |
| torchbench  |              lennard_jones              |  0.5646  |         0.9989         |
| torchbench  |         nvidia_deeprecommender          |  0.5598  |         0.5598         |
| torchbench  |               hf_Reformer               |  0.5232  |         0.9892         |
| torchbench  |    attention_is_all_you_need_pytorch    |  0.4867  |         0.6781         |
| torchbench  |             pytorch_struct              |  0.4222  |         0.4335         |
| torchbench  |          functorch_dp_cifar10           |  0.4056  |         0.4214         |
| torchbench  |                  dlrm                   |   nan    |         0.7306         |
| huggingface |       AlbertForQuestionAnswering        |  0.8646  |         1.4039         |
| huggingface |                 T5Small                 |  0.8453  |         1.0606         |
| huggingface |     PegasusForConditionalGeneration     |  0.8436  |         1.0204         |
| huggingface |            AlbertForMaskedLM            |  0.842   |         1.3737         |
| huggingface |       T5ForConditionalGeneration        |  0.8215  |         1.1049         |
| huggingface |                 BigBird                 |  0.821   |         1.0085         |
| huggingface |             XGLMForCausalLM             |  0.8157  |         0.9642         |
| huggingface |     M2M100ForConditionalGeneration      |  0.8138  |         1.0093         |
| huggingface |               DistillGPT2               |  0.8057  |         0.9257         |
| huggingface |           ElectraForCausalLM            |  0.7929  |         0.9036         |
| huggingface |            YituTechConvBert             |  0.7888  |         0.8725         |
| huggingface |           PegasusForCausalLM            |  0.7774  |         0.931          |
| huggingface |      BartForConditionalGeneration       |  0.7734  |         0.9515         |
| huggingface |               GoogleFnet                |  0.7698  |         0.9372         |
| huggingface |       MT5ForConditionalGeneration       |  0.763   |         0.9406         |
| huggingface |    MegatronBertForQuestionAnswering     |  0.7528  |         0.9646         |
| huggingface |                CamemBert                |  0.7487  |         0.9186         |
| huggingface |            PLBartForCausalLM            |  0.7381  |         0.9055         |
| huggingface |     PLBartForConditionalGeneration      |  0.7238  |         0.9373         |
| huggingface |      MBartForConditionalGeneration      |  0.7209  |         0.9059         |
| huggingface |    LayoutLMForSequenceClassification    |  0.7189  |         1.0294         |
| huggingface |         MegatronBertForCausalLM         |  0.7161  |         0.9247         |
| huggingface |             BartForCausalLM             |  0.7149  |         0.9466         |
| huggingface |       BlenderbotSmallForCausalLM        |  0.7147  |         0.8647         |
| huggingface |       ElectraForQuestionAnswering       |  0.7054  |         1.0298         |
| huggingface |     DistilBertForQuestionAnswering      |  0.6981  |         0.9303         |
| huggingface | BlenderbotSmallForConditionalGeneration |  0.6977  |         0.946          |
| huggingface |           LayoutLMForMaskedLM           |  0.695   |         0.9772         |
| huggingface |            MBartForCausalLM             |  0.6836  |         0.8978         |
| huggingface |            TrOCRForCausalLM             |  0.6827  |         0.8876         |
| huggingface |         Speech2Text2ForCausalLM         |  0.6775  |         0.9179         |
| huggingface |             OPTForCausalLM              |  0.6764  |         0.8848         |
| huggingface |          DistilBertForMaskedLM          |  0.6531  |         0.9124         |
| huggingface |             BertForMaskedLM             |  0.6385  |         0.8992         |
| huggingface |           RobertaForCausalLM            |  0.6375  |         0.8974         |
| huggingface |        BertForQuestionAnswering         |  0.6329  |         0.8939         |
| huggingface |       RobertaForQuestionAnswering       |  0.6329  |         0.8939         |
| huggingface |          MobileBertForMaskedLM          |  0.5256  |         0.7111         |
| huggingface |     MobileBertForQuestionAnswering      |  0.4536  |         0.5968         |
| huggingface |           DebertaForMaskedLM            |  0.386   |         1.0347         |
| huggingface |       DebertaForQuestionAnswering       |  0.2902  |         1.1588         |
| timm_models |               selecsls42b               |  0.899   |         1.0046         |
| timm_models |         swsl_resnext101_32x16d          |  0.8932  |         0.9946         |
| timm_models |            res2net50_14w_8s             |  0.8821  |         1.0206         |
| timm_models |               regnety_002               |  0.8617  |         1.0396         |
| timm_models |              botnet26t_256              |  0.8605  |         0.9622         |
| timm_models |                pit_b_224                |  0.8563  |         1.0752         |
| timm_models |            sebotnet33ts_256             |  0.841   |         0.9709         |
| timm_models |             coat_lite_mini              |  0.821   |         1.0246         |
| timm_models |                gernet_l                 |  0.7928  |         0.9926         |
| timm_models |              resmlp_12_224              |  0.7899  |         0.7979         |
| timm_models |                repvgg_a2                |  0.7684  |         0.9902         |
| timm_models |               convit_base               |  0.7462  |         0.9008         |
| timm_models |             crossvit_9_240              |  0.6584  |         0.8853         |
| timm_models |            tnt_s_patch16_224            |   nan    |         0.8622         |
+-------------+-----------------------------------------+----------+------------------------+
~~~
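
The three warning thresholds stated in this section can be sketched as a simple per-measurement check. This is only an illustration: the function name `flag_model` and the label strings are hypothetical, and how the real report treats missing (`nan`) values is not covered here.

```python
# Thresholds as stated in the Warnings section of this report.
SPEEDUP_MIN = 0.95
COMPILE_LATENCY_MAX_SEC = 120.0
COMPRESSION_RATIO_MIN = 0.9


def flag_model(speedup, compile_latency_sec, compression_ratio):
    """Return the list of warning labels triggered by one measurement."""
    warnings = []
    if speedup < SPEEDUP_MIN:
        warnings.append("speedup")
    if compile_latency_sec > COMPILE_LATENCY_MAX_SEC:
        warnings.append("compile_latency")
    if compression_ratio < COMPRESSION_RATIO_MIN:
        warnings.append("compression_ratio")
    return warnings


# inductor numbers for two timm models, taken from the tables in this thread.
print(flag_model(0.7921, 22.1366, 0.7899))  # resmlp_12_224
print(flag_model(1.8718, 31.1712, 0.9853))  # ghostnet_100
```

Note that the warning tables list a model when either `inductor` or `inductor_no_cudagraphs` crosses a threshold, so a full implementation would presumably run this check once per compiler column.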

Metrics over time

../test-dynamo-runner-logs-4/passrate_over_time.png : ![](https://i.imgur.com/NKSTwk0.png)

../test-dynamo-runner-logs-4/geomean_over_time.png : ![](https://i.imgur.com/McJ0X3H.png)

Accuracy Regressions

For each relevant compiler, we compare the two most recent reports (among those that actually ran the compiler) to find models where previously passing accuracy tests now fail.

No accuracy regressions found.
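
The regression check described above amounts to diffing two status maps. The sketch below assumes each report can be reduced to a `{model_name: status}` dictionary per compiler and that any status beginning with `pass` (including `pass_due_to_skip`) counts as passing. The function name and the sample data are hypothetical.

```python
def find_accuracy_regressions(previous, current):
    """Models that passed accuracy in the previous report but not in the current one.

    Assumption: statuses starting with "pass" count as passing; models that are
    missing from the previous report are ignored, since nothing regressed.
    """
    regressions = []
    for name, status in current.items():
        before = previous.get(name)
        if before is None:
            continue
        if before.startswith("pass") and not status.startswith("pass"):
            regressions.append(name)
    return sorted(regressions)


# Hypothetical statuses for one compiler across two consecutive reports.
previous_report = {"model_a": "pass", "model_b": "pass", "model_c": "fail_to_run"}
current_report = {"model_a": "pass", "model_b": "fail_accuracy", "model_c": "fail_to_run"}
print(find_accuracy_regressions(previous_report, current_report))
```

An empty result corresponds to the "No accuracy regressions found" message.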

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0008 | 1.0057 | 2.3434 | 0.0 | 5.2693 | 1.2666 | | timm_efficientdet | 1 | 0.9803 | 0.8926 | 1.8373 | 0.0 | 4.2948 | 1.5047 | | functorch_dp_cifar10 | 64 | 1.0098 | 1.0288 | 2.1432 | 0.0 | 3.7607 | 1.2459 | | timm_vision_transformer | 8 | 1.0061 | 0.9367 | 1.5235 | 0.6774 | 2.597 | 1.4078 | | drq | 1 | 1.0063 | 0.8655 | 1.66 | 0.701 | 2.4435 | 1.064 | | BERT_pytorch | 16 | 1.0128 | 0.888 | 1.11 | 0.9921 | 2.0945 | 2.1387 | | resnext50_32x4d | 8 | 1.0028 | 1.1006 | 1.2921 | 0.0 | 2.0234 | 1.192 | | mobilenet_v3_large | 32 | 1.0036 | 1.1076 | 1.0129 | 0.0 | 1.9873 | 1.3401 | | resnet18 | 16 | 1.0019 | 1.1088 | 1.148 | 0.0 | 1.8543 | 1.2494 | | pytorch_struct | 200 | 0.9969 | 0.7519 | 0.8876 | 0.8095 | 1.8197 | 1.1619 | | squeezenet1_1 | 32 | 0.9946 | 1.0094 | 1.0664 | 0.8555 | 1.7465 | 1.2652 | | lennard_jones | 1000 | 0.9615 | 0.8552 | 1.0328 | 0.6864 | 1.7378 | 0.9441 | | hf_T5_large | 2 | 1.0245 | 0.9081 | 0.0 | 0.9845 | 1.6753 | 1.9295 | | dcgan | 32 | 0.9805 | 1.0136 | 1.2702 | 0.7708 | 1.6664 | 1.0562 | | hf_Albert | 8 | 1.0012 | 0.9963 | 0.7507 | 1.4773 | 1.6427 | 1.6398 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9993 | 1.0074 | 1.3055 | 0.8421 | 1.6241 | 1.3441 | | speech_transformer | 32 | 1.0061 | 0.9316 | 1.5091 | 0.8117 | 1.5487 | 1.5451 | | shufflenet_v2_x1_0 | 128 | 1.0027 | 1.0438 | 0.8067 | 0.0 | 1.5411 | 1.3854 | | timm_resnest | 32 | 0.9992 | 1.0022 | 0.8044 | 0.0 | 1.5171 | 1.4537 | | hf_GPT2 | 4 | 1.0075 | 0.9813 | 0.7396 | 0.4168 | 1.4972 | 1.4989 | | timm_nfnet | 128 | 0.9995 | 1.0001 | 0.0 | 1.2476 | 1.4723 | 1.4237 | | 
mnasnet1_0 | 32 | 1.001 | 1.0946 | 0.8619 | 0.0 | 1.4645 | 1.2723 |
| mobilenet_v2_quantized_qat | 96 | 1.0015 | 0.9797 | 0.0 | 0.0 | 1.4301 | 1.4311 |
| mobilenet_v2 | 96 | 0.9996 | 0.9989 | 0.7294 | 0.0 | 1.4289 | 1.4017 |
| soft_actor_critic | 256 | 0.9774 | 0.8054 | 1.0894 | 0.6863 | 1.4286 | 0.9322 |
| fastNLP_Bert | 6 | 0.999 | 0.9764 | 0.7511 | 1.1759 | 1.4211 | 1.3917 |
| resnet50_quantized_qat | 32 | 1.0004 | 0.973 | 0.0 | 0.0 | 1.3795 | 1.3803 |
| timm_efficientnet | 32 | 0.9541 | 0.8118 | 0.6972 | 0.0 | 1.3538 | 1.195 |
| LearningToPaint | 96 | 1.0012 | 1.049 | 0.8596 | 0.0 | 1.2663 | 1.1859 |
| pytorch_stargan | 16 | 0.9991 | 1.0766 | 0.933 | 0.0 | 1.2614 | 1.2286 |
| resnet50 | 32 | 0.999 | 0.9921 | 0.7608 | 0.0 | 1.2048 | 1.1686 |
| hf_Bart | 4 | 1.0124 | 0.973 | 0.7858 | 0.7878 | 1.2029 | 1.1957 |
| pytorch_unet | 1 | 0.9997 | 0.9975 | 0.8467 | 0.0 | 1.202 | 1.186 |
| hf_Bert | 4 | 1.0216 | 0.9963 | 0.7315 | 0.9151 | 1.2011 | 1.1818 |
| Super_SloMo | 6 | 0.9999 | 0.9982 | 0.8674 | 1.0023 | 1.1813 | 1.1645 |
| hf_DistilBert | 8 | 1.0008 | 0.9567 | 0.6866 | 0.5228 | 1.1729 | 1.1789 |
| vgg16 | 64 | 0.9998 | 0.999 | 0.8595 | 0.9977 | 1.1722 | 1.1668 |
| alexnet | 128 | 0.999 | 0.9971 | 0.8025 | 1.0043 | 1.1602 | 1.1631 |
| hf_Reformer | 4 | 0.9984 | 1.0012 | 0.9881 | 0.0 | 1.1311 | 1.14 |
| timm_regnet | 32 | 0.9637 | 0.9603 | 0.7797 | 0.0 | 1.126 | 1.0908 |
| Background_Matting | 4 | 1.0001 | 1.0212 | 0.8682 | 0.0 | 1.1155 | 1.1072 |
| yolov3 | 16 | 1.0 | 0.9945 | 0.7916 | 1.2029 | 1.0913 | 1.0786 |
| hf_BigBird | 2 | 0.9873 | 0.9345 | 0.9709 | 0.9006 | 1.0887 | 0.9962 |
| attention_is_all_you_need_pytorch | 256 | 1.0003 | 0.968 | 0.756 | 0.9804 | 1.0642 | 1.0483 |
| timm_vision_transformer_large | 8 | 0.9993 | 0.9953 | 0.0 | 0.976 | 1.0492 | 1.0361 |
| timm_vovnet | 32 | 0.9089 | 0.9042 | 0.7153 | 0.0 | 1.007 | 1.0165 |
| tts_angular | 64 | 0.9884 | 0.9598 | 0.9853 | 0.9695 | 1.0069 | 1.0177 |
| demucs | 4 | 0.9995 | 0.9998 | 0.9996 | 1.0002 | 1.0002 | 1.0002 |
| nvidia_deeprecommender | 256 | 0.9987 | 0.963 | 0.5847 | 0.976 | 0.9036 | 0.9642 |
| dlrm | 2048 | 0.0 | 1.0515 | 0.0 | 0.9973 | 0.0 | 1.0444 |
| hf_GPT2_large | 4 | 0.9991 | 0.9798 | 0.0 | 0.5989 | 0.0 | 1.4742 |
| hf_T5 | 8 | 0.9993 | 0.953 | 0.0 | 1.247 | 0.0 | 1.5685 |
| tacotron2 | 64 | 0.9754 | 0.8418 | 0.0 | 0.0 | 0.0 | 0.9028 |
| hf_Longformer | 2 | 0.9473 | 0.8798 | 0.8034 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientdet | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| tts_angular | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| yolov3 | 2 | pass | pass | pass | pass | pass | pass |
| dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass |
| hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | fail_to_run | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Albert | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Bart | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass |
| hf_BigBird | 2 | pass | pass | pass | pass | pass | pass |
| hf_Bert | 2 | pass | pass | pass | pass | pass | pass |
| tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass |
| hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 |
| resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy |
| mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3 | 16 | 2.8614 | 7.0158 | 10.0377 | 109.6599 | 371.9531 | 363.8208 |
| timm_efficientdet | 1 | 19.224 | 33.2178 | 66.224 | nan | 122.8743 | 119.0122 |
| hf_T5_large | 2 | 13.8547 | 35.3758 | nan | 426.7214 | 102.4023 | 100.2926 |
| timm_vision_transformer_large | 8 | 2.2387 | 11.1578 | nan | 253.043 | 50.511 | 49.0287 |
| attention_is_all_you_need_pytorch | 256 | 1.1049 | 5.4952 | 8.92 | 108.9814 | 45.5129 | 44.4213 |
| densenet121 | 4 | 2.0417 | 9.6272 | 15.6198 | nan | 41.6513 | 40.5756 |
| timm_resnest | 32 | 0.5392 | 2.0095 | 3.0833 | nan | 39.8511 | 38.54 |
| hf_BigBird | 2 | 7.4753 | 12.9008 | 25.7752 | 84.9528 | 37.7178 | 25.416 |
| timm_vision_transformer | 8 | 0.7547 | 3.4535 | 4.9656 | 61.655 | 32.2756 | 29.7435 |
| hf_Bart | 4 | 1.573 | 6.4352 | 10.845 | 118.5196 | 28.5612 | 27.4618 |
| timm_nfnet | 128 | 1.914 | 6.2307 | nan | 131.6158 | 27.2858 | 27.0435 |
| BERT_pytorch | 16 | 1.4301 | 5.9278 | 8.9954 | 83.4438 | 26.7428 | 26.3124 |
| pytorch_stargan | 16 | 0.3876 | 1.7235 | 2.509 | nan | 26.573 | 26.3066 |
| resnet50_quantized_qat | 32 | 1.1032 | 7.0465 | nan | nan | 26.3433 | 26.4722 |
| mobilenet_v2_quantized_qat | 96 | 1.2571 | 7.2017 | nan | nan | 25.989 | 25.9592 |
| fastNLP_Bert | 6 | 1.4423 | 5.23 | 9.1513 | 88.2481 | 25.6569 | 24.2144 |
| speech_transformer | 32 | 1.607 | 6.8204 | 25.7941 | 117.8391 | 25.4129 | 25.0411 |
| timm_regnet | 32 | 2.2012 | 6.5009 | 17.8336 | nan | 23.0356 | 22.88 |
| mobilenet_v3_large | 32 | 0.8264 | 3.8889 | 5.7435 | nan | 22.7694 | 22.1405 |
| timm_efficientnet | 32 | 1.6793 | 5.6688 | 13.8038 | nan | 22.1784 | 21.7219 |
| pytorch_struct | 200 | 0.2413 | 0.6161 | 1.1654 | 4.0189 | 19.5008 | 18.2188 |
| hf_Reformer | 4 | 1.6925 | 2.885 | 5.6044 | nan | 19.2174 | 15.9965 |
| hf_Bert | 4 | 1.5142 | 5.2937 | 7.9301 | 89.0286 | 18.2225 | 17.5742 |
| mnasnet1_0 | 32 | 0.763 | 3.4271 | 5.2587 | nan | 18.0671 | 17.6162 |
| shufflenet_v2_x1_0 | 128 | 0.9168 | 4.0663 | 6.2239 | nan | 17.7175 | 16.8712 |
| timm_vovnet | 32 | 1.4409 | 3.7788 | 8.8736 | nan | 17.5028 | 17.2754 |
| resnet50 | 32 | 0.8201 | 3.7567 | 5.5967 | nan | 17.4673 | 16.9844 |
| hf_Albert | 8 | 1.1841 | 4.5928 | 7.5068 | 103.8845 | 17.215 | 16.4293 |
| resnext50_32x4d | 8 | 0.8406 | 3.7221 | 5.762 | nan | 16.9006 | 16.3333 |
| hf_GPT2 | 4 | 1.4463 | 5.1416 | 7.63 | 69.0378 | 16.7157 | 16.1243 |
| Super_SloMo | 6 | 0.9714 | 3.9762 | 5.5713 | 32.2723 | 16.4381 | 15.5588 |
| Background_Matting | 4 | 0.6921 | 3.5676 | 5.501 | nan | 15.9924 | 15.0031 |
| mobilenet_v2 | 96 | 0.7311 | 3.7079 | 5.8611 | nan | 15.8456 | 16.0991 |
| functorch_dp_cifar10 | 64 | 0.3423 | 1.3407 | 2.0217 | nan | 12.2127 | 12.3331 |
| hf_DistilBert | 8 | 0.6109 | 2.5533 | 4.5332 | 40.5139 | 11.6684 | 11.4337 |
| resnet18 | 16 | 0.3851 | 1.4827 | 2.1284 | nan | 10.6175 | 10.2896 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.3667 | 1.5555 | 2.2852 | 30.6837 | 7.8733 | 7.6621 |
| pytorch_unet | 1 | 0.4249 | 1.6126 | 2.4816 | nan | 7.7689 | 7.4441 |
| LearningToPaint | 96 | 0.4226 | 1.5308 | 2.3427 | nan | 6.8033 | 6.6982 |
| squeezenet1_1 | 32 | 0.1909 | 0.659 | 1.0197 | 4.2894 | 3.9135 | 3.5092 |
| drq | 1 | 0.2866 | 0.5031 | 0.8449 | 4.0736 | 3.653 | 3.3213 |
| soft_actor_critic | 256 | 0.2006 | 0.2947 | 0.5216 | 1.515 | 3.364 | 2.8142 |
| vgg16 | 64 | 0.186 | 0.4632 | 0.8377 | 2.7182 | 3.3332 | 3.2707 |
| nvidia_deeprecommender | 256 | 0.1909 | 0.3714 | 0.6361 | 4.5277 | 3.213 | 2.9493 |
| alexnet | 128 | 0.1474 | 0.3139 | 0.5577 | 2.9115 | 2.8864 | 2.6028 |
| dcgan | 32 | 0.1651 | 0.3577 | 0.5697 | 4.2487 | 2.5997 | 2.3809 |
| lennard_jones | 1000 | 0.1361 | 0.2436 | 0.3939 | 1.2155 | 2.0081 | 1.7488 |
| tts_angular | 64 | 0.2053 | 0.2465 | 0.3741 | 1.0179 | 1.8876 | 1.7878 |
| demucs | 4 | 0.2968 | 0.2938 | 0.3021 | 0.2903 | 0.204 | 0.2033 |
| tacotron2 | 64 | 17.3452 | 29.1381 | nan | nan | nan | 63.1371 |
| hf_GPT2_large | 4 | 5.1006 | 15.8775 | nan | 231.6096 | nan | 41.0449 |
| hf_T5 | 8 | 2.4009 | 7.6274 | nan | 67.3711 | nan | 26.4199 |
| dlrm | 2048 | nan | 0.7163 | nan | 2.7078 | nan | 2.9103 |
| hf_Longformer | 2 | 6.1844 | 12.9431 | 57.4587 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | nan | 1.5819 | 1.5819 |
| resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | nan | 1.4874 | 1.4867 |
| timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2634 | nan | 1.3107 | 1.3923 |
| Super_SloMo | 6 | 1.0024 | 0.9527 | 0.3631 | 0.9528 | 1.2027 | 1.4002 |
| mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | nan | 1.1743 | 1.2832 |
| timm_efficientdet | 1 | 1.011 | 0.823 | 0.289 | nan | 1.1162 | 1.1442 |
| squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.8136 | 1.0823 | 1.1864 |
| speech_transformer | 32 | 0.9977 | 0.9148 | 0.2708 | 0.8942 | 1.0389 | 1.0454 |
| timm_nfnet | 128 | 0.936 | 0.8937 | nan | 0.8898 | 1.0219 | 1.0963 |
| demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 |
| Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | nan | 0.9832 | 1.0394 |
| tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 |
| shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | nan | 0.9814 | 1.0418 |
| hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3702 | 0.8845 | 0.9703 | 1.1374 |
| timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | nan | 0.9406 | 1.0831 |
| yolov3 | 16 | 0.9957 | 0.844 | 0.3341 | 0.8182 | 0.9237 | 1.1052 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9981 | 0.9166 | 0.3915 | 0.8952 | 0.9169 | 0.9991 |
| pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | nan | 0.9118 | 1.105 |
| pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 |
| timm_resnest | 32 | 0.9931 | 0.8807 | 0.3236 | nan | 0.8982 | 1.0018 |
| hf_Albert | 8 | 0.9332 | 0.9332 | 0.2846 | 0.7425 | 0.8836 | 1.2212 |
| mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3278 | nan | 0.8829 | 0.896 |
| hf_T5_large | 2 | 0.922 | 0.8673 | nan | 0.8425 | 0.8737 | 0.922 |
| timm_vision_transformer_large | 8 | 0.9998 | 0.8416 | nan | 0.8374 | 0.8622 | 1.0312 |
| resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | nan | 0.8564 | 0.9343 |
| densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | nan | 0.8562 | 1.0006 |
| mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | nan | 0.8531 | 0.8659 |
| fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 0.906 | 0.8354 | 1.1229 |
| hf_Bart | 4 | 0.9617 | 0.8772 | 0.3385 | 0.8568 | 0.8318 | 1.1277 |
| resnext50_32x4d | 8 | 0.9952 | 0.8668 | 0.3592 | nan | 0.8302 | 0.8356 |
| BERT_pytorch | 16 | 1.0 | 0.898 | 0.3505 | 0.8837 | 0.826 | 1.0815 |
| hf_BigBird | 2 | 0.9608 | 0.9608 | 0.4299 | 0.9608 | 0.8211 | 1.0391 |
| dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.767 | 0.8875 |
| drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 |
| timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | nan | 0.7609 | 0.9526 |
| timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3313 | 0.8772 | 0.7517 | 0.8216 |
| soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9555 | 0.75 | 0.9991 |
| alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7449 | 0.743 | 0.8335 |
| hf_Bert | 4 | 0.9683 | 0.9018 | 0.3526 | 0.8929 | 0.7062 | 1.0016 |
| resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | nan | 0.6902 | 0.7049 |
| LearningToPaint | 96 | 0.9471 | 0.7168 | 0.3387 | nan | 0.6889 | 0.916 |
| vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.6638 | 0.6637 | 0.9553 |
| hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3213 | 0.887 | 0.6595 | 0.9466 |
| lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 |
| nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 |
| hf_Reformer | 4 | 0.9872 | 0.9865 | 0.5793 | nan | 0.5232 | 0.9892 |
| attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2963 | 0.9139 | 0.4867 | 0.6781 |
| pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5079 | 0.4222 | 0.4335 |
| functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | 0.4056 | 0.4214 |
| tacotron2 | 64 | 0.9906 | 1.0301 | nan | nan | nan | 1.1623 |
| hf_T5 | 8 | 0.9527 | 0.9415 | nan | 0.8724 | nan | 1.1507 |
| hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | 0.876 | nan | 1.1258 |
| dlrm | 2048 | nan | 0.7306 | nan | 0.7305 | nan | 0.7306 |
| hf_Longformer | 2 | 0.9603 | 0.9604 | 0.2944 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with float32 precision

Performance speedup

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0344 | 0.8988 | 1.7609 | 0.7669 | 3.2462 | 1.4282 |
| CamemBert | 1 | 1.0489 | 0.9111 | 1.3153 | 0.7487 | 2.3839 | 1.4892 |
| MT5ForConditionalGeneration | 8 | 1.0249 | 0.9058 | 1.197 | 1.0478 | 2.2642 | 1.9968 |
| DistillGPT2 | 1 | 1.0362 | 0.9281 | 1.0569 | 0.2843 | 2.1735 | 1.7704 |
| MobileBertForMaskedLM | 32 | 1.0219 | 0.9277 | 1.1471 | 0.0 | 2.1432 | 1.5437 |
| GoogleFnet | 1 | 0.9781 | 0.7916 | 0.9608 | 0.6787 | 1.8333 | 1.1417 |
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9777 | 0.0 | 0.7332 | 1.796 | 1.7868 |
| M2M100ForConditionalGeneration | 8 | 1.1668 | 0.8916 | 0.8688 | 0.8792 | 1.4677 | 1.3152 |
| T5ForConditionalGeneration | 4 | 1.0045 | 0.9328 | 0.7238 | 1.1659 | 1.4575 | 1.4377 |
| ElectraForQuestionAnswering | 64 | 1.0001 | 0.984 | 0.0 | 1.2717 | 1.4259 | 1.4061 |
| ElectraForCausalLM | 32 | 1.0002 | 0.9308 | 0.0 | 1.0449 | 1.4126 | 1.447 |
| MobileBertForQuestionAnswering | 64 | 1.0269 | 0.899 | 0.8661 | 0.0 | 1.4009 | 1.3149 |
| LayoutLMForSequenceClassification | 16 | 0.9999 | 0.9888 | 0.7371 | 1.1677 | 1.3004 | 1.2892 |
| T5Small | 1 | 1.0283 | 0.898 | 1.0214 | 1.0075 | 1.2743 | 1.1416 |
| AlbertForQuestionAnswering | 4 | 1.0013 | 1.0016 | 0.0 | 1.2136 | 1.2615 | 1.259 |
| AlbertForMaskedLM | 4 | 1.0002 | 0.9995 | 0.0 | 1.2086 | 1.2555 | 1.2542 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9694 | 0.0 | 1.0981 | 1.2117 | 1.2128 |
| PLBartForConditionalGeneration | 16 | 1.0171 | 0.9677 | 0.82 | 0.8295 | 1.2074 | 1.2039 |
| OPTForCausalLM | 32 | 1.001 | 0.9321 | 0.7133 | 0.4583 | 1.1814 | 1.2322 |
| XGLMForCausalLM | 8 | 1.0134 | 0.8793 | 0.7416 | 0.3262 | 1.1703 | 1.183 |
| DistilBertForQuestionAnswering | 64 | 0.9996 | 0.985 | 0.713 | 0.5283 | 1.1701 | 1.151 |
| RobertaForCausalLM | 64 | 1.0005 | 0.9613 | 0.7458 | 0.9897 | 1.1479 | 1.1508 |
| MegatronBertForQuestionAnswering | 16 | 1.0391 | 1.0134 | 0.7678 | 0.904 | 1.1423 | 1.1242 |
| Speech2Text2ForCausalLM | 128 | 0.9987 | 0.9247 | 0.6616 | 0.9473 | 1.1342 | 1.152 |
| MegatronBertForCausalLM | 16 | 1.0352 | 1.0109 | 0.7389 | 0.9715 | 1.1289 | 1.1169 |
| BertForQuestionAnswering | 128 | 1.0003 | 0.9934 | 0.0 | 1.0534 | 1.1144 | 1.1076 |
| RobertaForQuestionAnswering | 128 | 1.0002 | 0.9929 | 0.0 | 1.0538 | 1.1124 | 1.1142 |
| BartForConditionalGeneration | 2 | 1.0002 | 0.9869 | 0.0 | 0.4455 | 1.1005 | 1.0887 |
| BartForCausalLM | 4 | 1.0008 | 0.9659 | 0.7558 | 1.0034 | 1.0903 | 1.1102 |
| BigBird | 1 | 0.9842 | 0.9253 | 0.9888 | 0.8937 | 1.0902 | 0.9951 |
| PegasusForConditionalGeneration | 16 | 1.01 | 0.9642 | 0.7552 | 0.9091 | 1.0885 | 1.0682 |
| MBartForConditionalGeneration | 16 | 1.0101 | 0.9844 | 0.7644 | 0.9354 | 1.0882 | 1.1586 |
| DebertaForMaskedLM | 4 | 0.9045 | 0.7846 | 0.723 | 0.6431 | 1.0785 | 1.0406 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0007 | 0.9255 | 0.0 | 0.9561 | 1.0642 | 1.0726 |
| BertForMaskedLM | 64 | 1.0001 | 0.9609 | 0.7301 | 0.9877 | 1.0587 | 1.0605 |
| DistilBertForMaskedLM | 64 | 0.9998 | 0.9507 | 0.7124 | 0.618 | 1.0496 | 1.0677 |
| DebertaForQuestionAnswering | 8 | 0.996 | 0.966 | 0.6825 | 0.8678 | 1.0489 | 1.2207 |
| PLBartForCausalLM | 32 | 1.0063 | 0.9333 | 0.718 | 0.9233 | 1.0279 | 1.0546 |
| BlenderbotSmallForCausalLM | 64 | 1.0012 | 0.9104 | 0.6832 | 0.9228 | 1.0063 | 1.043 |
| TrOCRForCausalLM | 32 | 1.0008 | 0.9558 | 0.7333 | 0.9509 | 1.0037 | 1.014 |
| MBartForCausalLM | 32 | 1.0004 | 0.9539 | 0.7319 | 0.956 | 0.9984 | 1.0098 |
| PegasusForCausalLM | 32 | 0.9994 | 0.9522 | 0.7318 | 0.9518 | 0.991 | 1.0027 |
| AllenaiLongformerBase | 1 | 0.9248 | 0.8421 | 0.7665 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| BartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | fail_to_run | pass | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | fail_to_run | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| BigBird | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass | pass | pass |
| GoogleFnet | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | pass | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 4.8696 | 10.4793 | 34.4488 | 80.102 | 95.1799 | 33.9856 |
| DebertaForMaskedLM | 4 | 4.8903 | 10.1684 | 39.0136 | 82.2904 | 89.5165 | 32.8486 |
| XGLMForCausalLM | 8 | 2.4373 | 10.1151 | 22.2057 | 184.0347 | 67.472 | 64.5787 |
| M2M100ForConditionalGeneration | 8 | 2.6255 | 12.8551 | 20.3877 | 240.1962 | 50.6846 | 53.7745 |
| MobileBertForMaskedLM | 32 | 8.2909 | 24.9714 | 41.3933 | nan | 48.7079 | 47.9116 |
| MobileBertForQuestionAnswering | 64 | 8.3929 | 23.8292 | 41.2316 | nan | 48.0873 | 48.1985 |
| BartForConditionalGeneration | 2 | 3.0164 | 12.3913 | nan | 261.2649 | 43.1544 | 40.7754 |
| PegasusForConditionalGeneration | 16 | 2.8098 | 12.217 | 20.3838 | 266.7147 | 42.5357 | 39.3584 |
| MBartForConditionalGeneration | 16 | 3.0064 | 12.8568 | 22.1754 | 271.3029 | 41.265 | 39.9633 |
| YituTechConvBert | 1 | 2.2847 | 8.4615 | 12.8935 | 128.5077 | 39.1252 | 36.8927 |
| BigBird | 1 | 7.4673 | 13.2271 | 25.8571 | 97.2564 | 37.3978 | 24.4872 |
| MegatronBertForCausalLM | 16 | 3.25 | 10.8935 | 16.6483 | 190.2921 | 32.5107 | 31.4699 |
| MegatronBertForQuestionAnswering | 16 | 3.2629 | 10.8829 | 17.1363 | 188.8132 | 32.2158 | 30.6754 |
| MT5ForConditionalGeneration | 8 | 3.7736 | 11.2854 | 17.9664 | 104.4138 | 31.3498 | 30.5518 |
| T5ForConditionalGeneration | 4 | 2.4031 | 8.0927 | 12.7737 | 67.9725 | 29.6106 | 28.1192 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.9057 | 8.3398 | nan | 164.3311 | 28.9149 | 27.9222 |
| T5Small | 1 | 2.4009 | 7.7054 | 11.553 | 70.5699 | 28.2884 | 27.324 |
| LayoutLMForSequenceClassification | 16 | 1.8371 | 5.7627 | 9.2001 | 90.5694 | 27.2105 | 25.9046 |
| PLBartForConditionalGeneration | 16 | 1.6054 | 6.6586 | 10.115 | 117.193 | 25.7334 | 25.1247 |
| ElectraForCausalLM | 32 | 1.5128 | 5.4868 | nan | 88.7785 | 25.6426 | 23.597 |
| PegasusForCausalLM | 32 | 1.1507 | 4.9241 | 7.9631 | 86.0692 | 21.1082 | 19.9817 |
| MBartForCausalLM | 32 | 1.1314 | 4.719 | 7.6295 | 89.0267 | 20.6058 | 20.1791 |
| GoogleFnet | 1 | 0.9536 | 2.926 | 9.0179 | 70.125 | 20.3296 | 13.4172 |
| LayoutLMForMaskedLM | 16 | 1.9758 | 5.8564 | nan | 87.4187 | 20.3206 | 19.4557 |
| BertForMaskedLM | 64 | 1.5049 | 5.2893 | 7.9608 | 90.3134 | 19.7607 | 19.0687 |
| TrOCRForCausalLM | 32 | 1.1652 | 4.9065 | 7.5491 | 89.377 | 19.5229 | 18.252 |
| ElectraForQuestionAnswering | 64 | 1.495 | 5.3576 | nan | 87.376 | 19.2805 | 18.7191 |
| RobertaForCausalLM | 64 | 1.4981 | 5.9284 | 8.172 | 90.7714 | 19.2058 | 18.4299 |
| BertForQuestionAnswering | 128 | 1.4996 | 5.37 | nan | 86.7613 | 19.0338 | 18.2877 |
| BartForCausalLM | 4 | 1.2393 | 4.7412 | 7.3513 | 89.429 | 18.9341 | 18.4051 |
| RobertaForQuestionAnswering | 128 | 1.5276 | 5.5296 | nan | 89.6935 | 18.2219 | 17.4613 |
| CamemBert | 1 | 1.5741 | 5.4813 | 7.5863 | 97.4246 | 17.7886 | 18.1791 |
| OPTForCausalLM | 32 | 1.2069 | 4.8382 | 9.4313 | 85.7846 | 17.089 | 16.6391 |
| GPT2ForSequenceClassification | 4 | 1.4922 | 5.3664 | nan | 70.7582 | 16.288 | 15.8247 |
| AlbertForMaskedLM | 4 | 1.2941 | 4.7028 | nan | 103.089 | 16.2048 | 15.0298 |
| AlbertForQuestionAnswering | 4 | 1.2907 | 4.7446 | nan | 100.8213 | 15.793 | 14.9597 |
| Speech2Text2ForCausalLM | 128 | 0.7228 | 2.6601 | 4.147 | 36.6383 | 14.6927 | 13.351 |
| BlenderbotSmallForCausalLM | 64 | 0.7996 | 3.2077 | 4.9571 | 54.4288 | 14.2352 | 13.6993 |
| PLBartForCausalLM | 32 | 0.6579 | 2.7688 | 3.8771 | 42.7143 | 13.2429 | 13.0004 |
| DistillGPT2 | 1 | 0.8116 | 2.6374 | 3.9301 | 39.9989 | 12.4299 | 12.0662 |
| DistilBertForMaskedLM | 64 | 0.6267 | 2.6339 | 4.5363 | 42.5658 | 11.315 | 10.7331 |
| DistilBertForQuestionAnswering | 64 | 0.6283 | 2.6887 | 4.4822 | 39.0327 | 10.7323 | 10.169 |
| AllenaiLongformerBase | 1 | 6.2745 | 13.2036 | 57.3764 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 0.8955 | 1.0595 | 1.1224 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.5681 | 0.8646 | 1.4039 |
| T5Small | 1 | 1.0 | 0.9029 | 0.3414 | 0.8577 | 0.8453 | 1.0606 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | 0.9642 | 0.8436 | 1.0204 |
| AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.5667 | 0.842 | 1.3737 |
| T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.9093 | 0.8215 | 1.1049 |
| BigBird | 1 | 0.9979 | 0.9536 | 0.4208 | 0.9117 | 0.821 | 1.0085 |
| XGLMForCausalLM | 8 | 0.9848 | 0.9137 | 0.3971 | 0.9267 | 0.8157 | 0.9642 |
| M2M100ForConditionalGeneration | 8 | 1.0217 | 0.9507 | 0.3799 | 0.9742 | 0.8138 | 1.0093 |
| DistillGPT2 | 1 | 0.9984 | 0.8113 | 0.3769 | 0.76 | 0.8057 | 0.9257 |
| ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.7909 | 0.7929 | 0.9036 |
| YituTechConvBert | 1 | 0.9863 | 0.8573 | 0.3681 | 0.8286 | 0.7888 | 0.8725 |
| PegasusForCausalLM | 32 | 0.9594 | 0.8885 | 0.3909 | 0.9232 | 0.7774 | 0.931 |
| BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.8866 | 0.7734 | 0.9515 |
| GoogleFnet | 1 | 0.9979 | 0.9451 | 0.3715 | 0.9293 | 0.7698 | 0.9372 |
| MT5ForConditionalGeneration | 8 | 1.0037 | 0.8873 | 0.4151 | 0.8853 | 0.763 | 0.9406 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | 0.8549 | 0.7528 | 0.9646 |
| CamemBert | 1 | 0.998 | 0.8252 | 0.3612 | 0.7949 | 0.7487 | 0.9186 |
| PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.861 | 0.7381 | 0.9055 |
| PLBartForConditionalGeneration | 16 | 0.9998 | 0.8959 | 0.3581 | 0.872 | 0.7238 | 0.9373 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | 0.3438 | 0.8566 | 0.7209 | 0.9059 |
| LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 0.9204 | 0.7189 | 1.0294 |
| MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | 0.8713 | 0.7161 | 0.9247 |
| BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.8956 | 0.7149 | 0.9466 |
| BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.8401 | 0.7147 | 0.8647 |
| ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 0.9357 | 0.7054 | 1.0298 |
| DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3178 | 0.8865 | 0.6981 | 0.9303 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 0.8975 | 0.6977 | 0.946 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.8883 | 0.695 | 0.9772 |
| MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3743 | 0.89 | 0.6836 | 0.8978 |
| TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | 0.3743 | 0.8898 | 0.6827 | 0.8876 |
| Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.8765 | 0.6775 | 0.9179 |
| OPTForCausalLM | 32 | 0.9982 | 0.8657 | 0.3606 | 0.7895 | 0.6764 | 0.8848 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.8016 | 0.6531 | 0.9124 |
| BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.855 | 0.6385 | 0.8992 |
| RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3641 | 0.8538 | 0.6375 | 0.8974 |
| BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.9303 | 0.6329 | 0.8939 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 0.9303 | 0.6329 | 0.8939 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | 0.3242 | nan | 0.5256 | 0.7111 |
| MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | 0.2587 | nan | 0.4536 | 0.5968 |
| DebertaForMaskedLM | 4 | 1.0 | 0.9843 | 0.3552 | 0.9262 | 0.386 | 1.0347 |
| DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.063 | 0.2902 | 1.1588 |
| AllenaiLongformerBase | 1 | 0.9982 | 0.9521 | 0.3208 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
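
For readers who want to aggregate one column of the per-model speedup table themselves, here is a minimal Python sketch of a geometric mean. It rests on two assumptions that this report does not state outright: that a speedup of `0.0` marks a model that failed to run or failed accuracy for that backend, and that such entries are dropped before averaging. The helper name `geomean_speedup` is hypothetical and is not part of the benchmark scripts.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive entries in `speedups`.

    Assumption: entries <= 0.0 mark failed models and are skipped.
    Returns NaN when no positive entry remains.
    """
    valid = [s for s in speedups if s > 0.0]
    if not valid:
        return float("nan")
    # exp(mean(log(x))) avoids overflow from multiplying many ratios.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# inductor speedups for three rows copied from the huggingface table above
inductor = {
    "YituTechConvBert": 3.2462,
    "CamemBert": 2.3839,
    "AllenaiLongformerBase": 0.0,  # skipped under the assumption above
}
print(f"inductor geomean over passing models: {geomean_speedup(inductor.values()):.4f}")
```

The result over three rows is only illustrative; a suite-level figure would need every passing row of the column.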

timm_models suite with float32 precision

Performance speedup

~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| ghostnet_100 | 128 | 0.9994 | 0.9731 | 0.8183 | 0.0 | 1.8718 | 1.8284 |
| lcnet_050 | 128 | 0.9558 | 0.9489 | 0.7699 | 1.3477 | 1.6601 | 1.6228 |
| regnety_002 | 128 | 0.9757 | 1.0017 | 0.8619 | 0.0 | 1.4928 | 1.3259 |
| dm_nfnet_f0 | 128 | 0.9999 | 0.9997 | 0.0 | 1.2524 | 1.4716 | 1.4239 |
| xcit_large_24_p8_224 | 5 | 1.0025 | 0.9839 | 0.7787 | 0.0 | 1.4359 | 1.3257 |
| hrnet_w18 | 128 | 0.9999 | 0.9983 | 0.0 | 0.0 | 1.4165 | 1.3777 |
| dla102 | 128 | 0.9999 | 1.0006 | 0.0 | 0.0 | 1.3836 | 1.3692 |
| volo_d1_224 | 64 | 1.0 | 0.9945 | 0.802 | 0.0 | 1.3817 | 1.36 |
| nfnet_l0 | 128 | 0.9996 | 0.789 | 0.0 | 1.2306 | 1.3724 | 1.3282 |
| res2net50_14w_8s | 128 | 0.9998 | 0.9992 | 0.0 | 0.0 | 1.3566 | 1.3244 |
| mobilenetv3_large_100 | 128 | 0.9658 | 0.9618 | 0.7658 | 0.0 | 1.3373 | 1.3431 |
| mobilenetv2_100 | 128 | 0.9647 | 0.9637 | 0.7075 | 0.0 | 1.3369 | 1.354 |
| coat_lite_mini | 128 | 0.9999 | 0.9834 | 0.8344 | 1.1056 | 1.333 | 1.3212 |
| inception_v3 | 128 | 0.9999 | 0.996 | 0.0 | 0.0 | 1.3299 | 1.3084 |
| gluon_inception_v3 | 128 | 0.9999 | 0.9984 | 0.0 | 0.0 | 1.3281 | 1.3084 |
| adv_inception_v3 | 128 | 1.0 | 0.9989 | 0.0 | 0.0 | 1.3237 | 1.3076 |
| crossvit_9_240 | 128 | 0.9997 | 0.9982 | 0.7599 | 1.0529 | 1.3213 | 1.3008 |
| resnest101e | 64 | 0.9996 | 1.003 | 0.0 | 0.0 | 1.3157 | 1.2707 |
| res2next50 | 128 | 0.9999 | 1.0007 | 0.0 | 0.0 | 1.3098 | 1.2736 |
| jx_nest_base | 32 | 1.0003 | 0.9955 | 0.7311 | 0.0 | 1.2777 | 1.2486 |
| fbnetv3_b | 128 | 0.9642 | 0.9607 | 0.7578 | 0.0 | 1.2759 | 1.2981 |
| sebotnet33ts_256 | 64 | 0.9758 | 0.803 | 0.0 | 0.0 | 1.2673 | 1.2692 |
| selecsls42b | 128 | 0.9999 | 0.9988 | 0.8164 | 0.0 | 1.2673 | 1.2531 |
| eca_botnext26ts_256 | 128 | 0.9867 | 0.7712 | 0.0 | 0.0 | 1.2659 | 1.2526 |
| gmixer_24_224 | 128 | 0.9999 | 0.8097 | 0.0 | 1.0484 | 1.2617 | 1.2341 |
| eca_halonext26ts | 128 | 0.9871 | 0.7786 | 0.0 | 0.0 | 1.2592 | 1.244 |
| botnet26t_256 | 128 | 0.9856 | 0.9814 | 0.7881 | 0.0 | 1.2575 | 1.2606 |
| mnasnet_100 | 128 | 0.966 | 0.9637 | 0.7877 | 0.0 | 1.2555 | 1.2822 |
| tf_efficientnet_b0 | 128 | 0.9767 | 0.7831 | 0.0 | 0.0 | 1.2551 | 1.2683 |
| fbnetc_100 | 128 | 0.9669 | 0.9628 | 0.7918 | 0.0 | 1.2497 | 1.2646 |
| ese_vovnet19b_dw | 128 | 0.9791 | 0.9776 | 0.7447 | 0.0 | 1.2409 | 1.2475 |
| spnasnet_100 | 128 | 0.961 | 0.9576 | 0.775 | 0.0 | 1.2373 | 1.253 |
| res2net101_26w_4s | 64 | 0.9999 | 0.9971 | 0.7756 | 0.0 | 1.2236 | 1.1884 |
| convit_base | 64 | 0.9997 | 0.9981 | 0.0 | 1.3105 | 1.2196 | 1.2094 |
| rexnet_100 | 128 | 0.9732 | 0.8157 | 0.0 | 0.0 | 1.212 | 1.2191 |
| cspdarknet53 | 64 | 0.9582 | 0.9523 | 0.737 | 1.2258 | 1.2104 | 1.2375 |
| pnasnet5large | 16 | 0.9996 | 0.9982 | 0.0 | 0.0 | 1.2101 | 1.1942 |
| twins_pcpvt_base | 64 | 1.0 | 0.9981 | 0.7489 | 1.0218 | 1.2084 | 1.1684 |
| gmlp_s16_224 | 128 | 1.0 | 0.9493 | 0.0 | 1.0772 | 1.2002 | 1.1894 |
| tinynet_a | 128 | 0.966 | 0.7753 | 0.6219 | 0.0 | 1.1899 | 1.194 |
| dpn107 | 32 | 0.9577 | 0.9506 | 0.7805 | 0.0 | 1.1877 | 1.1992 |
| pit_b_224 | 64 | 1.0003 | 0.9992 | 0.0 | 1.0508 | 1.1876 | 1.1775 |
| cait_m36_384 | 4 | 1.0001 | 1.0266 | 0.0 | 1.0929 | 1.1807 | 1.157 |
| repvgg_a2 | 128 | 0.964 | 0.9623 | 0.8285 | 1.1371 | 1.1713 | 1.1687 |
| tf_mixnet_l | 128 | 0.9856 | 0.8896 | 0.0 | 0.0 | 1.1693 | 1.167 |
| mobilevit_s | 64 | 0.9791 | 0.7621 | 0.0 | 0.0 | 1.1676 | 1.1689 |
| poolformer_m36 | 64 | 0.9998 | 0.9983 | 0.0 | 0.0 | 1.1668 | 1.1468 |
| mixnet_l | 128 | 0.9848 | 0.8855 | 0.0 | 0.0 | 1.1503 | 1.1485 |
| swin_base_patch4_window7_224 | 64 | 1.0002 | 0.9779 | 0.0 | 0.0 | 1.1363 | 1.1333 |
| beit_base_patch16_224 | 64 | 0.9997 | 0.9823 | 0.0 | 0.9404 | 1.1137 | 1.1025 |
| swsl_resnext101_32x16d | 32 | 0.9999 | 0.9995 | 0.0 | 0.0 | 1.1075 | 1.0713 |
| deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9984 | 0.7679 | 1.0025 | 1.0947 | 1.0821 |
| gluon_xception65 | 32 | 0.9999 | 0.997 | 0.0 | 0.0 | 1.0869 | 1.0755 |
| vit_base_patch16_224 | 64 | 1.0002 | 0.9981 | 0.7651 | 0.9715 | 1.0864 | 1.0709 |
| convmixer_768_32 | 32 | 0.9998 | 0.9998 | 0.0 | 0.0 | 1.0776 | 1.0742 |
| gernet_l | 128 | 0.9739 | 0.9725 | 0.8228 | 0.0 | 1.076 | 1.0708 |
| convnext_base | 64 | 0.9999 | 0.9984 | 0.0 | 1.2056 | 1.074 | 1.0694 |
| mixer_b16_224 | 128 | 1.0 | 0.9778 | 0.0 | 0.9032 | 1.0662 | 1.0611 |
| visformer_small | 128 | 0.9996 | 1.0017 | 0.798 | 0.0 | 1.0471 | 1.0124 |
| resmlp_12_224 | 128 | 0.9998 | 0.8547 | 0.612 | 1.0527 | 0.7921 | 0.8299 |
| tnt_s_patch16_224 | 128 | 1.0001 | 0.9993 | 0.0 | 0.0 | 0.0 | 1.5428 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy

~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 
| pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | pass | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | pass | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass 
| | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | jx_nest_base | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 5.6758 | 24.2259 | nan | nan | 97.9129 | 94.4011 | | swin_base_patch4_window7_224 | 64 | 2.5127 | 11.1487 | nan | nan | 74.4331 | 73.0858 | | mobilevit_s | 64 | 1.6771 | 5.9554 | nan | nan | 72.5534 | 70.5904 | | xcit_large_24_p8_224 | 5 | 2.5972 | 13.7943 | 26.3818 | nan | 72.1378 | 68.2345 | | pnasnet5large | 16 | 4.4234 | 18.195 | nan | nan | 70.2334 | 66.2853 | | twins_pcpvt_base | 64 | 2.2111 | 10.3305 | 18.7337 | 305.7279 | 61.7391 | 61.8269 | | cait_m36_384 | 4 | 2.6499 | 14.2511 | nan | 341.4789 | 60.2508 | 58.4439 | | convnext_base | 64 | 1.1844 | 5.1544 | nan | 114.4446 | 59.2765 | 58.0018 | | resnest101e | 64 | 3.1624 | 12.8703 | nan | nan | 55.191 | 53.935 | | jx_nest_base | 32 | 1.7818 | 7.4274 | 13.427 | nan | 53.2076 | 50.6647 | | res2net101_26w_4s | 64 | 2.981 | 13.2332 | 22.8468 | nan | 52.9602 | 48.8257 | | res2net50_14w_8s | 128 | 2.5697 | 12.0241 | nan | nan | 47.7386 | 44.5669 | | coat_lite_mini | 128 | 1.1213 | 4.1486 | 6.5904 | 85.8414 | 47.2758 | 47.0832 | | sebotnet33ts_256 | 64 | 1.5707 | 5.4185 | nan | nan | 46.5574 | 45.5205 | | eca_halonext26ts | 128 | 1.4776 | 4.5311 | nan | nan | 46.4726 | 45.7912 | | poolformer_m36 | 64 | 1.8539 | 7.0827 | nan | nan | 43.8955 | 43.9633 | | gmlp_s16_224 | 128 | 0.9854 | 5.1687 | nan | 119.5521 | 39.3374 | 37.3765 | | eca_botnext26ts_256 | 128 | 1.3412 | 4.4601 | nan | nan | 38.723 | 37.7321 | | dpn107 | 32 | 3.7593 | 11.499 | 35.9943 | nan | 37.6314 | 35.6378 | | fbnetv3_b | 128 | 2.9909 | 9.2712 | 25.5834 | nan | 37.0176 | 32.9729 | | crossvit_9_240 | 128 | 1.3783 | 6.3515 | 10.4875 | 151.9715 | 36.5609 | 34.4028 | | botnet26t_256 | 128 | 1.3216 | 3.6816 | 8.3269 | nan | 35.2519 | 35.0131 | | volo_d1_224 | 64 | 1.3955 | 6.0708 | 9.9995 | nan | 35.1971 | 32.658 | | gluon_xception65 | 32 | 1.7601 | 8.7117 | nan | nan | 34.1982 | 32.1738 | | 
adv_inception_v3 | 128 | 1.5995 | 6.837 | nan | nan | 32.8611 | 30.2135 | | inception_v3 | 128 | 1.5111 | 7.0072 | nan | nan | 31.8237 | 30.8443 | | gluon_inception_v3 | 128 | 1.502 | 6.873 | nan | nan | 31.6109 | 30.9949 | | ghostnet_100 | 128 | 2.6525 | 7.9379 | 12.7268 | nan | 31.1712 | 30.1409 | | tf_mixnet_l | 128 | 5.5719 | 11.2945 | nan | nan | 30.7085 | 29.4208 | | dla102 | 128 | 1.7017 | 7.6465 | nan | nan | 29.7943 | 28.6222 | | mixnet_l | 128 | 5.2959 | 10.8869 | nan | nan | 29.5401 | 28.9401 | | gmixer_24_224 | 128 | 1.0432 | 5.7894 | nan | 119.8018 | 29.2157 | 28.4857 | | swsl_resnext101_32x16d | 32 | 1.6284 | 7.482 | nan | nan | 28.6665 | 27.2489 | | dm_nfnet_f0 | 128 | 2.0469 | 6.6058 | nan | 131.7866 | 28.4855 | 27.7133 | | convit_base | 64 | 1.0715 | 4.7688 | nan | 99.9877 | 27.4687 | 26.505 | | res2next50 | 128 | 1.5744 | 6.7033 | nan | nan | 27.3919 | 25.8268 | | tinynet_a | 128 | 1.9908 | 6.5318 | 17.5592 | nan | 25.4807 | 24.1305 | | rexnet_100 | 128 | 1.8109 | 6.1602 | nan | nan | 25.4385 | 24.8808 | | tf_efficientnet_b0 | 128 | 1.7427 | 5.6416 | nan | nan | 22.605 | 21.0662 | | cspdarknet53 | 64 | 2.1923 | 6.3219 | 16.6104 | 111.791 | 22.3753 | 21.0157 | | resmlp_12_224 | 128 | 0.6079 | 2.4407 | 3.9406 | 29.6501 | 22.1366 | 20.9371 | | mixer_b16_224 | 128 | 0.6668 | 2.6879 | nan | 60.4097 | 22.033 | 20.1957 | | visformer_small | 128 | 0.927 | 3.4075 | 5.4637 | nan | 21.353 | 20.4765 | | nfnet_l0 | 128 | 1.7629 | 6.1974 | nan | 119.5436 | 21.0405 | 19.9757 | | convmixer_768_32 | 32 | 1.0919 | 4.9011 | nan | nan | 21.0102 | 19.7444 | | spnasnet_100 | 128 | 1.8763 | 5.353 | 14.956 | nan | 20.6172 | 19.5156 | | fbnetc_100 | 128 | 1.9551 | 5.5139 | 15.2878 | nan | 20.5856 | 19.8788 | | mobilenetv3_large_100 | 128 | 1.4853 | 4.5798 | 11.7251 | nan | 19.5646 | 18.995 | | beit_base_patch16_224 | 64 | 1.0998 | 4.2197 | nan | 76.8368 | 19.3602 | 18.6469 | | deit_base_distilled_patch16_224 | 64 | 0.8309 | 3.4963 | 5.8135 | 64.2137 | 19.3532 | 18.1886 | 
| mnasnet_100 | 128 | 1.5433 | 4.4024 | 11.6242 | nan | 18.6775 | 16.7912 | | vit_base_patch16_224 | 64 | 0.8307 | 3.4866 | 6.1571 | 62.9057 | 18.5383 | 17.9197 | | mobilenetv2_100 | 128 | 1.6797 | 4.5132 | 11.6029 | nan | 18.3415 | 17.3047 | | repvgg_a2 | 128 | 1.8844 | 5.2869 | 13.9984 | 216.8296 | 17.6041 | 16.9046 | | pit_b_224 | 64 | 0.9745 | 3.9159 | nan | 82.2597 | 17.3771 | 16.6796 | | gernet_l | 128 | 1.8878 | 5.0447 | 13.713 | nan | 17.0365 | 16.3314 | | regnety_002 | 128 | 1.5071 | 4.5384 | 11.3701 | nan | 16.9243 | 16.4111 | | selecsls42b | 128 | 0.8012 | 2.9711 | 4.8528 | nan | 15.2159 | 14.5701 | | lcnet_050 | 128 | 0.9774 | 2.8222 | 6.6583 | 67.5505 | 13.0877 | 12.0854 | | ese_vovnet19b_dw | 128 | 0.9845 | 2.526 | 5.9736 | nan | 12.4852 | 11.6486 | | tnt_s_patch16_224 | 128 | 1.546 | 8.1097 | nan | nan | nan | 31.7087 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | 0.9166 | 1.5552 | 1.6267 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2616 | nan | 1.351 | 1.5843 | | nfnet_l0 | 128 | 0.9931 | 0.8274 | nan | 0.8322 | 1.2911 | 1.4945 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | nan | 1.2619 | 1.4738 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | nan | 1.2059 | 1.3819 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1792 | 1.3591 | | pnasnet5large | 16 | 1.069 | 1.011 | nan | nan | 1.1771 | 1.3424 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | nan | 1.1752 | 1.2828 | | eca_botnext26ts_256 | 128 | 
0.9938 | 0.7674 | nan | nan | 1.1378 | 1.3608 | | eca_halonext26ts | 128 | 0.9938 | 0.7687 | nan | nan | 1.1376 | 1.3403 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | 0.933 | 1.1184 | 1.1751 | | poolformer_m36 | 64 | 0.9979 | 0.9511 | nan | nan | 1.0526 | 1.0689 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8935 | nan | 0.8897 | 1.0218 | 1.0961 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | 0.9286 | 1.0038 | 1.0607 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | nan | 1.0033 | 1.1036 | | vit_base_patch16_224 | 64 | 0.9962 | 0.9435 | 0.3153 | 0.9163 | 0.997 | 1.0835 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | nan | 0.9926 | 1.051 | | deit_base_distilled_patch16_224 | 64 | 0.9963 | 0.9441 | 0.3137 | 0.9167 | 0.9926 | 1.0799 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3131 | 0.8423 | 0.9924 | 1.0856 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | nan | 0.9853 | 1.1265 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9837 | 1.0658 | | mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | 0.8965 | 0.9827 | 1.0538 | | tf_mixnet_l | 128 | 0.9953 | 0.8572 | nan | nan | 0.9769 | 1.1451 | | gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | 0.9209 | 0.9766 | 0.9827 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | 0.3269 | nan | 0.9633 | 1.0572 | | dla102 | 128 | 0.9831 | 0.9169 | nan | nan | 0.9632 | 1.0419 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | nan | 0.952 | 1.0925 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | nan | 0.942 | 0.9938 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | nan | 0.9408 | 1.0412 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | nan | 0.9382 | 0.993 | | hrnet_w18 | 128 | 0.9954 | 0.9252 | nan | nan | 0.9379 | 1.0122 | | jx_nest_base | 32 | 1.0003 | 0.8968 | 0.2863 | nan | 0.9348 | 1.0604 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | nan | 0.9325 | 0.9919 | | res2net101_26w_4s | 64 | 0.9967 | 0.9277 | 0.3243 | nan | 0.9285 | 1.015 | | 
lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.7725 | 0.9152 | 0.9655 | | gluon_inception_v3 | 128 | 0.9902 | 0.8617 | nan | nan | 0.9138 | 1.0634 | | adv_inception_v3 | 128 | 0.9902 | 0.8617 | nan | nan | 0.9138 | 1.0635 | | inception_v3 | 128 | 0.9902 | 0.8617 | nan | nan | 0.9137 | 1.0634 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | 0.8692 | 0.9127 | 0.9981 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | nan | 0.9078 | 1.0156 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.9069 | 1.0515 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | nan | 0.9069 | 1.0618 | | dpn107 | 32 | 0.9985 | 0.9272 | 0.3392 | nan | 0.9059 | 0.9905 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8297 | 0.9052 | 1.0666 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | nan | 0.9049 | 0.9968 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.994 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | nan | 0.899 | 1.0046 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8973 | nan | nan | 0.8932 | 0.9946 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | nan | 0.8821 | 1.0206 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | nan | 0.8617 | 1.0396 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9622 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 0.7501 | 0.8563 | 1.0752 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7085 | nan | nan | 0.841 | 0.9709 | | coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | 0.7284 | 0.821 | 1.0246 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | nan | 0.7928 | 0.9926 | | resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | 0.6275 | 0.7899 | 0.7979 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.7257 | 0.7684 | 0.9902 | | convit_base | 64 | 0.9977 | 0.8838 | nan | 0.8762 | 0.7462 | 0.9008 | | crossvit_9_240 | 128 | 0.9884 | 0.8656 | 0.282 | 0.8418 | 0.6584 | 0.8853 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.8622 | 
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

../test-dynamo-runner-logs-4/huggingface_float32.png : ![](https://i.imgur.com/DlR5WRX.png)

../test-dynamo-runner-logs-4/timm_models_float32.png : ![](https://i.imgur.com/MZIrTis.png)

../test-dynamo-runner-logs-4/torchbench_float32.png : ![](https://i.imgur.com/ApYpemd.png)

williamwen42 commented 2 years ago

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
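As a rough illustration of the methodology above, the sketch below drops models that fail the accuracy check and then takes the geometric mean of the remaining speedups. The function name `geomean_speedup`, the input layout, the sample numbers, and the rule that any status starting with `pass` counts as passing are all assumptions made for illustration; none of them are taken from the benchmark runner itself.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of speedups over models that pass the accuracy check.

    speedups: dict of model name -> speedup relative to eager
    accuracy: dict of model name -> status string, e.g. 'pass' or 'fail_accuracy'
    """
    kept = [
        s
        for name, s in speedups.items()
        # Assumption: statuses such as 'pass_due_to_skip' count as passing,
        # and a speedup of 0.0 marks a run that produced no measurement.
        if accuracy.get(name, "").startswith("pass") and s > 0.0
    ]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


# Hypothetical models and values, for illustration only.
speedups = {"model_a": 2.0, "model_b": 0.5, "model_c": 3.0}
accuracy = {"model_a": "pass", "model_b": "pass", "model_c": "fail_accuracy"}

result = geomean_speedup(speedups, accuracy)
print(f"{result:.2f}x")  # model_c is excluded, leaving 2.0 and 0.5
```

A geometric mean is used here because speedups are ratios: a 2x gain and a 0.5x loss cancel out to 1.00x, whereas an arithmetic mean would report 1.25x.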

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 54/56 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 96%, 54/56 | 100%, 43/43 | 97%, 59/61  |
|     aot_cudagraphs     | 82%, 46/56 | 77%, 33/43  | 44%, 27/61  |
|    nvprims_nvfuser     | 80%, 45/56 | 60%, 26/43  | 67%, 41/61  |
|        inductor        | 84%, 47/56 | 79%, 34/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 91%, 51/56 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
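The entries in the passrate table pair a percentage with a passed/total count. A minimal sketch of that formatting follows, assuming the percentage is rounded to the nearest whole number (which appears consistent with the rows above); `format_passrate` is a hypothetical helper, not a function from the runner.

```python
def format_passrate(passed, total):
    """Format a pass rate in the 'PCT%, passed/total' style used in the table."""
    return f"{passed / total:.0%}, {passed}/{total}"


# Counts taken from the torchbench and timm_models columns above.
print(format_passrate(54, 56))
print(format_passrate(27, 61))
```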

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.02x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.11x    |    1.04x    |    1.00x    |
|    nvprims_nvfuser     |   1.04x    |    1.03x    |    1.13x    |
|        inductor        |   1.45x    |    1.29x    |    1.21x    |
| inductor_no_cudagraphs |   1.21x    |    1.18x    |    1.20x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.18    |    2.43     |    1.90     |
|       aot_eager        |    5.79    |    7.70     |    7.05     |
|     aot_cudagraphs     |    8.75    |    15.76    |    13.49    |
|    nvprims_nvfuser     |   68.19    |   105.96    |   149.34    |
|        inductor        |   42.33    |    33.17    |    46.14    |
| inductor_no_cudagraphs |   40.81    |    26.47    |    44.67    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    1.00x    |    0.99x    |
|       aot_eager        |   0.87x    |    0.91x    |    0.87x    |
|     aot_cudagraphs     |   0.39x    |    0.36x    |    0.31x    |
|    nvprims_nvfuser     |   0.91x    |    1.00x    |    0.94x    |
|        inductor        |   0.83x    |    0.66x    |    0.97x    |
| inductor_no_cudagraphs |   0.96x    |    0.88x    |    1.08x    |
+------------------------+------------+-------------+-------------+
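The report does not spell out how the compression ratio is computed. One reading that matches "higher is better" is eager peak memory divided by the backend's peak memory, so that values above 1.0 mean the backend needs less memory than eager. The sketch below works under that assumption; the helper name and the sample readings are hypothetical.

```python
def compression_ratio(eager_peak, backend_peak):
    """Peak-memory compression ratio, ASSUMING ratio = eager peak / backend peak."""
    if backend_peak <= 0:
        return float("nan")
    return eager_peak / backend_peak


# Hypothetical peak-memory readings in GiB.
eager_peak = 10.0
for backend, peak in {"backend_x": 8.0, "backend_y": 12.5}.items():
    ratio = compression_ratio(eager_peak, peak)
    verdict = "less" if ratio > 1.0 else "more"
    print(f"{backend}: {ratio:.2f}x (uses {verdict} memory than eager)")
```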

Warnings

We flag models where:
- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings
~~~
+-------------+-----------------------------+----------+------------------------+
| suite       | name                        | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------+----------+------------------------+
| torchbench  | lennard_jones               | 1.7979   | 0.938                  |
| torchbench  | soft_actor_critic           | 1.433    | 0.9407                 |
| torchbench  | nvidia_deeprecommender      | 0.9039   | 0.9639                 |
| torchbench  | dlrm                        | 0.0      | 1.0262                 |
| torchbench  | hf_GPT2_large               | 0.0      | 1.4738                 |
| torchbench  | hf_T5                       | 0.0      | 1.5418                 |
| torchbench  | tacotron2                   | 0.0      | 0.9121                 |
| torchbench  | functorch_dp_cifar10        | 0.0      | 0.0                    |
| torchbench  | hf_Longformer               | 0.0      | 0.0                    |
| torchbench  | moco                        | 0.0      | 0.0                    |
| huggingface | PegasusForCausalLM          | 0.947    | 0.9559                 |
| huggingface | BlenderbotSmallForCausalLM  | 0.9317   | 0.9605                 |
| huggingface | LayoutLMForMaskedLM         | 0.0      | 1.1632                 |
| huggingface | RobertaForQuestionAnswering | 0.0      | 1.0593                 |
| huggingface | BertForQuestionAnswering    | 0.0      | 1.0604                 |
| huggingface | ElectraForCausalLM          | 0.0      | 1.382                  |
| huggingface | AlbertForQuestionAnswering  | 0.0      | 1.243                  |
| huggingface | AlbertForMaskedLM           | 0.0      | 1.2409                 |
| huggingface | AllenaiLongformerBase       | 0.0      | 0.0                    |
| timm_models | resmlp_12_224               | 0.8056   | 0.8487                 |
| timm_models | tnt_s_patch16_224           | 0.0      | 1.4909                 |
+-------------+-----------------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-------------------+----------+------------------------+
| suite       | name              | inductor | inductor_no_cudagraphs |
+-------------+-------------------+----------+------------------------+
| torchbench  | yolov3            | 381.5792 | 380.389                |
| torchbench  | densenet121       | 220.1328 | 225.51                 |
| torchbench  | timm_efficientdet | 174.1247 | 172.4918               |
| timm_models | hrnet_w18         | 122.4221 | 116.8456               |
+-------------+-------------------+----------+------------------------+
~~~
Peak Memory Compression Ratio warnings
~~~
+-------------+-----------------------------------------+----------+------------------------+
| suite       | name                                    | inductor | inductor_no_cudagraphs |
+-------------+-----------------------------------------+----------+------------------------+
| torchbench  | timm_resnest                            | 0.8982   | 1.0013                 |
| torchbench  | squeezenet1_1                           | 0.8931   | 0.9113                 |
| torchbench  | mobilenet_v3_large                      | 0.8834   | 0.902                  |
| torchbench  | hf_T5_large                             | 0.8736   | 0.922                  |
| torchbench  | speech_transformer                      | 0.8727   | 0.8767                 |
| torchbench  | timm_vision_transformer_large           | 0.8604   | 1.0311                 |
| torchbench  | resnet50                                | 0.8573   | 0.9239                 |
| torchbench  | mnasnet1_0                              | 0.8542   | 0.8748                 |
| torchbench  | resnext50_32x4d                         | 0.83     | 0.8374                 |
| torchbench  | hf_BigBird                              | 0.8211   | 1.0381                 |
| torchbench  | hf_Albert                               | 0.7812   | 1.2212                 |
| torchbench  | dcgan                                   | 0.7633   | 0.8875                 |
| torchbench  | drq                                     | 0.7632   | 0.8778                 |
| torchbench  | timm_vovnet                             | 0.7618   | 0.9529                 |
| torchbench  | hf_Bart                                 | 0.7542   | 1.0064                 |
| torchbench  | timm_vision_transformer                 | 0.7519   | 0.8216                 |
| torchbench  | soft_actor_critic                       | 0.75     | 0.9991                 |
| torchbench  | alexnet                                 | 0.743    | 0.8332                 |
| torchbench  | fastNLP_Bert                            | 0.7406   | 1.1229                 |
| torchbench  | densenet121                             | 0.7213   | 0.7236                 |
| torchbench  | BERT_pytorch                            | 0.7067   | 0.9033                 |
| torchbench  | resnet18                                | 0.6901   | 0.7                    |
| torchbench  | LearningToPaint                         | 0.6798   | 0.7079                 |
| torchbench  | vgg16                                   | 0.6637   | 0.9554                 |
| torchbench  | hf_Bert                                 | 0.6432   | 0.8995                 |
| torchbench  | hf_DistilBert                           | 0.613    | 0.8537                 |
| torchbench  | lennard_jones                           | 0.5646   | 0.9989                 |
| torchbench  | nvidia_deeprecommender                  | 0.5598   | 0.5598                 |
| torchbench  | hf_Reformer                             | 0.5232   | 0.9892                 |
| torchbench  | attention_is_all_you_need_pytorch       | 0.4429   | 0.5961                 |
| torchbench  | pytorch_struct                          | 0.4222   | 0.4335                 |
| torchbench  | dlrm                                    | nan      | 0.7306                 |
| huggingface | T5Small                                 | 0.8453   | 1.0257                 |
| huggingface | T5ForConditionalGeneration              | 0.8215   | 1.0502                 |
| huggingface | BigBird                                 | 0.8186   | 1.0078                 |
| huggingface | XGLMForCausalLM                         | 0.8157   | 0.8962                 |
| huggingface | DistillGPT2                             | 0.8057   | 0.9257                 |
| huggingface | YituTechConvBert                        | 0.791    | 0.8725                 |
| huggingface | PegasusForConditionalGeneration         | 0.7893   | 0.9466                 |
| huggingface | GoogleFnet                              | 0.7676   | 0.9364                 |
| huggingface | M2M100ForConditionalGeneration          | 0.7661   | 0.9595                 |
| huggingface | MT5ForConditionalGeneration             | 0.7625   | 0.8497                 |
| huggingface | CamemBert                               | 0.7151   | 0.8696                 |
| huggingface | BartForConditionalGeneration            | 0.6979   | 0.8969                 |
| huggingface | PLBartForCausalLM                       | 0.6852   | 0.806                  |
| huggingface | PegasusForCausalLM                      | 0.6791   | 0.8947                 |
| huggingface | BlenderbotSmallForCausalLM              | 0.6618   | 0.7576                 |
| huggingface | PLBartForConditionalGeneration          | 0.6553   | 0.8256                 |
| huggingface | MegatronBertForQuestionAnswering        | 0.6467   | 0.797                  |
| huggingface | OPTForCausalLM                          | 0.6407   | 0.8248                 |
| huggingface | BartForCausalLM                         | 0.6359   | 0.8919                 |
| huggingface | MBartForConditionalGeneration           | 0.63     | 0.7668                 |
| huggingface | MegatronBertForCausalLM                 | 0.6276   | 0.7821                 |
| huggingface | LayoutLMForSequenceClassification       | 0.6247   | 0.9889                 |
| huggingface | BlenderbotSmallForConditionalGeneration | 0.6148   | 0.8546                 |
| huggingface | TrOCRForCausalLM                        | 0.6094   | 0.7677                 |
| huggingface | MBartForCausalLM                        | 0.6078   | 0.7715                 |
| huggingface | ElectraForQuestionAnswering             | 0.6054   | 0.9848                 |
| huggingface | DistilBertForMaskedLM                   | 0.6017   | 0.8152                 |
| huggingface | DistilBertForQuestionAnswering          | 0.595    | 0.7558                 |
| huggingface | Speech2Text2ForCausalLM                 | 0.5787   | 0.8128                 |
| huggingface | BertForMaskedLM                         | 0.5613   | 0.7534                 |
| huggingface | RobertaForCausalLM                      | 0.5604   | 0.7519                 |
| huggingface | MobileBertForMaskedLM                   | 0.4624   | 0.6                    |
| huggingface | DebertaForMaskedLM                      | 0.386    | 0.9713                 |
| huggingface | MobileBertForQuestionAnswering          | 0.3725   | 0.4638                 |
| huggingface | DebertaForQuestionAnswering             | 0.2902   | 1.1588                 |
| huggingface | ElectraForCausalLM                      | nan      | 0.8074                 |
| huggingface | BertForQuestionAnswering                | nan      | 0.6814                 |
| huggingface | RobertaForQuestionAnswering             | nan      | 0.6814                 |
| timm_models | swsl_resnext101_32x16d                  | 0.8933   | 0.9945                 |
| timm_models | lcnet_050                               | 0.8842   | 0.9126                 |
| timm_models | res2net50_14w_8s                        | 0.8824   | 1.0114                 |
| timm_models | regnety_002                             | 0.8622   | 1.0414                 |
| timm_models | botnet26t_256                           | 0.8605   | 0.9611                 |
| timm_models | swin_base_patch4_window7_224            | 0.8514   | 1.0359                 |
| timm_models | sebotnet33ts_256                        | 0.8365   | 0.9651                 |
| timm_models | pit_b_224                               | 0.8169   | 1.0651                 |
| timm_models | gernet_l                                | 0.7928   | 0.9925                 |
| timm_models | resmlp_12_224                           | 0.7768   | 0.7845                 |
| timm_models | coat_lite_mini                          | 0.7193   | 1.0063                 |
| timm_models | convit_base                             | 0.6848   | 0.8081                 |
| timm_models | crossvit_9_240                          | 0.579    | 0.7469                 |
| timm_models | repvgg_a2                               | 0.5321   | 0.8171                 |
| timm_models | tnt_s_patch16_224                       | nan      | 0.7096                 |
+-------------+-----------------------------------------+----------+------------------------+
~~~
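The three warning criteria listed at the top of this section can be sketched as a small threshold check. The thresholds are the ones quoted in the report; the function name `flag_metrics`, the metric keys, and the dictionary layout are hypothetical and chosen only for this example.

```python
# Thresholds quoted from the warning criteria in this section.
SPEEDUP_MIN = 0.95        # flag when speedup < 0.95x
LATENCY_MAX_SEC = 120.0   # flag when compilation latency > 120 sec
COMPRESSION_MIN = 0.9     # flag when compression ratio < 0.9


def flag_metrics(metrics):
    """Return the names of the metrics in `metrics` that cross a threshold.

    `metrics` maps a metric name to a float; metrics that are absent are skipped.
    """
    flags = []
    speedup = metrics.get("speedup")
    latency = metrics.get("compilation_latency")
    compression = metrics.get("compression_ratio")
    if speedup is not None and speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if latency is not None and latency > LATENCY_MAX_SEC:
        flags.append("compilation_latency")
    if compression is not None and compression < COMPRESSION_MIN:
        flags.append("compression_ratio")
    return flags


# inductor numbers copied from the timm_models rows of the warning tables.
rows = {
    "resmlp_12_224": {"speedup": 0.8056, "compression_ratio": 0.7768},
    "hrnet_w18": {"compilation_latency": 122.4221},
    "res2net50_14w_8s": {"compression_ratio": 0.8824},
}
for name, metrics in rows.items():
    print(name, flag_metrics(metrics))
```

Because comparisons against NaN are always false in Python, a `nan` entry would not be flagged by this sketch; how the real report treats such entries is not stated here.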

Metrics over time

../test-dynamo-runner-logs-10/passrate_over_time.png : ![](https://i.imgur.com/41CGVeZ.png)

../test-dynamo-runner-logs-10/geomean_over_time.png : ![](https://i.imgur.com/Mol0wlp.png)

Accuracy Regressions

For each relevant compiler, we compare the 2 most recent reports (that actually ran the compiler) to find models where previously successful accuracy tests now fail.

No accuracy regressions found.
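The comparison described here can be sketched as a diff between two accuracy maps. This is a minimal illustration, not the dashboard's own code: `find_accuracy_regressions`, the input layout, and the sample model names are hypothetical, and treating any status that starts with `pass` as a success is an assumption.

```python
def find_accuracy_regressions(previous, current):
    """Models that passed in the previous report but do not pass in the current one.

    Both arguments map model name -> accuracy status string.
    """
    regressions = []
    for model, old_status in previous.items():
        new_status = current.get(model)
        if new_status is None:
            continue  # model is absent from the newer report, so nothing to compare
        # Assumption: 'pass' and 'pass_due_to_skip' both count as success.
        if old_status.startswith("pass") and not new_status.startswith("pass"):
            regressions.append((model, old_status, new_status))
    return sorted(regressions)


# Hypothetical statuses for one compiler in two consecutive reports.
previous = {"model_a": "pass", "model_b": "pass", "model_c": "fail_to_run"}
current = {"model_a": "pass", "model_b": "fail_accuracy", "model_c": "fail_to_run"}

found = find_accuracy_regressions(previous, current)
print(found if found else "No accuracy regressions found.")
```

A model that was already failing (`model_c` here) is not reported, since only pass-to-fail transitions count as regressions.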

torchbench suite with float32 precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0034 | 1.0103 | 2.2543 | 0.7766 | 5.1745 | 1.2527 | | timm_efficientdet | 1 | 0.9835 | 0.8909 | 1.7521 | 0.7524 | 4.114 | 1.5087 | | timm_vision_transformer | 8 | 1.0092 | 0.9287 | 1.608 | 0.6875 | 2.5316 | 1.4007 | | drq | 1 | 1.0172 | 0.8738 | 1.7099 | 0.7494 | 2.4531 | 1.061 | | BERT_pytorch | 16 | 1.0111 | 0.8916 | 1.1172 | 0.9867 | 2.0244 | 1.9673 | | resnext50_32x4d | 8 | 0.9992 | 1.1109 | 1.1991 | 0.803 | 1.9921 | 1.1882 | | mobilenet_v3_large | 32 | 1.007 | 1.1139 | 1.0545 | 0.8518 | 1.9793 | 1.3179 | | resnet18 | 16 | 1.0025 | 1.1216 | 1.1901 | 0.8876 | 1.9281 | 1.2411 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9951 | 1.0103 | 1.3626 | 0.8518 | 1.8356 | 1.4935 | | pytorch_struct | 200 | 0.9882 | 0.7443 | 0.8921 | 0.7831 | 1.8189 | 1.1447 | | lennard_jones | 1000 | 0.9606 | 0.8635 | 1.065 | 0.6828 | 1.7979 | 0.938 | | squeezenet1_1 | 32 | 1.001 | 1.0147 | 1.0601 | 0.8746 | 1.7083 | 1.2377 | | dcgan | 32 | 0.9887 | 1.0291 | 1.282 | 0.808 | 1.6563 | 1.0485 | | hf_T5_large | 2 | 1.0219 | 0.8835 | 0.0 | 0.0 | 1.6172 | 1.6342 | | hf_Albert | 8 | 1.0016 | 0.9965 | 0.751 | 1.556 | 1.6095 | 1.6043 | | shufflenet_v2_x1_0 | 128 | 0.9997 | 1.0484 | 0.817 | 0.9075 | 1.5732 | 1.3943 | | speech_transformer | 32 | 1.0044 | 0.9412 | 1.5497 | 0.8193 | 1.5397 | 1.5331 | | mnasnet1_0 | 32 | 0.9979 | 1.0984 | 0.8721 | 0.9118 | 1.507 | 1.2612 | | hf_GPT2 | 4 | 1.0076 | 0.9791 | 0.7395 | 0.3827 | 1.5043 | 1.4947 | | timm_resnest | 32 | 0.999 | 1.0031 | 0.8062 | 1.166 | 1.4924 | 1.4288 | | timm_nfnet | 128 | 0.9996 | 1.0001 | 0.0 | 1.1344 | 1.4334 | 1.3801 | | 
soft_actor_critic | 256 | 0.9997 | 0.8097 | 1.0428 | 0.6938 | 1.433 | 0.9407 | | mobilenet_v2_quantized_qat | 96 | 1.0014 | 0.9815 | 0.0 | 1.4601 | 1.392 | 1.3919 | | fastNLP_Bert | 6 | 0.9991 | 0.9767 | 0.7518 | 1.2015 | 1.3843 | 1.3586 | | mobilenet_v2 | 96 | 0.9998 | 0.9987 | 0.7308 | 1.3366 | 1.3605 | 1.3361 | | resnet50_quantized_qat | 32 | 1.0003 | 0.971 | 0.0 | 1.154 | 1.3546 | 1.3578 | | timm_efficientnet | 32 | 0.9586 | 0.8127 | 0.696 | 0.8204 | 1.3386 | 1.1873 | | LearningToPaint | 96 | 1.0023 | 1.0563 | 0.869 | 0.986 | 1.2829 | 1.1973 | | pytorch_stargan | 16 | 0.9996 | 1.0753 | 0.9326 | 0.0 | 1.2384 | 1.1985 | | resnet152 | 32 | 1.0025 | 1.0581 | 0.8027 | 0.9154 | 1.2088 | 1.1668 | | hf_Bert | 4 | 1.037 | 0.9974 | 0.7437 | 0.7775 | 1.1759 | 1.1566 | | resnet50 | 32 | 0.9992 | 0.992 | 0.7637 | 0.9784 | 1.1672 | 1.1337 | | hf_Bart | 4 | 1.0126 | 0.974 | 0.7518 | 0.915 | 1.1647 | 1.1523 | | pytorch_unet | 1 | 0.9998 | 0.9981 | 0.8462 | 1.0902 | 1.1629 | 1.1509 | | hf_DistilBert | 8 | 1.0007 | 0.9558 | 0.6869 | 0.5303 | 1.1527 | 1.1554 | | vgg16 | 64 | 0.9999 | 0.9991 | 0.8587 | 0.9977 | 1.1454 | 1.1402 | | alexnet | 128 | 1.0001 | 0.9976 | 0.8037 | 1.0035 | 1.1394 | 1.1429 | | hf_Reformer | 4 | 0.998 | 1.0016 | 0.9882 | 0.7361 | 1.1279 | 1.1374 | | Super_SloMo | 6 | 1.0004 | 0.9964 | 0.8671 | 0.9926 | 1.1093 | 1.0942 | | timm_regnet | 32 | 0.9664 | 0.9624 | 0.7815 | 1.0485 | 1.1053 | 1.0688 | | hf_BigBird | 2 | 0.9899 | 0.939 | 0.9418 | 0.9063 | 1.0943 | 1.0013 | | Background_Matting | 4 | 1.0003 | 1.0225 | 0.8687 | 1.0792 | 1.0867 | 1.0784 | | yolov3 | 16 | 0.9998 | 0.9945 | 0.7904 | 1.1522 | 1.0623 | 1.0489 | | timm_vision_transformer_large | 8 | 1.0001 | 0.9951 | 0.0 | 0.0 | 1.0279 | 1.0133 | | attention_is_all_you_need_pytorch | 256 | 1.0002 | 0.968 | 0.7555 | 0.953 | 1.021 | 1.008 | | tts_angular | 64 | 0.9865 | 0.9616 | 0.984 | 0.9592 | 1.0085 | 1.0322 | | demucs | 4 | 0.9996 | 1.0003 | 1.0002 | 0.9999 | 0.9995 | 0.9998 | | timm_vovnet | 32 | 0.9092 
| 0.9037 | 0.7137 | 0.8968 | 0.9682 | 1.0192 | | nvidia_deeprecommender | 256 | 0.9995 | 0.963 | 0.5845 | 0.9762 | 0.9039 | 0.9639 | | dlrm | 2048 | 0.0 | 1.0818 | 0.0 | 0.0 | 0.0 | 1.0262 | | hf_GPT2_large | 4 | 0.9993 | 0.9796 | 0.0 | 0.0 | 0.0 | 1.4738 | | hf_T5 | 8 | 0.9994 | 0.9522 | 0.0 | 1.2231 | 0.0 | 1.5418 | | tacotron2 | 64 | 0.9826 | 0.8497 | 0.0 | 0.7589 | 0.0 | 0.9121 | | functorch_dp_cifar10 | 64 | 1.0045 | 1.025 | 2.1325 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 2 | 0.9544 | 0.8846 | 0.8089 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet152 | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass 
| pass | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | tts_angular | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | dlrm | 2 | pass | pass | fail_to_run | pass | pass | pass | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | mobilenet_v3_large | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | 
hf_Albert | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | hf_Bart | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | hf_Bert | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | pass | fail_to_run | pass | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | resnet50_quantized_qat | 2 | pass | pass | fail_to_run | pass | fail_accuracy | fail_accuracy | | mobilenet_v2_quantized_qat | 2 | pass | fail_accuracy | fail_to_run | fail_accuracy | fail_accuracy | fail_accuracy | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.8615 | 7.1037 | 10.1282 | 114.5949 | 381.5792 | 380.389 | | densenet121 | 4 | 2.207 | 9.5954 | 
15.736 | 169.9459 | 220.1328 | 225.51 | | timm_efficientdet | 1 | 19.6713 | 32.5827 | 68.3917 | 487.0413 | 174.1247 | 172.4918 | | hf_T5_large | 2 | 13.9787 | 34.7788 | nan | nan | 107.9363 | 105.7776 | | mobilenet_v3_large | 32 | 0.859 | 3.7618 | 5.8253 | 99.5179 | 77.0062 | 77.2105 | | timm_efficientnet | 32 | 1.6968 | 5.4904 | 13.6965 | 107.0236 | 71.9889 | 69.9666 | | mnasnet1_0 | 32 | 0.7655 | 3.4608 | 5.2654 | 72.8695 | 65.8197 | 65.3928 | | resnet152 | 32 | 2.2326 | 10.5466 | 17.5922 | 187.9625 | 57.4004 | 56.1098 | | timm_vision_transformer_large | 8 | 2.4621 | 11.0303 | nan | nan | 52.2535 | 50.3743 | | attention_is_all_you_need_pytorch | 256 | 1.1204 | 5.6824 | 8.9891 | 135.4939 | 48.3539 | 47.4075 | | timm_resnest | 32 | 0.5614 | 2.0289 | 3.0408 | 60.2911 | 48.1243 | 48.5556 | | mobilenet_v2 | 96 | 0.7322 | 3.631 | 5.8725 | 95.3635 | 44.8839 | 43.784 | | resnext50_32x4d | 8 | 0.8582 | 3.6011 | 5.4643 | 66.1358 | 42.9859 | 42.8601 | | hf_BigBird | 2 | 7.5769 | 12.9047 | 25.1926 | 94.4403 | 38.9475 | 25.4375 | | timm_nfnet | 128 | 1.9778 | 6.143 | nan | 156.4433 | 37.2983 | 37.4964 | | timm_regnet | 32 | 2.2718 | 6.4701 | 17.4033 | 112.6871 | 34.0088 | 32.9632 | | speech_transformer | 32 | 1.6851 | 6.6232 | 25.9671 | 149.9582 | 33.4517 | 32.796 | | resnet50 | 32 | 0.8116 | 3.7146 | 5.5251 | 80.2617 | 32.7314 | 32.2849 | | timm_vision_transformer | 8 | 0.7809 | 3.3375 | 5.1937 | 72.2268 | 31.5882 | 31.47 | | resnet50_quantized_qat | 32 | 1.1143 | 7.1755 | nan | 171.1764 | 30.9261 | 30.7577 | | hf_Bart | 4 | 1.5868 | 6.633 | 10.3413 | 142.1477 | 30.1735 | 29.1572 | | BERT_pytorch | 16 | 1.4885 | 5.8264 | 8.9072 | 103.9845 | 29.4993 | 28.9117 | | fastNLP_Bert | 6 | 1.4772 | 5.3583 | 9.1292 | 97.5961 | 29.3183 | 27.3016 | | shufflenet_v2_x1_0 | 128 | 0.9118 | 4.0386 | 6.2543 | 91.2968 | 28.1045 | 27.2492 | | mobilenet_v2_quantized_qat | 96 | 1.2512 | 7.3048 | nan | 184.4399 | 27.7401 | 27.7527 | | pytorch_stargan | 16 | 0.3929 | 1.6936 | 2.545 | nan | 27.4244 
| 22.3643 | | hf_Bert | 4 | 1.5098 | 5.3054 | 7.7941 | 100.8382 | 20.5858 | 19.938 | | hf_Reformer | 4 | 1.7163 | 2.9454 | 5.4898 | 15.9518 | 19.6649 | 15.9316 | | squeezenet1_1 | 32 | 0.24 | 0.6372 | 1.0892 | 4.5874 | 19.0315 | 19.5053 | | pytorch_struct | 200 | 0.2408 | 0.6286 | 1.1644 | 4.5881 | 18.8175 | 18.4758 | | hf_Albert | 8 | 1.1857 | 4.6165 | 7.4435 | 112.8366 | 18.7934 | 17.9094 | | timm_vovnet | 32 | 1.4808 | 3.7301 | 8.8841 | 58.2484 | 17.8465 | 17.298 | | Background_Matting | 4 | 0.6313 | 3.3843 | 5.114 | 70.8082 | 16.988 | 15.8652 | | hf_GPT2 | 4 | 1.4345 | 5.3286 | 7.6791 | 83.5561 | 16.9073 | 16.2582 | | resnet18 | 16 | 0.3957 | 1.4502 | 2.1446 | 30.1423 | 16.4725 | 16.0141 | | Super_SloMo | 6 | 0.8428 | 3.4613 | 4.7359 | 27.819 | 14.7767 | 14.4067 | | hf_DistilBert | 8 | 0.5942 | 2.6217 | 4.8021 | 46.6028 | 12.8265 | 12.4083 | | LearningToPaint | 96 | 0.41 | 1.5175 | 2.3654 | 39.7655 | 11.5059 | 11.671 | | pytorch_unet | 1 | 0.3591 | 1.395 | 2.1495 | 30.8384 | 8.3431 | 8.0649 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.3631 | 1.5792 | 2.3379 | 33.4569 | 8.033 | 7.8091 | | drq | 1 | 0.2913 | 0.4913 | 0.8427 | 4.5165 | 3.7102 | 3.1808 | | vgg16 | 64 | 0.1666 | 0.4521 | 0.8391 | 2.8629 | 3.5365 | 3.3675 | | soft_actor_critic | 256 | 0.1986 | 0.2991 | 0.4937 | 1.6 | 3.3884 | 2.6819 | | nvidia_deeprecommender | 256 | 0.1974 | 0.3818 | 0.6382 | 4.5543 | 3.2374 | 3.0681 | | alexnet | 128 | 0.1557 | 0.3154 | 0.5925 | 3.1259 | 2.9831 | 2.7305 | | dcgan | 32 | 0.1672 | 0.3594 | 0.5688 | 4.451 | 2.6603 | 2.4572 | | lennard_jones | 1000 | 0.1394 | 0.2511 | 0.3926 | 1.3918 | 1.9507 | 1.7885 | | tts_angular | 64 | 0.2088 | 0.2535 | 0.3788 | 1.1001 | 1.8957 | 1.7633 | | demucs | 4 | 0.2978 | 0.2946 | 0.3028 | 0.2986 | 0.2053 | 0.2031 | | tacotron2 | 64 | 17.2102 | 28.8807 | nan | 64.3264 | nan | 63.2971 | | hf_GPT2_large | 4 | 5.163 | 15.9791 | nan | nan | nan | 41.7674 | | hf_T5 | 8 | 2.4042 | 7.7559 | nan | 88.6347 | nan | 27.2998 | | dlrm | 2048 | nan | 0.7149 
| nan | nan | nan | 2.8923 | | hf_Longformer | 2 | 6.0557 | 13.1005 | 55.6715 | nan | nan | nan | | functorch_dp_cifar10 | 64 | 0.3478 | 1.342 | 2.0427 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | mobilenet_v2_quantized_qat | 96 | 0.9957 | 0.8276 | nan | 1.1946 | 1.5829 | 1.5829 | | resnet50_quantized_qat | 32 | 0.9967 | 0.9152 | nan | 1.226 | 1.4894 | 1.4897 | | timm_efficientnet | 32 | 0.9937 | 0.7666 | 0.2634 | 0.988 | 1.3109 | 1.3998 | | Super_SloMo | 6 | 1.0024 | 0.9527 | 0.363 | 0.9891 | 1.2025 | 1.3825 | | mobilenet_v2 | 96 | 0.9928 | 0.7624 | 0.3062 | 0.9872 | 1.1753 | 1.2588 | | timm_efficientdet | 1 | 1.011 | 0.823 | 0.2888 | 1.1341 | 1.1191 | 1.1506 | | timm_nfnet | 128 | 0.936 | 0.8937 | nan | 0.7594 | 1.0224 | 1.0903 | | demucs | 4 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | 0.9886 | | Background_Matting | 4 | 0.9998 | 0.9492 | 0.3596 | 0.9682 | 0.9832 | 1.0227 | | tts_angular | 64 | 0.9884 | 0.9884 | 0.9829 | 0.9884 | 0.983 | 0.9884 | | shufflenet_v2_x1_0 | 128 | 0.9739 | 0.8944 | 0.3499 | 0.8683 | 0.9815 | 1.0321 | | hf_GPT2 | 4 | 0.9548 | 0.906 | 0.3702 | 1.1241 | 0.9703 | 1.1374 | | timm_regnet | 32 | 0.9985 | 0.8614 | 0.3327 | 0.8784 | 0.9406 | 1.0801 | | yolov3 | 16 | 0.9957 | 0.844 | 0.334 | 0.8549 | 0.9237 | 1.1004 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9981 | 0.9166 | 0.3917 | 0.8956 | 0.9178 | 0.9981 | | pytorch_unet | 1 | 0.9985 | 0.8521 | 0.3441 | 0.8497 | 0.9116 | 
0.9476 | | resnet152 | 32 | 0.9975 | 0.9157 | 0.3424 | 0.8735 | 0.9072 | 0.9616 | | pytorch_stargan | 16 | 0.9975 | 1.0179 | 0.4129 | nan | 0.9023 | 1.0693 | | timm_resnest | 32 | 0.9931 | 0.88 | 0.3236 | 0.7927 | 0.8982 | 1.0013 | | squeezenet1_1 | 32 | 0.9749 | 0.8159 | 0.3373 | 0.9761 | 0.8931 | 0.9113 | | mobilenet_v3_large | 32 | 0.9878 | 0.8563 | 0.3277 | 0.8098 | 0.8834 | 0.902 | | hf_T5_large | 2 | 0.922 | 0.8673 | nan | nan | 0.8736 | 0.922 | | speech_transformer | 32 | 0.9977 | 0.9148 | 0.2709 | 1.021 | 0.8727 | 0.8767 | | timm_vision_transformer_large | 8 | 0.9998 | 0.8416 | nan | nan | 0.8604 | 1.0311 | | resnet50 | 32 | 0.9942 | 0.8719 | 0.3368 | 0.7968 | 0.8573 | 0.9239 | | mnasnet1_0 | 32 | 0.9869 | 0.8985 | 0.333 | 0.8259 | 0.8542 | 0.8748 | | resnext50_32x4d | 8 | 0.9952 | 0.8668 | 0.3592 | 0.8203 | 0.83 | 0.8374 | | hf_BigBird | 2 | 0.9608 | 0.9608 | 0.4299 | 1.1745 | 0.8211 | 1.0381 | | hf_Albert | 8 | 0.9332 | 0.9332 | 0.2846 | 1.0621 | 0.7812 | 1.2212 | | dcgan | 32 | 0.9754 | 0.7634 | 0.4581 | 0.7634 | 0.7633 | 0.8875 | | drq | 1 | 0.987 | 0.8777 | 0.4252 | 0.8777 | 0.7632 | 0.8778 | | timm_vovnet | 32 | 0.9933 | 0.7603 | 0.3202 | 0.7737 | 0.7618 | 0.9529 | | hf_Bart | 4 | 0.9617 | 0.8774 | 0.3384 | 1.0863 | 0.7542 | 1.0064 | | timm_vision_transformer | 8 | 0.9943 | 0.8835 | 0.3312 | 1.0642 | 0.7519 | 0.8216 | | soft_actor_critic | 256 | 0.9997 | 0.9637 | 0.4355 | 0.9636 | 0.75 | 0.9991 | | alexnet | 128 | 0.9542 | 0.745 | 0.4163 | 0.7457 | 0.743 | 0.8332 | | fastNLP_Bert | 6 | 1.0011 | 0.9152 | 0.3384 | 1.2131 | 0.7406 | 1.1229 | | densenet121 | 4 | 0.9904 | 0.8812 | 0.3439 | 0.8558 | 0.7213 | 0.7236 | | BERT_pytorch | 16 | 1.0 | 0.898 | 0.3504 | 1.125 | 0.7067 | 0.9033 | | resnet18 | 16 | 0.9831 | 0.7792 | 0.3589 | 0.6948 | 0.6901 | 0.7 | | LearningToPaint | 96 | 0.9452 | 0.6912 | 0.3387 | 0.6507 | 0.6798 | 0.7079 | | vgg16 | 64 | 0.9944 | 0.6638 | 0.3214 | 0.664 | 0.6637 | 0.9554 | | hf_Bert | 4 | 0.9683 | 0.9018 | 0.3526 | 1.0011 | 0.6432 
| 0.8995 | | hf_DistilBert | 8 | 0.9211 | 0.9047 | 0.3214 | 1.0216 | 0.613 | 0.8537 | | lennard_jones | 1000 | 0.9995 | 0.9995 | 0.3711 | 0.9995 | 0.5646 | 0.9989 | | nvidia_deeprecommender | 256 | 0.5598 | 0.5598 | 0.4624 | 0.5598 | 0.5598 | 0.5598 | | hf_Reformer | 4 | 0.9872 | 0.9865 | 0.5793 | 0.9862 | 0.5232 | 0.9892 | | attention_is_all_you_need_pytorch | 256 | 0.9476 | 0.9243 | 0.2962 | 0.9678 | 0.4429 | 0.5961 | | pytorch_struct | 200 | 1.0 | 0.5079 | 0.4824 | 0.5097 | 0.4222 | 0.4335 | | tacotron2 | 64 | 0.9906 | 1.0301 | nan | 1.0227 | nan | 1.1618 | | hf_T5 | 8 | 0.9527 | 0.9415 | nan | 0.9326 | nan | 1.1462 | | hf_GPT2_large | 4 | 0.936 | 0.8833 | nan | nan | nan | 1.1258 | | dlrm | 2048 | nan | 0.7306 | nan | nan | nan | 0.7306 | | functorch_dp_cifar10 | 64 | 0.9961 | 0.8224 | 0.4445 | nan | nan | nan | | hf_Longformer | 2 | 0.9603 | 0.9604 | 0.2945 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | timm_vision_transformer_large | 8 | 196.2037 | 197.1578 | nan | nan | 191.4221 | 193.5334 | | hf_BigBird | 2 | 190.1664 | 201.9389 | 194.3638 | 206.6825 | 177.4783 | 191.1024 | | Background_Matting | 4 | 186.5218 | 182.2681 | 214.6763 | 172.7999 | 171.6872 | 172.9331 | | timm_nfnet | 128 | 205.6384 | 205.8111 | nan | 181.1867 | 143.9203 | 149.0891 | | hf_T5_large | 2 | 222.6584 | 258.7636 | nan | nan | 120.4652 | 122.9878 | | mobilenet_v2_quantized_qat | 96 | 147.5036 | 149.857 | nan | 100.8759 | 
106.4492 | 106.0634 | | Super_SloMo | 6 | 117.407 | 117.8498 | 135.5326 | 118.1915 | 106.1066 | 107.4231 | | yolov3 | 16 | 102.0974 | 102.6878 | 128.8773 | 88.8896 | 96.1829 | 97.4202 | | vgg16 | 64 | 106.182 | 106.3712 | 123.9653 | 106.3186 | 92.7065 | 93.0444 | | timm_regnet | 32 | 101.6377 | 101.7461 | 125.6072 | 93.6481 | 88.4908 | 91.7972 | | demucs | 4 | 77.8506 | 77.7653 | 77.7111 | 77.6892 | 77.9116 | 77.6122 | | resnet152 | 32 | 90.3118 | 85.1439 | 112.9332 | 105.6081 | 75.3426 | 77.8862 | | hf_Reformer | 4 | 83.2553 | 83.0531 | 84.0756 | 112.8907 | 73.7978 | 73.1662 | | attention_is_all_you_need_pytorch | 256 | 71.8557 | 74.2097 | 95.4696 | 75.2574 | 70.5959 | 71.3955 | | resnet50_quantized_qat | 32 | 93.0616 | 97.0726 | nan | 81.2646 | 69.2829 | 69.2294 | | mobilenet_v2 | 96 | 71.3399 | 71.4866 | 97.6214 | 53.3638 | 52.4706 | 53.4617 | | pytorch_unet | 1 | 58.6019 | 58.6446 | 69.0888 | 53.6281 | 50.3426 | 50.878 | | hf_Bart | 4 | 55.1957 | 56.6696 | 74.6043 | 59.6883 | 47.7886 | 47.6929 | | hf_Albert | 8 | 74.8313 | 75.1101 | 99.8974 | 48.1113 | 46.6384 | 46.5938 | | fastNLP_Bert | 6 | 59.6604 | 60.8315 | 80.3386 | 49.4814 | 43.2236 | 43.7049 | | timm_vovnet | 32 | 42.3671 | 42.6047 | 53.9644 | 42.8812 | 39.8212 | 39.2402 | | timm_efficientdet | 1 | 141.8198 | 152.0639 | 86.8265 | 183.3734 | 38.8933 | 102.8539 | | hf_DistilBert | 8 | 38.8379 | 40.6368 | 56.6624 | 73.2702 | 33.8138 | 33.5592 | | hf_GPT2 | 4 | 49.8723 | 53.8239 | 68.2085 | 138.2845 | 33.6281 | 33.5311 | | hf_Bert | 4 | 38.0626 | 39.1704 | 53.3751 | 50.2468 | 33.578 | 33.9581 | | resnet50 | 32 | 38.716 | 38.8574 | 50.6056 | 39.5381 | 33.1896 | 34.0345 | | speech_transformer | 32 | 54.0943 | 57.501 | 32.5768 | 58.9462 | 32.9094 | 32.9242 | | timm_efficientnet | 32 | 44.8429 | 51.7822 | 60.8621 | 51.8153 | 32.2923 | 38.8633 | | shufflenet_v2_x1_0 | 128 | 38.9357 | 35.1488 | 45.857 | 43.7256 | 25.0986 | 26.6214 | | BERT_pytorch | 16 | 55.2613 | 51.5522 | 42.004 | 47.3254 | 23.8863 | 23.9738 | 
| timm_resnest | 32 | 31.7201 | 31.533 | 39.2515 | 27.1091 | 21.1669 | 22.0762 | | mnasnet1_0 | 32 | 28.7116 | 25.8132 | 33.1104 | 31.697 | 20.0312 | 22.8623 | | pytorch_stargan | 16 | 24.2095 | 22.4783 | 25.9652 | nan | 19.5301 | 20.1333 | | mobilenet_v3_large | 32 | 31.6457 | 27.9603 | 30.4113 | 39.3104 | 16.513 | 24.2222 | | resnext50_32x4d | 8 | 26.7726 | 23.5412 | 22.1319 | 33.5073 | 13.7119 | 22.6145 | | densenet121 | 4 | 74.0353 | 62.6752 | 28.6974 | 86.0433 | 13.1056 | 52.9852 | | LearningToPaint | 96 | 15.745 | 14.7146 | 18.1615 | 15.8778 | 12.8009 | 13.7361 | | alexnet | 128 | 12.4398 | 12.423 | 15.4583 | 12.387 | 10.8959 | 10.8855 | | timm_vision_transformer | 8 | 23.8571 | 25.0941 | 16.2589 | 37.7166 | 9.8079 | 17.3686 | | nvidia_deeprecommender | 256 | 8.5322 | 8.8504 | 14.5787 | 8.7373 | 9.4255 | 8.854 | | pytorch_CycleGAN_and_pix2pix | 1 | 16.3097 | 16.1042 | 12.6263 | 19.8478 | 9.3375 | 11.2084 | | tts_angular | 64 | 9.3702 | 9.8407 | 9.5836 | 9.7957 | 9.2237 | 9.2179 | | squeezenet1_1 | 32 | 13.3195 | 12.3102 | 12.0337 | 14.7156 | 7.6221 | 10.318 | | resnet18 | 16 | 12.9591 | 10.8336 | 10.3128 | 13.5854 | 6.7372 | 9.9563 | | pytorch_struct | 200 | 3.7837 | 4.9733 | 4.247 | 5.3221 | 2.0908 | 3.3425 | | dcgan | 32 | 2.7465 | 2.5747 | 2.1002 | 3.4157 | 1.6434 | 2.563 | | drq | 1 | 2.8697 | 3.4959 | 1.7083 | 5.1146 | 1.2508 | 2.7949 | | soft_actor_critic | 256 | 1.0085 | 1.306 | 0.9541 | 1.5237 | 0.7395 | 1.1886 | | lennard_jones | 1000 | 1.1148 | 1.3085 | 1.0279 | 1.623 | 0.6388 | 1.2125 | | tacotron2 | 64 | 2745.6168 | 3112.4221 | nan | 4123.8823 | nan | 3061.2586 | | dlrm | 2048 | nan | 488.6141 | nan | nan | nan | 495.3396 | | hf_GPT2_large | 4 | 240.7231 | 245.7275 | nan | nan | nan | 163.1488 | | hf_T5 | 8 | 182.4751 | 191.7887 | nan | 149.127 | nan | 118.2279 | | hf_Longformer | 2 | 144.6938 | 156.2746 | 171.4435 | nan | nan | nan | | functorch_dp_cifar10 | 64 | 11.5783 | 10.999 | 5.4812 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | 
nan | nan | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
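For anyone who wants to recompute the summary numbers from the per-model tables above, here is a minimal sketch of the geometric-mean aggregation. It is not the dashboard's own script. `parse_table` and `geomean_speedup` are hypothetical helper names, and the sketch assumes that a `0.0` speedup (or `nan`) marks a run that failed and should be left out of the mean, which matches the `nan` entries for the same models in the latency tables. The three sample rows are copied from the torchbench speedup table.

```python
import math

# Three rows copied from the torchbench "Performance speedup" table above.
SAMPLE = """
+-----------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name      | bs   | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| dlrm      | 2048 | 0.0    | 1.0818    | 0.0            | 0.0             | 0.0      | 1.0262                 |
| hf_T5     | 8    | 0.9994 | 0.9522    | 0.0            | 1.2231          | 0.0      | 1.5418                 |
| tacotron2 | 64   | 0.9826 | 0.8497    | 0.0            | 0.7589          | 0.0      | 0.9121                 |
+-----------+------+--------+-----------+----------------+-----------------+----------+------------------------+
"""


def parse_table(text):
    """Parse an ASCII pipe table into a list of dicts keyed by the header row."""
    header = None
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def geomean_speedup(rows, column):
    """Geometric mean of one backend column, skipping failed runs.

    Assumption: 0.0 and nan both mean "did not run or failed accuracy".
    Returns nan when no model produced a usable number.
    """
    values = []
    for row in rows:
        value = float(row[column])
        if math.isnan(value) or value <= 0.0:
            continue
        values.append(value)
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


rows = parse_table(SAMPLE)
for backend in ("aot_eager", "nvprims_nvfuser", "inductor", "inductor_no_cudagraphs"):
    print(f"{backend}: {geomean_speedup(rows, backend):.4f}x")
```

The geometric mean is used rather than the arithmetic mean because speedups are ratios: a 2x win and a 0.5x loss should average to 1x, not 1.25x. Note that skipping failures makes a backend with many failed models look better than it is, so the pass rate should always be read alongside the mean.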

huggingface suite with float32 precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
|                  name                   | bs  | eager  | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
|            YituTechConvBert             |  1  | 1.0247 |  0.9054   |     1.7558     |      0.755      |   3.29   |         1.442          |
|               DistillGPT2               |  1  | 1.0304 |  0.9271   |     1.0444     |       0.0       |  2.3582  |         1.7705         |
|                CamemBert                |  1  | 1.0525 |  0.9243   |     1.3155     |      0.741      |  2.3087  |         1.4984         |
|       MT5ForConditionalGeneration       |  8  | 1.0265 |  0.9204   |     1.4082     |     1.0165      |  2.182   |         1.8958         |
|               GoogleFnet                |  1  | 0.9843 |   0.801   |     0.9796     |       0.0       |  2.0693  |         1.1642         |
|          MobileBertForMaskedLM          | 32  | 1.0247 |  0.9113   |     1.1339     |       0.0       |  2.0154  |         1.5286         |
|      GPT2ForSequenceClassification      |  4  | 1.0005 |   0.977   |      0.0       |     0.7088      |  1.7936  |         1.7784         |
|       T5ForConditionalGeneration        |  4  | 1.0023 |  0.9362   |     0.7236     |     1.1421      |  1.4221  |         1.4149         |
|     MobileBertForQuestionAnswering      | 64  | 1.0243 |  0.9355   |     1.0098     |       0.0       |  1.3908  |         1.2175         |
|       ElectraForQuestionAnswering       | 64  | 1.0006 |  0.9848   |      0.0       |     1.2388      |  1.3583  |         1.3417         |
|                 T5Small                 |  1  | 1.0177 |  0.9309   |     0.9705     |     0.9883      |  1.3571  |         1.1326         |
|     M2M100ForConditionalGeneration      |  8  | 1.0127 |  0.9627   |     0.985      |     0.8155      |  1.3536  |         1.2633         |
|    LayoutLMForSequenceClassification    | 16  | 1.0001 |  0.9887   |     0.7373     |     1.1463      |  1.2578  |         1.2478         |
|    MegatronBertForQuestionAnswering     | 16  | 1.043  |  1.0143   |     0.7595     |     0.8738      |  1.1961  |         1.0987         |
|         MegatronBertForCausalLM         | 16  | 1.032  |  1.0051   |     0.7382     |     0.9412      |  1.1802  |         1.0878         |
|             OPTForCausalLM              | 32  | 1.0033 |  0.9331   |     0.7113     |     0.4574      |  1.1559  |         1.1744         |
|     PLBartForConditionalGeneration      | 16  | 1.0148 |  0.9728   |     0.8443     |      0.826      |  1.1506  |         1.144          |
|             XGLMForCausalLM             |  8  | 1.011  |  0.9401   |     0.7406     |     0.3212      |  1.1393  |         1.1472         |
|     DistilBertForQuestionAnswering      | 64  | 0.9996 |  0.9836   |     0.7131     |     0.5226      |  1.1266  |         1.1067         |
|           RobertaForCausalLM            | 64  | 1.0003 |  0.9637   |     0.746      |     0.9729      |  1.1075  |         1.106          |
|                 BigBird                 |  1  | 0.9911 |  0.9352   |     0.9991     |       0.0       |  1.0935  |         0.9987         |
|           DebertaForMaskedLM            |  4  | 0.9147 |  0.7904   |     0.7291     |     0.6373      |   1.08   |         1.029          |
|      MBartForConditionalGeneration      | 16  | 1.0122 |  0.9801   |     0.7579     |       0.0       |  1.0605  |         1.0399         |
|      BartForConditionalGeneration       |  2  | 1.0004 |  0.9871   |      0.0       |     0.4452      |  1.0534  |         1.0449         |
|         Speech2Text2ForCausalLM         | 128 | 0.9993 |  0.9245   |     0.6616     |     0.9244      |  1.0399  |         1.0588         |
|       DebertaForQuestionAnswering       |  8  | 0.997  |  0.9758   |     0.6828     |     0.8671      |  1.0385  |         1.2049         |
|     PegasusForConditionalGeneration     | 16  | 1.0086 |  0.9779   |     0.7595     |      0.883      |  1.0346  |         1.0429         |
|             BartForCausalLM             |  4  | 1.0005 |  0.9665   |     0.7547     |     0.9968      |  1.0219  |         1.0299         |
|             BertForMaskedLM             | 64  | 1.0002 |  0.9605   |     0.7295     |     0.9724      |  1.0123  |         1.0101         |
|          DistilBertForMaskedLM          | 64  | 0.9998 |  0.9505   |     0.7127     |     0.6347      |  0.9987  |         1.0148         |
| BlenderbotSmallForConditionalGeneration | 64  | 1.0011 |  0.9394   |      0.0       |     0.9264      |  0.9974  |         1.0021         |
|            PLBartForCausalLM            | 32  | 1.0071 |  0.9311   |     0.7166     |     0.9085      |  0.9668  |         0.9935         |
|            TrOCRForCausalLM             | 32  | 1.0016 |  0.9548   |     0.7342     |     0.9467      |  0.9564  |         0.9663         |
|            MBartForCausalLM             | 32  | 1.0018 |  0.9542   |     0.7318     |       0.0       |  0.9513  |         0.9623         |
|           PegasusForCausalLM            | 32  | 0.9998 |  0.9525   |     0.7316     |      0.94       |  0.947   |         0.9559         |
|       BlenderbotSmallForCausalLM        | 64  | 1.0012 |  0.9093   |     0.683      |     0.9134      |  0.9317  |         0.9605         |
|           LayoutLMForMaskedLM           | 16  | 1.0003 |  0.9695   |      0.0       |     1.0841      |   0.0    |         1.1632         |
|       RobertaForQuestionAnswering       | 128 | 1.0004 |  0.9926   |      0.0       |     1.0284      |   0.0    |         1.0593         |
|        BertForQuestionAnswering         | 128 | 1.0001 |   0.993   |      0.0       |     1.0256      |   0.0    |         1.0604         |
|           ElectraForCausalLM            | 32  | 1.0002 |  0.9314   |      0.0       |     1.0343      |   0.0    |         1.382          |
|       AlbertForQuestionAnswering        |  4  | 1.0004 |  1.0015   |      0.0       |     1.2337      |   0.0    |         1.243          |
|            AlbertForMaskedLM            |  4  | 1.0007 |  0.9999   |      0.0       |     1.2316      |   0.0    |         1.2409         |
|          AllenaiLongformerBase          |  1  | 0.9342 |  0.8536   |     0.775      |       0.0       |   0.0    |          0.0           |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
|                  name                   | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser |  inductor   | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
|             BartForCausalLM             | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|    MegatronBertForQuestionAnswering     | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|            TrOCRForCausalLM             | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|            YituTechConvBert             | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|      BartForConditionalGeneration       | 1  | pass  |   pass    |  fail_to_run   |      pass       |    pass     |          pass          |
|                 BigBird                 | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|               DistillGPT2               | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|               GoogleFnet                | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|     M2M100ForConditionalGeneration      | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|         MegatronBertForCausalLM         | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|          MobileBertForMaskedLM          | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|       T5ForConditionalGeneration        | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|     MobileBertForQuestionAnswering      | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|           PegasusForCausalLM            | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|     PegasusForConditionalGeneration     | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|             XGLMForCausalLM             | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|            XLNetLMHeadModel             | 1  | pass  |   pass    |      pass      |   fail_to_run   |    pass     |          pass          |
|            AlbertForMaskedLM            | 1  | pass  |   pass    |  fail_to_run   |   fail_to_run   |    pass     |          pass          |
|       AlbertForQuestionAnswering        | 1  | pass  |   pass    |  fail_to_run   |   fail_to_run   |    pass     |          pass          |
|      GPT2ForSequenceClassification      | 1  | pass  |   pass    |  fail_to_run   |   fail_to_run   |    pass     |          pass          |
|             BertForMaskedLM             | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|                 T5Small                 | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|         Speech2Text2ForCausalLM         | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|           ElectraForCausalLM            | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|        BertForQuestionAnswering         | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|       BlenderbotSmallForCausalLM        | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
| BlenderbotSmallForConditionalGeneration | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|                CamemBert                | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|           DebertaForMaskedLM            | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|       DebertaForQuestionAnswering       | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|          DistilBertForMaskedLM          | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|       RobertaForQuestionAnswering       | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|     DistilBertForQuestionAnswering      | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|       ElectraForQuestionAnswering       | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|           LayoutLMForMaskedLM           | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|    LayoutLMForSequenceClassification    | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|            MBartForCausalLM             | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|       MT5ForConditionalGeneration       | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|             OPTForCausalLM              | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|            PLBartForCausalLM            | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|           RobertaForCausalLM            | 1  | pass  |   pass    |      pass      |      pass       |    pass     |          pass          |
|      MBartForConditionalGeneration      | 1  | pass  |   pass    |      pass      |      pass       | fail_to_run |      fail_to_run       |
|     PLBartForConditionalGeneration      | 1  | pass  |   pass    |      pass      |      pass       | fail_to_run |      fail_to_run       |
|          AllenaiLongformerBase          | 1  | pass  |   pass    |      pass      |   fail_to_run   | fail_to_run |      fail_to_run       |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | DebertaForQuestionAnswering | 8 | 4.8667 | 10.2825 | 34.0919 | 79.4779 | 94.4845 | 34.2982 | | DebertaForMaskedLM | 4 | 4.8161 | 10.4627 | 33.8165 | 81.3628 | 90.6026 | 33.4058 | | XGLMForCausalLM | 8 | 2.4597 | 9.9468 | 21.6593 | 231.0812 | 68.4101 | 65.3912 | | M2M100ForConditionalGeneration | 8 | 3.0769 | 11.7116 | 21.8081 | 315.1565 | 60.9574 | 57.4176 | | MobileBertForMaskedLM | 32 | 8.2418 | 23.7117 | 40.9471 | nan | 55.404 | 55.6769 | | MobileBertForQuestionAnswering | 64 | 8.254 | 23.3055 | 41.7309 | nan | 53.1692 | 50.8492 | | PegasusForConditionalGeneration | 16 | 2.8333 | 11.95 | 20.2551 | 343.1092 | 47.0151 | 42.1283 | | BartForConditionalGeneration | 2 | 3.0849 | 12.5952 | nan | 340.5584 | 45.8046 | 43.8262 | | MBartForConditionalGeneration | 16 | 2.9929 | 12.6985 | 21.4272 | nan | 45.4478 | 43.063 | | YituTechConvBert | 1 | 2.2658 | 8.1525 | 12.437 | 129.5594 | 39.6762 | 38.6469 | | BigBird | 1 | 7.5254 | 13.0826 | 24.8751 | nan | 39.627 | 24.9385 | | MegatronBertForCausalLM | 16 | 3.1134 | 10.4385 | 16.8143 | 239.5565 | 36.7295 | 33.6019 | | MegatronBertForQuestionAnswering | 16 | 3.1536 | 10.5129 | 16.6517 | 228.2878 | 35.9129 | 33.342 | | MT5ForConditionalGeneration | 8 | 3.709 | 10.9274 | 18.13 | 141.9175 | 35.2452 | 34.1912 | | BlenderbotSmallForConditionalGeneration | 64 | 1.9354 | 8.5412 | nan | 198.2294 | 32.2069 | 30.1424 | | T5ForConditionalGeneration | 4 | 2.376 | 7.5137 | 12.0802 | 
93.0038 | 31.4461 | 30.0637 | | T5Small | 1 | 2.4237 | 7.5063 | 11.1985 | 86.4923 | 29.755 | 27.836 | | LayoutLMForSequenceClassification | 16 | 1.8973 | 5.867 | 8.8296 | 93.3368 | 28.3629 | 27.0014 | | PLBartForConditionalGeneration | 16 | 1.579 | 6.3694 | 9.9324 | 139.3419 | 27.7522 | 26.8652 | | PegasusForCausalLM | 32 | 1.2045 | 4.7715 | 7.7689 | 102.3546 | 22.5212 | 20.0551 | | ElectraForQuestionAnswering | 64 | 1.4899 | 5.2345 | nan | 101.9133 | 21.7784 | 20.289 | | BertForMaskedLM | 64 | 1.5142 | 5.5065 | 8.1783 | 102.5123 | 21.5075 | 20.7434 | | GoogleFnet | 1 | 0.9475 | 2.9802 | 9.143 | nan | 21.2596 | 13.5047 | | MBartForCausalLM | 32 | 1.126 | 4.7182 | 7.2157 | nan | 21.226 | 20.3334 | | RobertaForCausalLM | 64 | 1.537 | 5.2693 | 8.0838 | 108.4405 | 20.9157 | 19.7176 | | TrOCRForCausalLM | 32 | 1.1276 | 4.9577 | 7.1452 | 99.604 | 19.9331 | 19.0177 | | BartForCausalLM | 4 | 1.1861 | 4.7397 | 7.2735 | 101.1833 | 19.4947 | 18.5459 | | OPTForCausalLM | 32 | 1.2157 | 5.2868 | 8.6714 | 90.8138 | 19.0346 | 17.9527 | | CamemBert | 1 | 1.574 | 5.5269 | 7.582 | 111.1283 | 18.6234 | 17.9075 | | GPT2ForSequenceClassification | 4 | 1.5075 | 5.057 | nan | 84.3806 | 16.303 | 15.46 | | BlenderbotSmallForCausalLM | 64 | 0.8051 | 3.2453 | 5.1503 | 63.3528 | 14.5794 | 14.0186 | | Speech2Text2ForCausalLM | 128 | 0.7194 | 2.6425 | 4.4611 | 43.632 | 14.4999 | 13.148 | | PLBartForCausalLM | 32 | 0.65 | 2.5313 | 3.8731 | 52.5208 | 13.4832 | 12.8423 | | DistilBertForMaskedLM | 64 | 0.6252 | 2.619 | 4.5423 | 48.8768 | 12.9181 | 12.197 | | DistilBertForQuestionAnswering | 64 | 0.6355 | 2.7755 | 4.8137 | 47.1429 | 12.5154 | 11.7047 | | DistillGPT2 | 1 | 0.8054 | 2.7279 | 3.7019 | nan | 12.2819 | 12.5005 | | ElectraForCausalLM | 32 | 1.5242 | 5.434 | nan | 96.4277 | nan | 25.1256 | | LayoutLMForMaskedLM | 16 | 1.9149 | 5.981 | nan | 101.7411 | nan | 20.7418 | | BertForQuestionAnswering | 128 | 1.5314 | 5.5279 | nan | 101.2261 | nan | 19.5797 | | RobertaForQuestionAnswering | 128 | 
1.5223 | 5.227 | nan | 101.1161 | nan | 18.7271 | | AlbertForMaskedLM | 4 | 1.1468 | 4.5759 | nan | 116.7989 | nan | 16.0497 | | AlbertForQuestionAnswering | 4 | 1.2813 | 4.7001 | nan | 108.5022 | nan | 15.7042 | | AllenaiLongformerBase | 1 | 6.131 | 13.3463 | 55.8287 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | GPT2ForSequenceClassification | 4 | 0.9343 | 0.9093 | nan | 1.1727 | 1.0595 | 1.1224 | | T5Small | 1 | 1.0 | 0.9029 | 0.3414 | 0.9118 | 0.8453 | 1.0257 | | T5ForConditionalGeneration | 4 | 1.0 | 0.9597 | 0.3543 | 0.972 | 0.8215 | 1.0502 | | BigBird | 1 | 0.9979 | 0.9534 | 0.4203 | nan | 0.8186 | 1.0078 | | XGLMForCausalLM | 8 | 0.9848 | 0.9267 | 0.3971 | 0.9742 | 0.8157 | 0.8962 | | DistillGPT2 | 1 | 0.9984 | 0.8113 | 0.3765 | nan | 0.8057 | 0.9257 | | YituTechConvBert | 1 | 0.9863 | 0.8581 | 0.3681 | 0.8984 | 0.791 | 0.8725 | | PegasusForConditionalGeneration | 16 | 0.9985 | 0.9629 | 0.3704 | 1.0877 | 0.7893 | 0.9466 | | GoogleFnet | 1 | 0.9979 | 0.9451 | 0.3714 | nan | 0.7676 | 0.9364 | | M2M100ForConditionalGeneration | 8 | 1.0 | 0.9676 | 0.3658 | 0.9852 | 0.7661 | 0.9595 | | MT5ForConditionalGeneration | 8 | 1.0037 | 0.8873 | 0.4148 | 0.9335 | 0.7625 | 0.8497 | | CamemBert | 1 | 0.998 | 0.8252 | 0.3612 | 0.8613 | 0.7151 | 0.8696 | | BartForConditionalGeneration | 2 | 1.0 | 0.8935 | nan | 0.9759 | 0.6979 | 0.8969 | | PLBartForCausalLM | 32 | 0.9999 | 0.861 | 0.3948 | 0.9443 | 0.6852 | 0.806 | | PegasusForCausalLM | 32 
| 0.9594 | 0.8885 | 0.3909 | 0.9963 | 0.6791 | 0.8947 | | BlenderbotSmallForCausalLM | 64 | 1.0 | 0.8401 | 0.3879 | 0.902 | 0.6618 | 0.7576 | | PLBartForConditionalGeneration | 16 | 0.9998 | 0.8959 | 0.3581 | 1.0146 | 0.6553 | 0.8256 | | MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8671 | 0.3483 | 0.9908 | 0.6467 | 0.797 | | OPTForCausalLM | 32 | 0.9982 | 0.8656 | 0.3608 | 0.9159 | 0.6407 | 0.8248 | | BartForCausalLM | 4 | 1.0 | 0.9121 | 0.3643 | 0.9998 | 0.6359 | 0.8919 | | MBartForConditionalGeneration | 16 | 1.0 | 0.8583 | 0.3438 | nan | 0.63 | 0.7668 | | MegatronBertForCausalLM | 16 | 0.9995 | 0.8826 | 0.352 | 0.9984 | 0.6276 | 0.7821 | | LayoutLMForSequenceClassification | 16 | 1.0 | 0.9348 | 0.3324 | 1.1087 | 0.6247 | 0.9889 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0 | 0.8975 | nan | 1.0067 | 0.6148 | 0.8546 | | TrOCRForCausalLM | 32 | 0.9999 | 0.8898 | 0.3743 | 0.9997 | 0.6094 | 0.7677 | | MBartForCausalLM | 32 | 0.9999 | 0.89 | 0.3743 | nan | 0.6078 | 0.7715 | | ElectraForQuestionAnswering | 64 | 1.0 | 0.9524 | nan | 1.1607 | 0.6054 | 0.9848 | | DistilBertForMaskedLM | 64 | 1.0 | 0.8899 | 0.3665 | 0.888 | 0.6017 | 0.8152 | | DistilBertForQuestionAnswering | 64 | 1.0 | 0.9373 | 0.3177 | 1.1317 | 0.595 | 0.7558 | | Speech2Text2ForCausalLM | 128 | 0.9552 | 0.8765 | 0.3524 | 0.908 | 0.5787 | 0.8128 | | BertForMaskedLM | 64 | 1.0 | 0.9219 | 0.3646 | 0.9904 | 0.5613 | 0.7534 | | RobertaForCausalLM | 64 | 0.9986 | 0.9206 | 0.3641 | 0.989 | 0.5604 | 0.7519 | | MobileBertForMaskedLM | 32 | 0.9998 | 0.9103 | 0.3242 | nan | 0.4624 | 0.6 | | DebertaForMaskedLM | 4 | 1.0 | 0.9851 | 0.3551 | 0.9719 | 0.386 | 0.9713 | | MobileBertForQuestionAnswering | 64 | 1.0 | 0.984 | 0.2587 | nan | 0.3725 | 0.4638 | | DebertaForQuestionAnswering | 8 | 0.9816 | 1.063 | 0.3072 | 1.1591 | 0.2902 | 1.1588 | | AlbertForQuestionAnswering | 4 | 1.0 | 0.9425 | nan | 0.7394 | nan | 1.2564 | | AlbertForMaskedLM | 4 | 1.0 | 0.9255 | nan | 0.7324 | nan | 1.2385 | | 
LayoutLMForMaskedLM | 16 | 1.0 | 0.9409 | nan | 0.9929 | nan | 0.9207 | | ElectraForCausalLM | 32 | 0.9983 | 0.8817 | nan | 0.844 | nan | 0.8074 | | BertForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | nan | 0.6814 | | RobertaForQuestionAnswering | 128 | 1.0 | 0.968 | nan | 1.2359 | nan | 0.6814 | | AllenaiLongformerBase | 1 | 0.9982 | 0.9521 | 0.3207 | nan | nan | nan | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | BigBird | 1 | 183.2714 | 197.5589 | 177.5024 | nan | 173.0482 | 187.4254 | | BartForConditionalGeneration | 2 | 149.8208 | 151.6728 | nan | 336.8779 | 142.5464 | 143.5385 | | BartForCausalLM | 4 | 123.3493 | 127.703 | 163.3773 | 123.7473 | 120.8147 | 119.5945 | | BlenderbotSmallForConditionalGeneration | 64 | 118.9329 | 126.6233 | nan | 128.4823 | 119.7002 | 118.8634 | | MobileBertForQuestionAnswering | 64 | 128.3492 | 138.7587 | 162.1821 | nan | 99.8497 | 111.4061 | | BertForMaskedLM | 64 | 100.7764 | 105.0322 | 138.2727 | 103.5624 | 99.8353 | 99.8087 | | MBartForConditionalGeneration | 16 | 104.4648 | 116.9344 | 138.8334 | nan | 99.7629 | 101.2145 | | PegasusForConditionalGeneration | 16 | 104.239 | 105.8704 | 137.5009 | 116.63 | 99.7003 | 100.31 | | RobertaForCausalLM | 64 | 109.1567 | 113.1935 | 146.4955 | 112.0294 | 98.626 | 98.6817 | | ElectraForQuestionAnswering | 64 | 124.7299 | 126.6865 | nan | 100.7492 | 91.8775 | 92.9267 | | PegasusForCausalLM | 32 | 85.3565 | 89.4417 | 116.7522 | 91.1132 | 90.7424 | 89.1468 | | 
MBartForCausalLM | 32 | 85.7494 | 89.7854 | 117.3954 | nan | 90.3487 | 89.1164 | | LayoutLMForSequenceClassification | 16 | 113.1357 | 114.4238 | 153.3843 | 98.6608 | 90.1079 | 90.6208 | | TrOCRForCausalLM | 32 | 85.9849 | 90.3459 | 117.4732 | 90.6916 | 90.0196 | 88.748 | | DebertaForQuestionAnswering | 8 | 81.8412 | 83.5253 | 119.8841 | 94.156 | 78.8462 | 67.5561 | | T5ForConditionalGeneration | 4 | 103.8923 | 111.4283 | 144.527 | 90.9549 | 73.946 | 73.5219 | | MegatronBertForCausalLM | 16 | 78.0116 | 79.7026 | 108.4576 | 84.484 | 73.3289 | 74.1425 | | XGLMForCausalLM | 8 | 78.9395 | 85.5539 | 108.7486 | 246.683 | 70.8258 | 70.5198 | | MobileBertForMaskedLM | 32 | 128.1173 | 182.7003 | 115.8858 | nan | 69.6941 | 94.1241 | | BlenderbotSmallForCausalLM | 64 | 64.6693 | 71.0767 | 95.0869 | 70.8245 | 69.5204 | 67.2489 | | MegatronBertForQuestionAnswering | 16 | 73.0495 | 73.7601 | 98.4941 | 84.2401 | 67.8688 | 68.6586 | | M2M100ForConditionalGeneration | 8 | 85.6885 | 88.9342 | 101.8611 | 104.7103 | 66.6954 | 70.2398 | | DistilBertForMaskedLM | 64 | 63.2773 | 66.6006 | 88.866 | 99.7533 | 63.4887 | 62.292 | | GPT2ForSequenceClassification | 4 | 102.1552 | 104.6354 | nan | 145.3568 | 56.9777 | 57.4328 | | DebertaForMaskedLM | 4 | 65.101 | 76.2556 | 82.2715 | 92.8118 | 56.298 | 58.4277 | | OPTForCausalLM | 32 | 61.5938 | 69.4899 | 88.0024 | 135.7662 | 53.8281 | 52.9207 | | T5Small | 1 | 56.8455 | 60.0924 | 57.3452 | 56.0707 | 46.3697 | 50.4054 | | PLBartForConditionalGeneration | 16 | 48.916 | 50.741 | 61.2997 | 64.0417 | 43.0347 | 42.9541 | | PLBartForCausalLM | 32 | 41.2999 | 44.6621 | 57.8415 | 45.4471 | 42.7713 | 41.6652 | | MT5ForConditionalGeneration | 8 | 75.8863 | 82.1937 | 63.6739 | 74.1723 | 36.2703 | 41.5661 | | DistilBertForQuestionAnswering | 64 | 39.6801 | 40.4569 | 55.7742 | 75.9978 | 35.3504 | 35.8831 | | Speech2Text2ForCausalLM | 128 | 35.2243 | 37.7738 | 53.2739 | 37.972 | 34.1779 | 33.1294 | | YituTechConvBert | 1 | 48.8084 | 55.6075 | 28.3281 | 
64.8499 | 16.3629 | 35.8636 | | CamemBert | 1 | 29.649 | 34.2262 | 23.3361 | 41.7762 | 13.9091 | 21.6496 | | GoogleFnet | 1 | 18.904 | 23.7467 | 19.7254 | nan | 10.9111 | 16.4785 | | DistillGPT2 | 1 | 16.8581 | 19.3686 | 16.4646 | nan | 8.8014 | 11.0413 | | AlbertForMaskedLM | 4 | 382.9366 | 383.2998 | nan | 311.0094 | nan | 309.1329 | | AlbertForQuestionAnswering | 4 | 380.456 | 379.8512 | nan | 307.8308 | nan | 306.3616 | | RobertaForQuestionAnswering | 128 | 147.856 | 148.8595 | nan | 143.8089 | nan | 139.7627 | | BertForQuestionAnswering | 128 | 147.3884 | 148.4158 | nan | 143.6026 | nan | 139.0456 | | LayoutLMForMaskedLM | 16 | 136.6459 | 141.1059 | nan | 126.0272 | nan | 117.5576 | | ElectraForCausalLM | 32 | 105.6065 | 113.55 | nan | 102.1777 | nan | 76.4817 | | AllenaiLongformerBase | 1 | 97.7054 | 108.4396 | 119.2147 | nan | nan | nan | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

timm_models suite with float32 precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | ghostnet_100 | 128 | 0.9992 | 0.973 | 0.8276 | 1.2729 | 1.7759 | 1.7411 | | lcnet_050 | 128 | 0.9569 | 0.9479 | 0.7669 | 1.3052 | 1.6059 | 1.5276 | | regnety_002 | 128 | 0.9813 | 0.998 | 0.8591 | 0.9645 | 1.5014 | 1.3138 | | dm_nfnet_f0 | 128 | 0.9991 | 1.0002 | 0.0 | 1.1375 | 1.4301 | 1.3818 | | hrnet_w18 | 128 | 1.0 | 0.9981 | 0.0 | 1.2838 | 1.3861 | 1.3485 | | nfnet_l0 | 128 | 1.0 | 0.7884 | 0.0 | 1.1032 | 1.3663 | 1.3201 | | xcit_large_24_p8_224 | 5 | 1.0031 | 0.9829 | 0.7746 | 0.0 | 1.3435 | 1.3129 | | volo_d1_224 | 64 | 0.9998 | 0.9944 | 0.8007 | 0.0 | 1.3433 | 1.3248 | | res2net50_14w_8s | 128 | 0.9999 | 0.9993 | 0.0 | 1.2521 | 1.3116 | 1.2777 | | mobilenetv3_large_100 | 128 | 0.966 | 0.9602 | 0.7644 | 1.2872 | 1.2976 | 1.3127 | | dla102 | 128 | 0.9999 | 1.0007 | 0.0 | 1.2838 | 1.2924 | 1.2801 | | resnest101e | 64 | 0.9999 | 1.0026 | 0.0 | 1.1685 | 1.2886 | 1.2443 | | gluon_inception_v3 | 128 | 1.0 | 0.999 | 0.0 | 1.1297 | 1.2826 | 1.2639 | | adv_inception_v3 | 128 | 0.9999 | 0.9985 | 0.0 | 1.1295 | 1.2822 | 1.2659 | | inception_v3 | 128 | 1.0001 | 0.9986 | 0.0 | 1.1292 | 1.2814 | 1.266 | | mobilenetv2_100 | 128 | 0.9653 | 0.9603 | 0.706 | 1.2765 | 1.2799 | 1.294 | | crossvit_9_240 | 128 | 1.0001 | 0.9976 | 0.7598 | 1.0396 | 1.2677 | 1.2404 | | res2next50 | 128 | 0.9998 | 1.0008 | 0.0 | 1.181 | 1.2623 | 1.2284 | | fbnetv3_b | 128 | 0.965 | 0.9608 | 0.7609 | 1.2334 | 1.257 | 1.2709 | | sebotnet33ts_256 | 64 | 0.9758 | 0.8045 | 0.0 | 0.0 | 1.2529 | 1.2484 | | coat_lite_mini | 128 | 0.9999 | 0.9819 | 0.8343 | 1.0762 | 1.2514 | 1.24 | | convit_base | 64 | 0.9997 | 0.9983 
| 0.0 | 0.0 | 1.2511 | 1.2323 | | eca_botnext26ts_256 | 128 | 0.9869 | 0.7715 | 0.0 | 0.0 | 1.246 | 1.2393 | | tf_efficientnet_b0 | 128 | 0.9771 | 0.7837 | 0.0 | 1.1633 | 1.2454 | 1.2524 | | gmixer_24_224 | 128 | 0.9999 | 0.8088 | 0.0 | 1.0432 | 1.2429 | 1.2125 | | eca_halonext26ts | 128 | 0.9873 | 0.7782 | 0.0 | 0.0 | 1.237 | 1.2313 | | mnasnet_100 | 128 | 0.9663 | 0.9635 | 0.7855 | 1.2556 | 1.2256 | 1.2415 | | jx_nest_base | 32 | 1.0002 | 0.9926 | 0.7306 | 0.0 | 1.2243 | 1.1993 | | botnet26t_256 | 128 | 0.9856 | 0.9818 | 0.7869 | 0.0 | 1.2238 | 1.2349 | | fbnetc_100 | 128 | 0.9668 | 0.9621 | 0.7862 | 1.2447 | 1.2116 | 1.227 | | rexnet_100 | 128 | 0.9735 | 0.8163 | 0.0 | 1.1607 | 1.1996 | 1.2051 | | selecsls42b | 128 | 0.9999 | 0.9976 | 0.8158 | 1.2164 | 1.1957 | 1.1838 | | res2net101_26w_4s | 64 | 0.9999 | 0.9967 | 0.7721 | 1.0969 | 1.1929 | 1.1499 | | spnasnet_100 | 128 | 0.9619 | 0.9577 | 0.7755 | 1.2234 | 1.1913 | 1.2149 | | tinynet_a | 128 | 0.9649 | 0.7754 | 0.6201 | 1.1465 | 1.1861 | 1.1962 | | cspdarknet53 | 64 | 0.9577 | 0.9537 | 0.7358 | 1.1644 | 1.1714 | 1.182 | | gmlp_s16_224 | 128 | 0.9999 | 0.9492 | 0.0 | 1.0358 | 1.1691 | 1.1597 | | pnasnet5large | 16 | 0.9997 | 0.9981 | 0.0 | 1.0896 | 1.1683 | 1.1513 | | pit_b_224 | 64 | 1.0002 | 0.9991 | 0.0 | 1.0322 | 1.1644 | 1.1541 | | dpn107 | 32 | 0.9575 | 0.9495 | 0.7646 | 1.0264 | 1.1556 | 1.1663 | | mobilevit_s | 64 | 0.9797 | 0.7619 | 0.0 | 0.0 | 1.1531 | 1.1531 | | tf_mixnet_l | 128 | 0.9858 | 0.888 | 0.0 | 1.0939 | 1.1519 | 1.1513 | | mixnet_l | 128 | 0.9852 | 0.8857 | 0.0 | 1.0982 | 1.1414 | 1.1397 | | ese_vovnet19b_dw | 128 | 0.9789 | 0.9776 | 0.7441 | 1.1496 | 1.1394 | 1.1429 | | poolformer_m36 | 64 | 0.9999 | 0.9983 | 0.0 | 0.0 | 1.1291 | 1.1115 | | repvgg_a2 | 128 | 0.965 | 0.9618 | 0.8278 | 1.1373 | 1.1188 | 1.1198 | | cait_m36_384 | 4 | 1.0002 | 1.0264 | 0.0 | 0.0 | 1.1156 | 1.0969 | | twins_pcpvt_base | 64 | 1.0 | 0.9987 | 0.7499 | 0.0 | 1.0928 | 1.057 | | convnext_base | 64 | 0.9999 | 0.9987 | 
0.0 | 0.0 | 1.0902 | 1.0921 | | swin_base_patch4_window7_224 | 64 | 1.0001 | 0.9787 | 0.0 | 0.0 | 1.0797 | 1.0722 | | swsl_resnext101_32x16d | 32 | 0.9997 | 1.0 | 0.0 | 1.1088 | 1.0765 | 1.0427 | | beit_base_patch16_224 | 64 | 0.9999 | 0.9809 | 0.0 | 0.0 | 1.068 | 1.0565 | | convmixer_768_32 | 32 | 0.9998 | 0.9999 | 0.0 | 0.0 | 1.061 | 1.0582 | | deit_base_distilled_patch16_224 | 64 | 0.9998 | 0.9965 | 0.766 | 0.9802 | 1.0533 | 1.0434 | | gernet_l | 128 | 0.9744 | 0.9725 | 0.822 | 1.0986 | 1.0505 | 1.0467 | | mixer_b16_224 | 128 | 0.9998 | 0.9778 | 0.0 | 0.8748 | 1.0495 | 1.0438 | | gluon_xception65 | 32 | 0.9999 | 0.9947 | 0.0 | 1.0815 | 1.046 | 1.0352 | | vit_base_patch16_224 | 64 | 1.0001 | 0.9979 | 0.7672 | 0.9505 | 1.0453 | 1.0343 | | visformer_small | 128 | 0.9997 | 1.0023 | 0.7984 | 0.0 | 1.008 | 0.975 | | resmlp_12_224 | 128 | 1.0001 | 0.8543 | 0.6117 | 1.0055 | 0.8056 | 0.8487 | | tnt_s_patch16_224 | 128 | 0.9999 | 0.9993 | 0.0 | 0.0 | 0.0 | 1.4909 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | spnasnet_100 | 2 | pass | pass | pass | 
pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass | | convit_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | pass | pass | 0.0000 | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | fail_accuracy | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass 
| pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | gluon_xception65 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | resnest101e | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | +---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | 
aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 5.8524 | 24.7857 | nan | 676.4224 | 122.4221 | 116.8456 | | twins_pcpvt_base | 64 | 2.0601 | 10.624 | 18.773 | nan | 114.1125 | 110.1489 | | mobilevit_s | 64 | 1.584 | 6.1928 | nan | nan | 95.1363 | 93.3466 | | pnasnet5large | 16 | 4.2518 | 18.1076 | nan | 362.6252 | 88.9917 | 86.7166 | | swin_base_patch4_window7_224 | 64 | 2.5203 | 10.2265 | nan | nan | 75.93 | 73.6605 | | xcit_large_24_p8_224 | 5 | 2.7617 | 13.768 | 26.2982 | nan | 73.706 | 70.1229 | | resnest101e | 64 | 3.0397 | 12.9271 | nan | 283.4243 | 72.014 | 68.7945 | | cait_m36_384 | 4 | 2.6703 | 14.7054 | nan | nan | 66.1667 | 60.8244 | | ghostnet_100 | 128 | 2.6213 | 8.079 | 11.8911 | 164.501 | 64.7761 | 63.8713 | | sebotnet33ts_256 | 64 | 1.57 | 5.2401 | nan | nan | 64.0233 | 62.3675 | | convnext_base | 64 | 1.2701 | 5.3049 | nan | nan | 63.8236 | 62.6481 | | dpn107 | 32 | 3.9276 | 11.7687 | 37.6121 | 175.5623 | 62.2273 | 59.734 | | res2net101_26w_4s | 64 | 2.8757 | 13.2769 | 23.0718 | 262.1833 | 61.2936 | 59.6663 | | fbnetv3_b | 128 | 3.0117 | 9.5552 | 24.949 | 246.3841 | 60.6271 | 58.976 | | eca_halonext26ts | 128 | 1.3348 | 4.4868 | nan | nan | 59.4878 | 57.7169 | | tinynet_a | 128 | 2.0869 | 6.7735 | 17.5518 | 145.6867 | 56.1743 | 54.417 | | res2net50_14w_8s | 128 | 2.6659 | 12.0616 | nan | 260.0086 | 55.9869 | 55.2216 | | jx_nest_base | 32 | 1.6024 | 7.5413 | 13.223 | nan | 54.2388 | 52.5432 | | poolformer_m36 | 64 | 1.7874 | 7.1976 | nan | nan | 51.9165 | 49.2788 | | coat_lite_mini | 128 | 1.0912 | 4.3994 | 6.76 | 97.1531 | 51.0652 | 49.5616 | | eca_botnext26ts_256 | 128 | 1.2702 | 4.345 | nan | nan | 50.114 | 48.534 | | dla102 | 128 | 1.6799 | 7.6835 | nan | 182.1761 | 49.4457 | 47.2264 | | fbnetc_100 | 128 | 1.9408 | 5.7401 | 15.3281 | 110.5064 | 46.478 | 45.6291 | | 
tf_mixnet_l | 128 | 5.58 | 11.5753 | nan | 152.5517 | 45.6904 | 43.7458 | | rexnet_100 | 128 | 1.7572 | 6.3012 | nan | 145.5679 | 45.5413 | 44.1144 | | gluon_xception65 | 32 | 1.7913 | 8.7275 | nan | 159.2909 | 45.4525 | 43.1065 | | tf_efficientnet_b0 | 128 | 1.7274 | 5.9016 | nan | 131.3366 | 45.1084 | 43.7668 | | spnasnet_100 | 128 | 1.9822 | 5.4607 | 15.4829 | 112.8651 | 44.6735 | 42.7039 | | mixnet_l | 128 | 5.2899 | 11.1227 | nan | 155.4238 | 43.8412 | 42.8375 | | botnet26t_256 | 128 | 1.3176 | 3.7007 | 8.557 | nan | 43.4725 | 42.3955 | | adv_inception_v3 | 128 | 1.4822 | 7.0568 | nan | 148.6479 | 43.0204 | 40.9051 | | mobilenetv2_100 | 128 | 1.717 | 4.9404 | 11.932 | 100.0364 | 42.3414 | 41.3316 | | mobilenetv3_large_100 | 128 | 1.6028 | 4.8121 | 11.7585 | 120.577 | 42.1799 | 41.6361 | | inception_v3 | 128 | 1.4705 | 6.8336 | nan | 147.2556 | 42.1677 | 40.5422 | | gluon_inception_v3 | 128 | 1.51 | 6.9013 | nan | 145.8997 | 42.0165 | 40.8396 | | mnasnet_100 | 128 | 1.6354 | 4.6077 | 11.9515 | 86.2567 | 41.4119 | 40.0444 | | swsl_resnext101_32x16d | 32 | 1.6081 | 7.4254 | nan | 128.128 | 41.0206 | 39.1007 | | gmlp_s16_224 | 128 | 1.0062 | 5.2141 | nan | 161.0214 | 39.5373 | 37.5352 | | dm_nfnet_f0 | 128 | 2.0675 | 6.2455 | nan | 155.008 | 38.6006 | 38.2463 | | crossvit_9_240 | 128 | 1.4166 | 6.5346 | 10.7345 | 174.628 | 37.9669 | 36.6945 | | volo_d1_224 | 64 | 1.1908 | 6.1561 | 10.4767 | nan | 36.7275 | 35.345 | | res2next50 | 128 | 1.4619 | 6.7122 | nan | 160.7283 | 35.6244 | 33.9977 | | cspdarknet53 | 64 | 2.1922 | 6.2063 | 17.0048 | 124.3809 | 33.1355 | 31.662 | | gmixer_24_224 | 128 | 1.0298 | 5.9337 | nan | 139.2252 | 31.983 | 29.7032 | | visformer_small | 128 | 0.9682 | 3.4582 | 5.3817 | nan | 31.4936 | 30.5379 | | nfnet_l0 | 128 | 1.8199 | 6.2944 | nan | 134.2882 | 31.4028 | 29.8499 | | regnety_002 | 128 | 1.538 | 4.5458 | 11.3716 | 95.6208 | 31.0369 | 29.4644 | | convit_base | 64 | 1.0813 | 4.687 | nan | nan | 28.2558 | 26.4626 | | lcnet_050 | 128 | 
0.9841 | 2.8494 | 6.6264 | 70.064 | 26.4818 | 24.4527 | | selecsls42b | 128 | 0.7831 | 2.9897 | 4.872 | 73.3126 | 25.2941 | 24.4205 | | repvgg_a2 | 128 | 1.9159 | 5.2289 | 14.0015 | 161.7579 | 25.2209 | 24.1133 | | gernet_l | 128 | 1.8744 | 5.2213 | 13.9048 | 92.365 | 24.7888 | 23.6323 | | convmixer_768_32 | 32 | 1.1454 | 4.9034 | nan | nan | 24.378 | 22.0747 | | mixer_b16_224 | 128 | 0.6659 | 2.7996 | nan | 72.7332 | 22.4455 | 21.5582 | | resmlp_12_224 | 128 | 0.6385 | 2.2918 | 3.9644 | 31.7365 | 22.1846 | 20.5432 | | pit_b_224 | 64 | 0.8516 | 3.9006 | nan | 99.4544 | 19.8462 | 19.0873 | | beit_base_patch16_224 | 64 | 1.1871 | 4.416 | nan | nan | 19.6069 | 18.5526 | | deit_base_distilled_patch16_224 | 64 | 0.8352 | 3.6218 | 5.8694 | 73.2951 | 19.0626 | 18.3648 | | vit_base_patch16_224 | 64 | 0.8247 | 3.5595 | 5.581 | 76.6381 | 18.7264 | 17.9528 | | ese_vovnet19b_dw | 128 | 1.0055 | 2.5696 | 6.0034 | 55.8958 | 16.9724 | 15.96 | | tnt_s_patch16_224 | 128 | 1.5931 | 8.3557 | nan | nan | nan | 31.593 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | gmixer_24_224 | 128 | 0.9951 | 0.9185 | nan | 1.4758 | 1.5552 | 1.6267 | | tinynet_a | 128 | 0.9942 | 0.7796 | 0.2617 | 0.9898 | 1.351 | 1.5508 | | nfnet_l0 | 128 | 0.9931 | 0.8274 | nan | 0.7759 | 1.2915 | 1.4961 | | rexnet_100 | 128 | 0.9935 | 0.7843 | nan | 1.0507 | 1.2618 | 1.4494 | | tf_efficientnet_b0 | 128 | 0.9935 | 0.7688 | nan | 0.9895 | 1.206 | 1.3648 | | mobilevit_s | 64 | 0.9959 | 0.7668 | nan | nan | 1.1791 | 1.359 | | 
pnasnet5large | 16 | 1.069 | 1.011 | nan | 1.1917 | 1.1774 | 1.3459 | | mobilenetv2_100 | 128 | 0.9925 | 0.7621 | 0.3063 | 0.9861 | 1.1757 | 1.2677 | | cait_m36_384 | 4 | 0.9994 | 0.934 | nan | nan | 1.1135 | 1.1669 | | eca_halonext26ts | 128 | 0.9937 | 0.7687 | nan | nan | 1.1107 | 1.333 | | eca_botnext26ts_256 | 128 | 0.9938 | 0.7675 | nan | nan | 1.1105 | 1.3589 | | ghostnet_100 | 128 | 0.9865 | 0.8768 | 0.3273 | 0.9348 | 1.0674 | 1.2274 | | dla102 | 128 | 0.9831 | 0.9169 | nan | 0.953 | 1.0643 | 1.1582 | | poolformer_m36 | 64 | 0.9979 | 0.9511 | nan | nan | 1.0528 | 1.0694 | | ese_vovnet19b_dw | 128 | 0.9923 | 0.8877 | 0.3261 | 0.9303 | 1.039 | 1.1531 | | dm_nfnet_f0 | 128 | 0.9358 | 0.8935 | nan | 0.7593 | 1.0223 | 1.0902 | | resnest101e | 64 | 0.9971 | 0.9519 | nan | 0.9266 | 1.0033 | 1.1011 | | fbnetv3_b | 128 | 0.9932 | 0.7828 | 0.3095 | 0.9108 | 0.993 | 1.0394 | | convmixer_768_32 | 32 | 0.9986 | 0.9854 | nan | nan | 0.9848 | 0.997 | | selecsls42b | 128 | 0.9883 | 0.8896 | 0.337 | 0.8951 | 0.9783 | 1.0949 | | tf_mixnet_l | 128 | 0.9953 | 0.8572 | nan | 0.8574 | 0.9771 | 1.1452 | | gmlp_s16_224 | 128 | 0.9959 | 0.9487 | nan | 0.9833 | 0.9766 | 0.9827 | | beit_base_patch16_224 | 64 | 0.9966 | 0.9545 | nan | nan | 0.9672 | 1.0416 | | mixer_b16_224 | 128 | 0.9952 | 0.94 | nan | 1.4125 | 0.9647 | 1.0505 | | xcit_large_24_p8_224 | 5 | 0.9981 | 0.8982 | 0.3269 | nan | 0.9639 | 1.0517 | | volo_d1_224 | 64 | 0.996 | 0.9213 | 0.2948 | nan | 0.9626 | 1.0614 | | vit_base_patch16_224 | 64 | 0.9962 | 0.9435 | 0.3153 | 1.2305 | 0.961 | 1.0611 | | deit_base_distilled_patch16_224 | 64 | 0.9963 | 0.9441 | 0.3137 | 1.2337 | 0.9569 | 1.0577 | | gluon_xception65 | 32 | 0.9975 | 0.9365 | nan | 0.8929 | 0.9422 | 0.9943 | | mobilenetv3_large_100 | 128 | 0.9876 | 0.8589 | 0.3244 | 0.8112 | 0.9413 | 1.0423 | | convnext_base | 64 | 0.9975 | 0.9169 | nan | nan | 0.9403 | 0.9918 | | spnasnet_100 | 128 | 0.989 | 0.9109 | 0.3309 | 0.8412 | 0.939 | 0.9788 | | hrnet_w18 | 128 | 0.9954 | 
0.9252 | nan | 0.8646 | 0.9388 | 1.0009 | | twins_pcpvt_base | 64 | 0.9976 | 0.9195 | 0.3132 | nan | 0.9369 | 1.074 | | mnasnet_100 | 128 | 0.9877 | 0.9019 | 0.3306 | 0.8279 | 0.9332 | 0.9758 | | res2net101_26w_4s | 64 | 0.9967 | 0.9277 | 0.3243 | 0.8933 | 0.929 | 0.9998 | | adv_inception_v3 | 128 | 0.9902 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0206 | | inception_v3 | 128 | 0.9902 | 0.8617 | nan | 0.8721 | 0.9137 | 1.0206 | | gluon_inception_v3 | 128 | 0.9902 | 0.8617 | nan | 0.8721 | 0.9136 | 1.0206 | | res2next50 | 128 | 0.9951 | 0.9153 | nan | 0.862 | 0.9081 | 1.0071 | | mixnet_l | 128 | 0.9951 | 0.845 | nan | 0.7911 | 0.907 | 1.0619 | | jx_nest_base | 32 | 1.0003 | 0.8968 | 0.2863 | nan | 0.9061 | 1.0576 | | dpn107 | 32 | 0.9985 | 0.9272 | 0.3392 | 0.8943 | 0.9059 | 0.9678 | | cspdarknet53 | 64 | 0.9954 | 0.8528 | 0.316 | 0.8912 | 0.9052 | 1.0578 | | fbnetc_100 | 128 | 0.9891 | 0.8518 | 0.3236 | 0.7446 | 0.9051 | 0.9874 | | visformer_small | 128 | 0.9943 | 0.9381 | 0.3293 | nan | 0.9035 | 0.9909 | | swsl_resnext101_32x16d | 32 | 0.9991 | 0.8973 | nan | 0.8676 | 0.8933 | 0.9945 | | lcnet_050 | 128 | 0.9672 | 0.7521 | 0.3171 | 0.8321 | 0.8842 | 0.9126 | | res2net50_14w_8s | 128 | 0.9952 | 0.9049 | nan | 0.8609 | 0.8824 | 1.0114 | | regnety_002 | 128 | 0.9717 | 0.8104 | 0.3283 | 0.7597 | 0.8622 | 1.0414 | | botnet26t_256 | 128 | 0.9915 | 0.8434 | 0.3165 | nan | 0.8605 | 0.9611 | | swin_base_patch4_window7_224 | 64 | 0.9976 | 0.9288 | nan | nan | 0.8514 | 1.0359 | | sebotnet33ts_256 | 64 | 0.9952 | 0.7085 | nan | nan | 0.8365 | 0.9651 | | pit_b_224 | 64 | 0.9968 | 0.7947 | nan | 1.0452 | 0.8169 | 1.0651 | | gernet_l | 128 | 0.9884 | 0.7892 | 0.32 | 0.7938 | 0.7928 | 0.9925 | | resmlp_12_224 | 128 | 0.9893 | 0.6396 | 0.2199 | 0.8133 | 0.7768 | 0.7845 | | coat_lite_mini | 128 | 1.0049 | 0.8526 | 0.3226 | 0.9857 | 0.7193 | 1.0063 | | convit_base | 64 | 0.9977 | 0.8838 | nan | nan | 0.6848 | 0.8081 | | crossvit_9_240 | 128 | 0.9884 | 0.8656 | 0.282 | 1.1496 | 0.579 | 
0.7469 | | repvgg_a2 | 128 | 0.9867 | 0.8054 | 0.3277 | 0.657 | 0.5321 | 0.8171 | | tnt_s_patch16_224 | 128 | 0.996 | 0.9769 | nan | nan | nan | 0.7096 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 364.9543 | 364.983 | nan | nan | 344.1248 | 344.7035 | | hrnet_w18 | 128 | 416.9164 | 416.5317 | nan | 324.2178 | 300.3656 | 308.174 | | pnasnet5large | 16 | 289.0623 | 289.1281 | nan | 264.9068 | 247.1571 | 251.0074 | | convnext_base | 64 | 263.0282 | 263.2375 | nan | nan | 242.2678 | 241.9491 | | tf_mixnet_l | 128 | 256.5116 | 284.7739 | nan | 231.0685 | 219.6274 | 219.5795 | | swin_base_patch4_window7_224 | 64 | 236.9235 | 241.9457 | nan | nan | 219.5276 | 220.8572 | | mixnet_l | 128 | 247.0739 | 274.8611 | nan | 221.6435 | 213.2408 | 213.7331 | | dla102 | 128 | 269.4221 | 269.1747 | nan | 209.789 | 208.4293 | 210.2269 | | swsl_resnext101_32x16d | 32 | 219.2396 | 219.8738 | nan | 198.1904 | 203.7328 | 210.5139 | | cait_m36_384 | 4 | 216.4413 | 211.0196 | nan | nan | 193.9502 | 196.9494 | | resnest101e | 64 | 229.6236 | 229.7627 | nan | 196.6336 | 179.1017 | 184.3767 | | gluon_inception_v3 | 128 | 226.4534 | 226.5878 | nan | 200.503 | 176.7941 | 179.1665 | | inception_v3 | 128 | 226.5283 | 226.839 | nan | 200.5653 | 176.791 | 178.9429 | | adv_inception_v3 | 128 | 226.5925 | 226.949 | nan | 200.5762 | 176.5945 | 178.8762 | | res2net50_14w_8s | 128 | 229.1057 | 229.4153 | nan | 183.0654 | 174.9796 | 179.398 | | gluon_xception65 | 32 | 182.6818 | 183.5796 | 
nan | 168.8718 | 174.5198 | 176.1945 | | res2next50 | 128 | 206.5445 | 206.7144 | nan | 175.2548 | 163.6573 | 167.999 | | dpn107 | 32 | 191.081 | 192.5602 | 239.0929 | 178.0237 | 158.3459 | 156.7212 | | convit_base | 64 | 196.539 | 196.5806 | nan | nan | 156.9651 | 159.4282 | | poolformer_m36 | 64 | 174.1434 | 174.4587 | nan | nan | 154.4731 | 156.8458 | | gernet_l | 128 | 165.0785 | 165.2618 | 193.8264 | 146.4102 | 153.2905 | 153.5245 | | coat_lite_mini | 128 | 191.2439 | 194.8941 | 229.4872 | 177.8907 | 152.9867 | 154.3683 | | mixer_b16_224 | 128 | 158.1094 | 161.7932 | nan | 180.2299 | 150.8637 | 151.4858 | | dm_nfnet_f0 | 128 | 206.3302 | 206.1185 | nan | 180.6064 | 143.8748 | 149.1227 | | pit_b_224 | 64 | 158.318 | 158.5385 | nan | 153.3767 | 136.0492 | 137.075 | | eca_halonext26ts | 128 | 168.9988 | 214.7203 | nan | nan | 134.9935 | 135.699 | | gmlp_s16_224 | 128 | 151.9741 | 160.0814 | nan | 146.6642 | 130.0287 | 130.9741 | | eca_botnext26ts_256 | 128 | 163.215 | 208.9651 | nan | nan | 129.4263 | 129.9051 | | nfnet_l0 | 128 | 175.7541 | 222.9676 | nan | 159.7093 | 128.894 | 133.2002 | | res2net101_26w_4s | 64 | 151.7631 | 151.89 | 196.0614 | 138.4119 | 127.5719 | 131.7363 | | visformer_small | 128 | 128.115 | 128.0941 | 160.8298 | nan | 127.4066 | 131.4822 | | twins_pcpvt_base | 64 | 137.2853 | 137.3727 | 182.6034 | nan | 125.7299 | 129.7316 | | fbnetv3_b | 128 | 162.4339 | 163.0227 | 205.986 | 127.124 | 124.7635 | 123.2205 | | botnet26t_256 | 128 | 152.2055 | 152.7444 | 190.7205 | nan | 122.5998 | 121.4068 | | beit_base_patch16_224 | 64 | 128.4894 | 130.9318 | nan | nan | 120.4033 | 121.4405 | | gmixer_24_224 | 128 | 146.3047 | 180.771 | nan | 140.304 | 117.9067 | 120.6567 | | volo_d1_224 | 64 | 153.2234 | 153.8362 | 191.5709 | nan | 114.2712 | 115.6038 | | vit_base_patch16_224 | 64 | 118.9048 | 119.1472 | 155.2282 | 124.9761 | 113.8654 | 115.018 | | deit_base_distilled_patch16_224 | 64 | 119.6886 | 120.0077 | 156.4912 | 121.9975 | 113.7365 | 114.647 | | 
repvgg_a2 | 128 | 126.8952 | 127.6616 | 146.3533 | 107.7531 | 109.4986 | 109.5878 | | cspdarknet53 | 64 | 130.2386 | 130.7376 | 169.5338 | 107.3792 | 106.5293 | 105.4675 | | tf_efficientnet_b0 | 128 | 133.897 | 166.8934 | nan | 112.4995 | 105.1175 | 104.4808 | | xcit_large_24_p8_224 | 5 | 135.6696 | 137.7723 | 175.4112 | nan | 101.0928 | 103.7519 | | mobilevit_s | 64 | 116.9224 | 150.4377 | nan | nan | 99.394 | 99.3262 | | jx_nest_base | 32 | 121.1687 | 122.0531 | 166.0797 | nan | 99.151 | 100.9333 | | fbnetc_100 | 128 | 123.171 | 123.8397 | 151.7447 | 95.7404 | 98.4357 | 97.1598 | | rexnet_100 | 128 | 119.1503 | 142.0768 | nan | 99.7992 | 96.5981 | 96.1082 | | tinynet_a | 128 | 110.1328 | 136.9107 | 171.2114 | 92.5032 | 89.4353 | 88.6476 | | sebotnet33ts_256 | 64 | 114.3463 | 138.8066 | nan | nan | 89.1678 | 89.3116 | | resmlp_12_224 | 128 | 71.1278 | 83.2498 | 116.3323 | 70.7514 | 88.4623 | 83.8131 | | spnasnet_100 | 128 | 105.8021 | 106.3892 | 131.4471 | 83.1601 | 85.5594 | 83.7784 | | ese_vovnet19b_dw | 128 | 99.5261 | 99.7222 | 131.1557 | 84.8114 | 85.538 | 85.2615 | | mnasnet_100 | 128 | 98.5938 | 98.989 | 121.5525 | 75.9472 | 77.9418 | 76.783 | | crossvit_9_240 | 128 | 98.3245 | 98.4579 | 129.4003 | 94.4998 | 77.654 | 79.3055 | | selecsls42b | 128 | 89.6439 | 89.674 | 109.8597 | 73.7414 | 74.9165 | 75.6366 | | mobilenetv2_100 | 128 | 97.7729 | 98.2995 | 133.7161 | 73.894 | 73.7275 | 72.8836 | | ghostnet_100 | 128 | 114.53 | 117.6222 | 138.7011 | 90.023 | 64.6086 | 65.753 | | mobilenetv3_large_100 | 128 | 85.4723 | 86.0523 | 108.0978 | 64.1149 | 63.7019 | 62.8672 | | regnety_002 | 128 | 52.7547 | 51.7288 | 60.4367 | 54.3808 | 36.2979 | 39.4234 | | lcnet_050 | 128 | 38.2391 | 38.6247 | 47.7793 | 28.1054 | 22.8212 | 23.9672 | | tnt_s_patch16_224 | 128 | 470.5059 | 470.8671 | nan | nan | nan | 315.6218 | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

../test-dynamo-runner-logs-10/huggingface_float32.png : ![](https://i.imgur.com/f9i51PI.png)

../test-dynamo-runner-logs-10/timm_models_float32.png : ![](https://i.imgur.com/VXUSxtR.png)

../test-dynamo-runner-logs-10/torchbench_float32.png : ![](https://i.imgur.com/pRwsviT.png)

williamwen42 commented 2 years ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
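As a rough illustration of how a per-suite geometric mean speedup could be computed once models that fail accuracy are dropped, consider the sketch below. The function name `geomean_speedup`, the dictionary layout, the set of statuses counted as passing, and the handling of non-positive speedups are all assumptions made for illustration; this is not the dashboard's actual code.

```python
import math

# Assumption: statuses counted as passing; the real dashboard may differ.
PASSING = {"pass", "pass_due_to_skip"}


def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups over models that pass accuracy.

    speedups: model name -> speedup relative to native PyTorch (eager = 1.0)
    accuracy: model name -> accuracy status string
    """
    kept = [
        s for name, s in speedups.items()
        # Assumption: a non-positive speedup marks a model that did not run.
        if accuracy.get(name) in PASSING and s > 0
    ]
    if not kept:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean; it avoids a long running product.
    return math.exp(sum(math.log(s) for s in kept) / len(kept))


# Hypothetical inputs, not taken from the tables in this report.
speedups = {"model_a": 2.0, "model_b": 0.5, "model_c": 3.0}
accuracy = {"model_a": "pass", "model_b": "pass", "model_c": "fail_accuracy"}
result = geomean_speedup(speedups, accuracy)
print(f"{result:.2f}x")
```

Here `model_c` is excluded because it fails accuracy, so only the 2.0x and 0.5x entries contribute to the mean.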

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 53/54 | 100%, 42/42 | 100%, 61/61 |
|       aot_eager        | 98%, 53/54 | 100%, 42/42 | 95%, 58/61  |
|     aot_cudagraphs     | 89%, 48/54 | 86%, 36/42  | 90%, 55/61  |
|    nvprims_nvfuser     | 61%, 33/54 |  12%, 5/42  | 54%, 33/61  |
|        inductor        | 83%, 45/54 | 93%, 39/42  | 92%, 56/61  |
| inductor_no_cudagraphs | 87%, 47/54 | 93%, 39/42  | 92%, 56/61  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.22x    |    1.12x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.03x    |    1.09x    |
|        inductor        |   1.80x    |    1.73x    |    1.40x    |
| inductor_no_cudagraphs |   1.37x    |    1.51x    |    1.35x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.37    |    2.91     |    2.14     |
|       aot_eager        |    6.99    |    10.37    |    8.51     |
|     aot_cudagraphs     |   11.25    |    17.92    |    16.08    |
|    nvprims_nvfuser     |   67.63    |   131.40    |   148.18    |
|        inductor        |   34.25    |    38.38    |    43.61    |
| inductor_no_cudagraphs |   34.42    |    33.66    |    41.60    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.85x    |    0.90x    |    0.87x    |
|     aot_cudagraphs     |   0.41x    |    0.39x    |    0.33x    |
|    nvprims_nvfuser     |   0.85x    |    1.04x    |    0.87x    |
|        inductor        |   0.83x    |    0.85x    |    0.94x    |
| inductor_no_cudagraphs |   0.96x    |    1.01x    |    1.05x    |
+------------------------+------------+-------------+-------------+
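The report does not spell out how the compression ratio is computed. One plausible reading, given that higher is better and eager sits near 1.0x, is the eager peak memory divided by the compiled backend's peak memory. The sketch below encodes that assumption only; `compression_ratio` and its arguments are hypothetical names.

```python
def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Assumed definition: eager peak memory / compiled peak memory.

    A value below 1.0 means the compiled run used more peak memory than eager.
    """
    if compiled_peak_bytes <= 0:
        return float("nan")
    return eager_peak_bytes / compiled_peak_bytes


# Hypothetical peaks: 8 GiB in eager mode, 10 GiB with a compiled backend.
ratio = compression_ratio(8 * 1024**3, 10 * 1024**3)
print(f"{ratio:.2f}x")
```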

Warnings

We flag models where:
- speedup < 0.95x
- compilation latency > 120 sec.
- compression ratio < 0.9

Performance speedup warnings
~~~
+-------------+-----------------------+----------+------------------------+
| suite       | name                  | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  | hf_GPT2_large         | 0.0      | 1.8647                 |
| torchbench  | tacotron2             | 0.0      | 0.878                  |
| torchbench  | dlrm                  | 0.0      | 0.0                    |
| torchbench  | hf_Longformer         | 0.0      | 0.0                    |
| torchbench  | moco                  | 0.0      | 0.0                    |
| huggingface | AllenaiLongformerBase | 0.0      | 0.0                    |
| timm_models | convnext_base         | 0.6557   | 0.6462                 |
| timm_models | eca_halonext26ts      | 0.0      | 1.1782                 |
+-------------+-----------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-------------------+----------+------------------------+
| suite       | name              | inductor | inductor_no_cudagraphs |
+-------------+-------------------+----------+------------------------+
| torchbench  | yolov3            | 409.9545 | 414.0905               |
| torchbench  | timm_efficientdet | 145.611  | 141.3183               |
| torchbench  | hf_T5_large       | 144.9492 | 140.076                |
| timm_models | hrnet_w18         | 147.0405 | 140.208                |
| timm_models | twins_pcpvt_base  | 132.4825 | 127.3528               |
+-------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+----------------------------------+----------+------------------------+
| suite       | name                             | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench  | timm_vision_transformer_large    | 0.879    | 1.0245                 |
| torchbench  | BERT_pytorch                     | 0.8771   | 1.0948                 |
| torchbench  | timm_resnest                     | 0.8759   | 0.9953                 |
| torchbench  | densenet121                      | 0.8753   | 1.0051                 |
| torchbench  | squeezenet1_1                    | 0.8735   | 1.0608                 |
| torchbench  | hf_Bert                          | 0.8728   | 0.942                  |
| torchbench  | shufflenet_v2_x1_0               | 0.8692   | 0.9802                 |
| torchbench  | resnet50                         | 0.8658   | 0.885                  |
| torchbench  | hf_T5_large                      | 0.8541   | 0.8541                 |
| torchbench  | hf_DistilBert                    | 0.8348   | 0.9049                 |
| torchbench  | hf_BigBird                       | 0.8122   | 1.096                  |
| torchbench  | fastNLP_Bert                     | 0.8013   | 1.0681                 |
| torchbench  | alexnet                          | 0.7973   | 1.0079                 |
| torchbench  | hf_Bart                          | 0.7933   | 0.9724                 |
| torchbench  | mobilenet_v3_large               | 0.791    | 0.8143                 |
| torchbench  | timm_vovnet                      | 0.7799   | 0.8875                 |
| torchbench  | pytorch_stargan                  | 0.7783   | 0.8847                 |
| torchbench  | resnext50_32x4d                  | 0.7644   | 0.7753                 |
| torchbench  | vgg16                            | 0.7633   | 1.0588                 |
| torchbench  | mnasnet1_0                       | 0.7541   | 0.7741                 |
| torchbench  | drq                              | 0.752    | 0.9256                 |
| torchbench  | LearningToPaint                  | 0.7295   | 0.925                  |
| torchbench  | soft_actor_critic                | 0.7295   | 1.0367                 |
| torchbench  | timm_vision_transformer          | 0.7133   | 0.7227                 |
| torchbench  | resnet18                         | 0.6102   | 0.6257                 |
| torchbench  | lennard_jones                    | 0.564    | 0.9991                 |
| torchbench  | nvidia_deeprecommender           | 0.5596   | 0.5596                 |
| torchbench  | hf_Reformer                      | 0.5295   | 0.9885                 |
| torchbench  | functorch_dp_cifar10             | 0.4481   | 0.4691                 |
| torchbench  | pytorch_struct                   | 0.4235   | 0.4353                 |
| torchbench  | dcgan                            | 0.2123   | 0.2137                 |
| torchbench  | tacotron2                        | nan      | 0.4114                 |
| huggingface | MegatronBertForQuestionAnswering | 0.893    | 1.0053                 |
| huggingface | MegatronBertForCausalLM          | 0.8919   | 1.0207                 |
| huggingface | DistilBertForQuestionAnswering   | 0.89     | 0.9848                 |
| huggingface | BertForMaskedLM                  | 0.8834   | 0.9285                 |
| huggingface | RobertaForCausalLM               | 0.8829   | 0.9282                 |
| huggingface | TrOCRForCausalLM                 | 0.8816   | 0.9425                 |
| huggingface | MBartForConditionalGeneration    | 0.8755   | 1.0595                 |
| huggingface | MT5ForConditionalGeneration      | 0.875    | 0.919                  |
| huggingface | OPTForCausalLM                   | 0.8727   | 0.9449                 |
| huggingface | PLBartForConditionalGeneration   | 0.8523   | 0.9882                 |
| huggingface | DistilBertForMaskedLM            | 0.8215   | 0.8801                 |
| huggingface | BigBird                          | 0.8178   | 1.0597                 |
| huggingface | CamemBert                        | 0.8065   | 0.9306                 |
| huggingface | XGLMForCausalLM                  | 0.8055   | 0.9516                 |
| huggingface | DistillGPT2                      | 0.8048   | 0.9949                 |
| huggingface | Speech2Text2ForCausalLM          | 0.8039   | 0.898                  |
| huggingface | PLBartForCausalLM                | 0.7975   | 0.8675                 |
| huggingface | ElectraForCausalLM               | 0.7949   | 0.8607                 |
| huggingface | YituTechConvBert                 | 0.7909   | 0.9314                 |
| huggingface | BlenderbotSmallForCausalLM       | 0.778    | 0.859                  |
| huggingface | M2M100ForConditionalGeneration   | 0.7619   | 0.9892                 |
| huggingface | MobileBertForMaskedLM            | 0.5931   | 0.7994                 |
| huggingface | MobileBertForQuestionAnswering   | 0.4995   | 0.635                  |
| huggingface | DebertaForMaskedLM               | 0.409    | 1.0248                 |
| huggingface | DebertaForQuestionAnswering      | 0.3071   | 1.1931                 |
| timm_models | res2net101_26w_4s                | 0.8977   | 0.973                  |
| timm_models | inception_v3                     | 0.8975   | 1.0248                 |
| timm_models | gluon_inception_v3               | 0.8975   | 1.0248                 |
| timm_models | adv_inception_v3                 | 0.8975   | 1.0248                 |
| timm_models | gluon_xception65                 | 0.8975   | 0.9763                 |
| timm_models | fbnetc_100                       | 0.8973   | 0.9876                 |
| timm_models | hrnet_w18                        | 0.8969   | 1.0032                 |
| timm_models | mixer_b16_224                    | 0.8927   | 0.963                  |
| timm_models | selecsls42b                      | 0.8926   | 0.9897                 |
| timm_models | vit_base_patch16_224             | 0.8877   | 0.8929                 |
| timm_models | deit_base_distilled_patch16_224  | 0.8872   | 0.8923                 |
| timm_models | spnasnet_100                     | 0.8795   | 0.9819                 |
| timm_models | res2net50_14w_8s                 | 0.877    | 0.9738                 |
| timm_models | res2next50                       | 0.8719   | 0.9671                 |
| timm_models | mnasnet_100                      | 0.871    | 0.9804                 |
| timm_models | mixnet_l                         | 0.8701   | 1.0089                 |
| timm_models | gernet_l                         | 0.8619   | 0.9858                 |
| timm_models | cspdarknet53                     | 0.8607   | 1.0102                 |
| timm_models | botnet26t_256                    | 0.8503   | 0.9434                 |
| timm_models | lcnet_050                        | 0.8449   | 0.9432                 |
| timm_models | regnety_002                      | 0.8371   | 1.0078                 |
| timm_models | convnext_base                    | 0.806    | 0.9865                 |
| timm_models | resmlp_12_224                    | 0.7981   | 0.8121                 |
| timm_models | sebotnet33ts_256                 | 0.745    | 0.8293                 |
| timm_models | coat_lite_mini                   | 0.7194   | 1.0197                 |
| timm_models | crossvit_9_240                   | 0.7141   | 0.9624                 |
| timm_models | jx_nest_base                     | 0.6644   | 0.8514                 |
| timm_models | swin_base_patch4_window7_224     | 0.6295   | 0.7419                 |
| timm_models | repvgg_a2                        | 0.5534   | 0.8298                 |
+-------------+----------------------------------+----------+------------------------+
~~~
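The three warning thresholds listed in this section can be expressed as a small per-model check. This is only a sketch: `flag_model`, the constant names, and the flag labels are hypothetical, and the actual dashboard script may apply the rules differently (for example, per compiler column). The two sample calls use the inductor values for yolov3 and resnet18 from the torchbench amp tables in this report.

```python
# Thresholds as stated in the Warnings section.
SPEEDUP_MIN = 0.95          # flag if speedup is below this
COMPILE_MAX_SEC = 120.0     # flag if compilation latency exceeds this
COMPRESSION_MIN = 0.9       # flag if compression ratio is below this


def flag_model(speedup, compile_sec, compression):
    """Return the list of warning categories a single model triggers.

    NaN inputs never trigger a flag, because comparisons with NaN are False.
    """
    flags = []
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_sec > COMPILE_MAX_SEC:
        flags.append("compilation_latency")
    if compression < COMPRESSION_MIN:
        flags.append("compression_ratio")
    return flags


# yolov3 (inductor): speedup 1.0897, compile 409.9545 s, compression 0.9063
yolov3_flags = flag_model(1.0897, 409.9545, 0.9063)
# resnet18 (inductor): speedup 2.7669, compile 11.5609 s, compression 0.6102
resnet18_flags = flag_model(2.7669, 11.5609, 0.6102)
print(yolov3_flags, resnet18_flags)
```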

Metrics over time

../test-dynamo-runner-logs-12/passrate_over_time.png : ![](https://i.imgur.com/gI4oQBf.png)

../test-dynamo-runner-logs-12/geomean_over_time.png : ![](https://i.imgur.com/u70MHrC.png)

Accuracy Regressions

For each relevant compiler, we compare the two most recent reports (that actually run the compiler) to find models whose accuracy tests previously passed and now fail.

No accuracy regressions found.
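The comparison described here amounts to a diff between two accuracy reports. A minimal sketch follows, assuming each report is a mapping from model name to status string for one compiler; the function name `accuracy_regressions`, the report layout, and the sample data are hypothetical and not taken from the dashboard tooling.

```python
def accuracy_regressions(previous, current):
    """Return models that passed in the previous report but not in the current one.

    previous, current: model name -> accuracy status (e.g. "pass", "fail_accuracy")
    Models missing from the previous report are ignored, since there is no
    earlier success to regress from.
    """
    return sorted(
        name
        for name, status in current.items()
        if previous.get(name) == "pass" and status != "pass"
    )


# Hypothetical reports for a single compiler.
previous = {"model_a": "pass", "model_b": "pass", "model_c": "fail_to_run"}
current = {"model_a": "pass", "model_b": "fail_accuracy", "model_c": "fail_to_run", "model_d": "fail_to_run"}
regressed = accuracy_regressions(previous, current)
print(regressed or "No accuracy regressions found.")
```

In this example only `model_b` counts as a regression: `model_c` was already failing, and `model_d` has no earlier result to compare against.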

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0013 | 0.9244 | 2.5265 | 0.7568 | 6.0402 | 1.3272 | | functorch_dp_cifar10 | 64 | 0.9997 | 0.9524 | 2.3165 | 0.0 | 4.9474 | 0.9801 | | timm_efficientdet | 1 | 0.9848 | 0.801 | 2.1561 | 0.0 | 4.7302 | 1.5324 | | resnext50_32x4d | 8 | 0.9996 | 0.9602 | 1.8998 | 0.739 | 3.5177 | 1.2656 | | timm_vision_transformer | 8 | 1.0044 | 0.8373 | 1.8014 | 0.5969 | 3.3942 | 1.5493 | | BERT_pytorch | 16 | 1.0148 | 0.8418 | 1.5678 | 0.8769 | 3.3786 | 2.305 | | mobilenet_v3_large | 32 | 1.0036 | 1.0107 | 1.5136 | 0.7664 | 3.0732 | 1.3403 | | dcgan | 32 | 0.9855 | 0.918 | 1.673 | 0.7112 | 2.8788 | 1.0461 | | resnet18 | 16 | 1.0027 | 0.9975 | 1.6255 | 0.7984 | 2.7669 | 1.2575 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9968 | 1.0006 | 2.1273 | 0.0 | 2.7626 | 1.5765 | | mnasnet1_0 | 32 | 0.9981 | 1.0159 | 1.255 | 0.7607 | 2.6381 | 1.3589 | | hf_T5_large | 2 | 1.0203 | 0.8521 | 0.0 | 0.0 | 2.5854 | 2.1194 | | squeezenet1_1 | 32 | 0.9993 | 0.9617 | 1.4806 | 0.7231 | 2.4363 | 1.2946 | | hf_Albert | 8 | 1.0011 | 0.9563 | 0.7745 | 0.0 | 2.3301 | 2.2724 | | drq | 1 | 1.0036 | 0.8183 | 1.9651 | 0.606 | 2.2456 | 1.1662 | | timm_efficientnet | 32 | 0.9593 | 0.8087 | 1.0803 | 0.6803 | 2.1141 | 1.2811 | | pytorch_struct | 200 | 0.9838 | 0.7462 | 1.0256 | 0.5988 | 2.1094 | 1.2766 | | lennard_jones | 1000 | 0.9791 | 0.7673 | 1.2849 | 0.4702 | 2.0693 | 1.0591 | | resnet152 | 32 | 1.0013 | 0.998 | 1.336 | 0.0 | 2.0465 | 1.2915 | | hf_Bert | 4 | 1.0388 | 0.8618 | 0.9373 | 0.0 | 2.0423 | 1.8388 | | hf_GPT2 | 4 | 1.0207 | 0.9812 | 0.8235 | 0.2903 | 1.9367 | 1.9045 | | timm_resnest | 32 | 
1.0036 | 1.0236 | 0.8381 | 0.9601 | 1.9163 | 1.6412 | | hf_T5 | 8 | 0.9994 | 0.9268 | 0.0 | 1.3491 | 1.8692 | 1.8706 | | LearningToPaint | 96 | 1.0035 | 1.0152 | 1.1845 | 0.8366 | 1.8434 | 1.3155 | | resnet50 | 32 | 1.0027 | 1.0133 | 1.0399 | 0.8018 | 1.7812 | 1.3571 | | hf_Bart | 4 | 1.0136 | 0.8225 | 0.8956 | 0.0 | 1.7309 | 1.6742 | | soft_actor_critic | 256 | 0.9887 | 0.7377 | 1.3219 | 0.5639 | 1.7086 | 1.0422 | | shufflenet_v2_x1_0 | 128 | 0.999 | 1.0153 | 0.9827 | 0.8516 | 1.6964 | 1.411 | | speech_transformer | 32 | 1.0064 | 0.8374 | 1.9738 | 0.0 | 1.6566 | 1.6858 | | mobilenet_v2 | 96 | 1.0001 | 0.9886 | 0.7619 | 1.0371 | 1.5606 | 1.5177 | | timm_nfnet | 128 | 0.9993 | 1.0004 | 0.8786 | 0.9189 | 1.5042 | 1.4323 | | attention_is_all_you_need_pytorch | 256 | 1.0116 | 0.8946 | 0.8337 | 0.0 | 1.5029 | 1.4721 | | fastNLP_Bert | 6 | 0.9985 | 0.8929 | 0.7669 | 0.0 | 1.4998 | 1.447 | | hf_DistilBert | 8 | 1.0017 | 0.9726 | 0.7436 | 0.3672 | 1.4902 | 1.4619 | | pytorch_stargan | 16 | 0.997 | 1.0956 | 1.037 | 0.0 | 1.4575 | 1.4 | | pytorch_unet | 1 | 0.9996 | 0.992 | 0.8644 | 1.0833 | 1.3606 | 1.325 | | timm_regnet | 32 | 0.9783 | 0.9425 | 0.9105 | 0.7862 | 1.3415 | 1.2274 | | timm_vovnet | 32 | 0.9231 | 0.8869 | 0.8692 | 0.805 | 1.3111 | 1.1493 | | vgg16 | 64 | 0.9996 | 0.9969 | 0.8573 | 0.9725 | 1.2682 | 1.2598 | | Background_Matting | 4 | 1.0001 | 1.0178 | 0.8972 | 1.058 | 1.2381 | 1.2194 | | Super_SloMo | 6 | 0.9999 | 0.9951 | 0.8854 | 0.0 | 1.2246 | 1.1996 | | alexnet | 128 | 0.9991 | 0.9965 | 0.8146 | 0.9277 | 1.2097 | 1.2065 | | hf_Reformer | 4 | 0.9989 | 0.9999 | 0.9915 | 0.6445 | 1.1757 | 1.1766 | | hf_BigBird | 2 | 0.9877 | 0.9075 | 1.052 | 0.8259 | 1.1481 | 1.0203 | | yolov3 | 16 | 0.9996 | 0.9907 | 0.8048 | 0.0 | 1.0897 | 1.0667 | | timm_vision_transformer_large | 8 | 1.0 | 0.9899 | 0.0 | 0.0 | 1.0857 | 1.0714 | | tts_angular | 64 | 0.9819 | 0.9398 | 0.9844 | 0.9606 | 1.0249 | 1.031 | | demucs | 4 | 1.0002 | 1.0009 | 1.0002 | 0.9995 | 0.9971 | 0.9977 | | 
nvidia_deeprecommender | 256 | 0.9989 | 0.9963 | 0.6974 | 1.0076 | 0.9897 | 1.0315 | | hf_GPT2_large | 4 | 1.0003 | 0.9901 | 0.0 | 0.0 | 0.0 | 1.8647 | | tacotron2 | 64 | 0.9828 | 0.7611 | 1.0078 | 0.5961 | 0.0 | 0.878 | | dlrm | 2048 | 1.0613 | 1.143 | 0.0 | 1.1546 | 0.0 | 0.0 | | hf_Longformer | 2 | 0.946 | 0.8629 | 0.8788 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | dlrm | 2 | pass | pass | 
fail_to_run | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_to_run | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_BigBird | 2 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass 
| pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | hf_Longformer | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | 0.0000 | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 2.986 | 8.2099 | 11.889 | nan | 409.9545 | 414.0905 | | timm_efficientdet | 1 | 20.0265 | 36.9998 | 76.7397 | nan | 145.611 | 141.3183 | | hf_T5_large | 2 | 14.5046 | 40.1668 | nan | nan | 144.9492 | 140.076 | | timm_vision_transformer_large | 8 | 2.8501 | 15.4164 | nan | nan | 70.2057 | 68.3106 | | resnet152 | 32 | 2.5516 | 13.2994 | 22.0977 | nan | 54.0068 | 51.6363 | | 
densenet121 | 4 | 2.3831 | 11.8802 | 18.9995 | 240.6643 | 52.2064 | 50.7329 | | hf_BigBird | 2 | 8.4521 | 15.3734 | 30.2618 | 118.7493 | 45.7988 | 30.6311 | | attention_is_all_you_need_pytorch | 256 | 1.3554 | 7.1003 | 11.3393 | nan | 40.2996 | 39.2159 | | timm_resnest | 32 | 0.6045 | 2.5325 | 3.8467 | 67.2264 | 39.5142 | 37.644 | | hf_Bart | 4 | 1.9627 | 8.946 | 13.7055 | nan | 36.0664 | 34.6433 | | BERT_pytorch | 16 | 1.7385 | 7.6326 | 11.4585 | 136.1247 | 36.0201 | 35.9605 | | timm_vision_transformer | 8 | 0.9797 | 4.5362 | 6.678 | 81.4047 | 35.7311 | 35.5038 | | speech_transformer | 32 | 1.9508 | 8.7873 | 35.6094 | nan | 35.1223 | 34.787 | | fastNLP_Bert | 6 | 1.7911 | 7.1716 | 11.5217 | nan | 32.8152 | 30.6284 | | timm_nfnet | 128 | 2.0722 | 7.0472 | 10.4867 | 166.4694 | 32.132 | 31.8605 | | hf_T5 | 8 | 2.5845 | 9.0702 | nan | 106.878 | 32.0813 | 30.9947 | | timm_regnet | 32 | 2.386 | 8.114 | 20.0107 | 140.3306 | 29.0709 | 27.4372 | | pytorch_stargan | 16 | 0.4181 | 1.9614 | 2.9017 | nan | 27.5989 | 27.0541 | | timm_efficientnet | 32 | 1.848 | 6.6989 | 15.6598 | 148.6479 | 27.4019 | 26.8408 | | mobilenet_v3_large | 32 | 0.9677 | 4.882 | 7.2605 | 119.9712 | 25.9116 | 26.1311 | | hf_Bert | 4 | 1.7987 | 7.1094 | 10.24 | nan | 24.3304 | 23.2744 | | pytorch_struct | 200 | 0.2811 | 0.8687 | 1.4828 | 8.0336 | 23.1547 | 22.9346 | | functorch_dp_cifar10 | 64 | 0.3215 | 1.4011 | 2.1491 | nan | 23.0946 | 22.9286 | | hf_Albert | 8 | 1.5578 | 6.5116 | 10.1464 | nan | 22.8098 | 21.6366 | | mnasnet1_0 | 32 | 0.8922 | 4.3141 | 6.4574 | 90.4439 | 21.5499 | 20.8696 | | resnet50 | 32 | 0.9393 | 4.4181 | 6.8179 | 96.5518 | 20.7144 | 19.9805 | | hf_GPT2 | 4 | 1.7289 | 6.6584 | 9.5573 | 113.9487 | 20.7112 | 19.7073 | | resnext50_32x4d | 8 | 0.9675 | 4.4711 | 6.8554 | 83.0174 | 20.6176 | 20.0897 | | shufflenet_v2_x1_0 | 128 | 1.0059 | 4.9223 | 7.5996 | 105.3812 | 20.4602 | 20.1959 | | timm_vovnet | 32 | 1.5844 | 4.4433 | 9.962 | 70.3932 | 20.26 | 19.6755 | | hf_Reformer | 4 | 1.7736 
| 3.22 | 6.3397 | 17.9406 | 20.231 | 16.6282 | | mobilenet_v2 | 96 | 0.8701 | 4.4983 | 7.3185 | 114.7728 | 19.6414 | 19.1149 | | Background_Matting | 4 | 0.9131 | 4.28 | 6.523 | 90.0494 | 18.6434 | 17.5557 | | Super_SloMo | 6 | 0.9449 | 3.9538 | 5.7409 | nan | 16.8613 | 16.5423 | | hf_DistilBert | 8 | 0.7907 | 3.5131 | 5.98 | 64.96 | 15.3186 | 14.7961 | | resnet18 | 16 | 0.4469 | 1.7883 | 2.6099 | 38.1885 | 11.5609 | 11.3943 | | dcgan | 32 | 0.1768 | 0.4253 | 0.6639 | 5.106 | 10.7445 | 10.2986 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4606 | 1.997 | 3.0001 | nan | 9.3723 | 9.0812 | | pytorch_unet | 1 | 0.4009 | 1.7625 | 2.7148 | 38.8051 | 8.3979 | 8.4706 | | LearningToPaint | 96 | 0.4655 | 1.9306 | 2.8636 | 46.8568 | 8.1161 | 7.9582 | | squeezenet1_1 | 32 | 0.2708 | 0.9533 | 1.3976 | 7.0721 | 4.8464 | 4.6597 | | drq | 1 | 0.3197 | 0.6415 | 0.9465 | 6.3879 | 4.4333 | 3.7127 | | vgg16 | 64 | 0.2089 | 0.6521 | 1.108 | 5.6154 | 4.3608 | 3.9369 | | nvidia_deeprecommender | 256 | 0.2304 | 0.5217 | 0.8362 | 5.7489 | 3.6735 | 3.6354 | | soft_actor_critic | 256 | 0.2148 | 0.3631 | 0.6025 | 2.8431 | 3.5916 | 2.9843 | | alexnet | 128 | 0.1714 | 0.4391 | 0.7286 | 4.9607 | 3.4126 | 3.238 | | lennard_jones | 1000 | 0.16 | 0.3458 | 0.5274 | 2.9686 | 2.2194 | 2.0216 | | tts_angular | 64 | 0.2327 | 0.2802 | 0.4136 | 1.5294 | 2.0285 | 1.8136 | | demucs | 4 | 0.3516 | 0.3561 | 0.3483 | 0.3554 | 0.2585 | 0.2609 | | tacotron2 | 64 | 18.1878 | 32.6253 | 49.6128 | 108.9455 | nan | 66.9187 | | hf_GPT2_large | 4 | 5.7483 | 19.903 | nan | nan | nan | 58.0186 | | dlrm | 2048 | 0.4757 | 0.8648 | nan | 4.8303 | nan | nan | | hf_Longformer | 2 | 6.5165 | 14.4579 | 58.201 | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2716 | 0.4638 | 1.2042 | 1.2318 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 1.0606 | 1.1512 | | Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3843 | nan | 1.0541 | 1.3039 | | timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3556 | 0.4815 | 1.0334 | 1.1302 | | hf_Albert | 8 | 0.9814 | 0.936 | 0.3267 | nan | 1.0313 | 1.4693 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3513 | nan | 1.005 | 1.1086 | | timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3083 | nan | 0.9991 | 1.0312 | | Background_Matting | 4 | 1.0142 | 0.9624 | 0.3723 | 0.9771 | 0.9916 | 1.0426 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | hf_GPT2 | 4 | 0.9706 | 0.8847 | 0.38 | 1.1182 | 0.9649 | 1.1243 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8754 | 0.4232 | nan | 0.9506 | 1.0214 | | timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.8027 | 0.9345 | 1.0307 | | hf_T5 | 8 | 0.9678 | 0.9331 | nan | 1.014 | 0.9304 | 1.2458 | | resnet152 | 32 | 0.9937 | 0.8956 | 0.3632 | nan | 0.9125 | 0.9398 | | pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3571 | 0.8496 | 0.9111 | 1.0853 | | yolov3 | 16 | 0.9908 | 0.8381 | 0.3536 | nan | 0.9063 | 1.0466 | | speech_transformer | 32 | 1.0017 | 0.9174 | 0.3318 | nan | 0.9025 | 0.9069 | | timm_vision_transformer_large | 8 | 0.9974 | 0.8357 | nan | nan | 0.879 | 1.0245 | | BERT_pytorch | 16 | 1.0003 | 0.8825 | 0.4 | 1.1061 | 0.8771 | 1.0948 | | timm_resnest | 32 | 0.9868 | 0.8711 | 0.3481 | 0.8451 | 0.8759 | 0.9953 | | densenet121 | 4 | 0.9857 | 0.8678 | 
0.3673 | 0.8452 | 0.8753 | 1.0051 | | squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.8735 | 1.0608 | | hf_Bert | 4 | 0.9844 | 0.8753 | 0.3903 | nan | 0.8728 | 0.942 | | shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8692 | 0.9802 | | resnet50 | 32 | 0.9907 | 0.8629 | 0.3562 | 0.7806 | 0.8658 | 0.885 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 | | hf_DistilBert | 8 | 0.9505 | 0.8806 | 0.3413 | 1.0625 | 0.8348 | 0.9049 | | hf_BigBird | 2 | 0.9837 | 0.9784 | 0.454 | 1.2192 | 0.8122 | 1.096 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8013 | 1.0681 | | alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.775 | 0.7973 | 1.0079 | | hf_Bart | 4 | 0.9102 | 0.831 | 0.3635 | nan | 0.7933 | 0.9724 | | mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3448 | 0.7921 | 0.791 | 0.8143 | | timm_vovnet | 32 | 0.9903 | 0.7678 | 0.341 | 0.7755 | 0.7799 | 0.8875 | | pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4253 | nan | 0.7783 | 0.8847 | | resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3886 | 0.81 | 0.7644 | 0.7753 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.7341 | 0.7633 | 1.0588 | | mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3407 | 0.8226 | 0.7541 | 0.7741 | | drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 | | LearningToPaint | 96 | 0.9293 | 0.7196 | 0.3826 | 0.6701 | 0.7295 | 0.925 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4736 | 0.9302 | 0.7295 | 1.0367 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3927 | 1.0881 | 0.7133 | 0.7227 | | resnet18 | 16 | 0.9779 | 0.7727 | 0.3943 | 0.7314 | 0.6102 | 0.6257 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5125 | 0.5596 | 0.5596 | 0.5596 | | hf_Reformer | 4 | 0.9861 | 0.9861 | 0.5889 | 0.9861 | 0.5295 | 0.9885 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | 0.4481 | 0.4691 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 | | dcgan | 32 | 
0.9698 | 0.7838 | 0.4994 | 0.7838 | 0.2123 | 0.2137 |
| hf_GPT2_large | 4 | 0.9582 | 0.8718 | nan | nan | nan | 1.1354 |
| tacotron2 | 64 | 0.9866 | 0.4047 | 0.3142 | 0.3908 | nan | 0.4114 |
| dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | nan | nan |
| hf_Longformer | 2 | 0.9734 | 0.967 | 0.349 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| hf_BigBird | 2 | 193.5142 | 215.0213 | 180.941 | 232.1067 | 174.4825 | 195.9982 |
| timm_vision_transformer_large | 8 | 184.4533 | 186.6023 | nan | nan | 169.7636 | 172.2068 |
| Background_Matting | 4 | 140.7573 | 131.3079 | 149.1193 | 126.3844 | 108.1172 | 109.6163 |
| hf_T5 | 8 | 174.2596 | 188.4545 | nan | 129.2397 | 93.5287 | 93.1146 |
| hf_T5_large | 2 | 219.4296 | 265.5191 | nan | nan | 89.4077 | 111.3612 |
| timm_nfnet | 128 | 131.8974 | 131.5621 | 149.1212 | 143.2707 | 87.4397 | 91.8435 |
| hf_Reformer | 4 | 82.3688 | 82.3877 | 83.1063 | 127.6202 | 70.0225 | 69.8226 |
| Super_SloMo | 6 | 79.5184 | 79.831 | 89.627 | nan | 64.8616 | 66.4954 |
| yolov3 | 16 | 68.6731 | 69.1744 | 85.2977 | nan | 63.0974 | 64.3576 |
| demucs | 4 | 58.3956 | 56.7975 | 57.2131 | 57.0808 | 56.9051 | 57.1974 |
| timm_regnet | 32 | 73.6642 | 77.428 | 81.0453 | 91.3126 | 54.8686 | 60.2097 |
| vgg16 | 64 | 66.2586 | 66.2554 | 77.3586 | 68.1311 | 52.0933 | 52.4035 |
| resnet152 | 32 | 93.19 | 93.7737 | 73.7146 | nan | 45.9847 | 73.7276 |
| speech_transformer | 32 | 61.1574 | 72.128 | 33.6127 | nan | 43.9395 | 37.6152 |
| fastNLP_Bert | 6 | 55.8713 | 62.4667 | 72.6506 | nan | 37.1736 | 38.8349 |
| timm_efficientdet | 1 | 164.6247 | 201.062 | 76.0279 | nan | 35.8834 | 113.7971 |
| attention_is_all_you_need_pytorch | 256 | 52.6102 | 59.4592 | 63.1451 | nan | 35.1067 | 36.1915 |
| hf_Bart | 4 | 57.1086 | 80.685 | 65.4836 | nan | 34.09 | 35.5049 |
| mobilenet_v2 | 96 | 49.0119 | 49.6415 | 64.7955 | 47.2468 | 31.4289 | 32.3183 |
| pytorch_unet | 1 | 40.0107 | 40.291 | 46.2767 | 36.9016 | 29.3858 | 30.2302 |
| hf_Albert | 8 | 68.4573 | 71.4457 | 88.4578 | nan | 29.338 | 30.0861 |
| hf_GPT2 | 4 | 48.8154 | 50.6054 | 60.2533 | 170.1472 | 25.4075 | 25.9383 |
| timm_vovnet | 32 | 34.9017 | 36.2533 | 37.2699 | 39.764 | 24.7339 | 28.4038 |
| shufflenet_v2_x1_0 | 128 | 40.7603 | 39.7811 | 41.6895 | 49.0407 | 24.4348 | 29.4997 |
| timm_efficientnet | 32 | 48.2682 | 57.4896 | 43.5439 | 70.0748 | 22.5159 | 38.0076 |
| hf_Bert | 4 | 40.616 | 49.8664 | 44.0067 | nan | 21.4209 | 23.5838 |
| hf_DistilBert | 8 | 31.1704 | 32.0659 | 41.9902 | 84.8307 | 20.883 | 21.2738 |
| resnet50 | 32 | 33.7231 | 32.8461 | 32.4109 | 42.2118 | 19.3934 | 25.2693 |
| BERT_pytorch | 16 | 55.1753 | 66.2267 | 35.177 | 75.064 | 16.9606 | 26.9736 |
| timm_resnest | 32 | 24.5429 | 24.3387 | 29.4839 | 25.5992 | 12.883 | 15.6128 |
| densenet121 | 4 | 80.2259 | 79.9039 | 29.2239 | 108.2873 | 12.8436 | 59.0698 |
| mobilenet_v3_large | 32 | 36.0278 | 36.0424 | 24.0161 | 47.5741 | 12.031 | 28.971 |
| mnasnet1_0 | 32 | 30.1033 | 29.1239 | 23.2042 | 39.2368 | 11.5173 | 22.2989 |
| pytorch_stargan | 16 | 16.0613 | 14.7136 | 15.447 | nan | 10.9602 | 11.4272 |
| nvidia_deeprecommender | 256 | 10.4205 | 10.4455 | 14.9032 | 10.3277 | 10.4795 | 10.0755 |
| timm_vision_transformer | 8 | 29.8051 | 35.1811 | 16.5723 | 49.3166 | 9.9914 | 20.2384 |
| resnext50_32x4d | 8 | 32.0002 | 29.9886 | 15.6452 | 39.8695 | 8.5186 | 23.8124 |
| LearningToPaint | 96 | 14.8653 | 14.8909 | 12.8761 | 17.7788 | 8.1752 | 11.5419 |
| alexnet | 128 | 9.8519 | 9.8482 | 12.045 | 10.5708 | 8.1102 | 8.1413 |
| pytorch_CycleGAN_and_pix2pix | 1 | 19.7899 | 20.0679 | 10.4437 | nan | 6.7271 | 11.9278 |
| tts_angular | 64 | 6.5592 | 6.728 | 6.8409 | 6.48 | 6.6216 | 6.3612 |
| squeezenet1_1 | 32 | 14.8957 | 15.9578 | 10.1411 | 21.0218 | 6.2181 | 11.9763 |
| resnet18 | 16 | 13.1999 | 13.0684 | 8.0429 | 17.0311 | 4.7406 | 10.907 |
| functorch_dp_cifar10 | 64 | 14.087 | 14.6987 | 6.2992 | nan | 2.9563 | 14.9567 |
| pytorch_struct | 200 | 4.6689 | 6.2131 | 4.5583 | 7.6996 | 2.2841 | 3.7849 |
| drq | 1 | 3.9341 | 4.8462 | 2.0957 | 6.6531 | 2.0844 | 3.6121 |
| dcgan | 32 | 3.1383 | 3.376 | 1.8628 | 4.4816 | 1.1143 | 3.121 |
| soft_actor_critic | 256 | 1.3968 | 1.9155 | 1.0948 | 2.894 | 0.8735 | 1.4729 |
| lennard_jones | 1000 | 1.4798 | 1.9008 | 1.1525 | 3.2362 | 0.7369 | 1.4765 |
| tacotron2 | 64 | 3072.8083 | 3974.9816 | 2973.9841 | 5002.5454 | nan | 3572.2539 |
| hf_GPT2_large | 4 | 209.5961 | 212.0102 | nan | nan | nan | 112.4447 |
| dlrm | 2048 | 478.7127 | 458.1607 | nan | 491.3831 | nan | nan |
| hf_Longformer | 2 | 125.9579 | 136.9207 | 134.7688 | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
~~~

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0237 | 0.8395 | 2.3483 | 0.0 | 4.7655 | 1.634 |
| MobileBertForMaskedLM | 32 | 1.0198 | 0.8257 | 1.9714 | 0.0 | 4.2331 | 1.7709 |
| CamemBert | 1 | 1.0303 | 0.8511 | 1.7797 | 0.0 | 3.6131 | 1.8111 |
| MobileBertForQuestionAnswering | 64 | 1.014 | 0.8345 | 1.5223 | 0.0 | 3.5795 | 1.689 |
| MT5ForConditionalGeneration | 8 | 1.0167 | 0.8732 | 1.5333 | 0.8689 | 3.502 | 2.4521 |
| M2M100ForConditionalGeneration | 8 | 1.062 | 0.8219 | 1.4101 | 0.6772 | 2.6847 | 1.763 |
| DistillGPT2 | 1 | 1.032 | 0.8685 | 1.3009 | 0.0 | 2.672 | 2.0346 |
| GPT2ForSequenceClassification | 4 | 0.9989 | 0.9781 | 0.0 | 0.4956 | 2.3406 | 2.2765 |
| PLBartForConditionalGeneration | 16 | 1.0122 | 0.833 | 1.0572 | 0.0 | 2.1955 | 1.6884 |
| ElectraForQuestionAnswering | 64 | 1.0005 | 0.9786 | 0.7678 | 0.0 | 2.0363 | 1.9783 |
| MegatronBertForQuestionAnswering | 16 | 1.0327 | 0.8563 | 1.0636 | 0.0 | 1.9461 | 1.8095 |
| PegasusForConditionalGeneration | 16 | 1.0092 | 0.8278 | 1.0102 | 0.6649 | 1.9021 | 1.6245 |
| LayoutLMForSequenceClassification | 16 | 1.0003 | 0.98 | 0.771 | 0.0 | 1.8068 | 1.76 |
| MegatronBertForCausalLM | 16 | 1.0355 | 0.8556 | 1.0659 | 0.0 | 1.7781 | 1.7302 |
| XGLMForCausalLM | 8 | 1.0092 | 0.8172 | 0.9339 | 0.0 | 1.7538 | 1.5743 |
| ElectraForCausalLM | 32 | 0.9999 | 0.942 | 0.7154 | 0.0 | 1.7503 | 1.7577 |
| T5Small | 1 | 1.0227 | 0.8739 | 1.2379 | 0.8595 | 1.6931 | 1.4283 |
| AlbertForQuestionAnswering | 4 | 1.0001 | 0.8858 | 0.0 | 0.0 | 1.6465 | 1.6375 |
| MBartForConditionalGeneration | 16 | 1.0117 | 0.8353 | 1.0059 | 0.0 | 1.6385 | 1.5836 |
| AlbertForMaskedLM | 4 | 1.0002 | 0.8853 | 0.0 | 0.0 | 1.6336 | 1.6265 |
| LayoutLMForMaskedLM | 16 | 1.0005 | 0.9716 | 0.7561 | 0.0 | 1.6 | 1.5806 |
| T5ForConditionalGeneration | 4 | 1.0097 | 0.9188 | 0.7577 | 1.1627 | 1.5927 | 1.5683 |
| OPTForCausalLM | 32 | 1.0081 | 0.9267 | 0.7734 | 0.3376 | 1.5203 | 1.507 |
| Speech2Text2ForCausalLM | 128 | 1.0077 | 0.9336 | 0.7105 | 0.8046 | 1.502 | 1.5455 |
| DistilBertForQuestionAnswering | 64 | 1.0013 | 0.9691 | 0.7434 | 0.3584 | 1.4506 | 1.4008 |
| BertForQuestionAnswering | 128 | 1.0 | 0.9847 | 0.778 | 0.0 | 1.44 | 1.4143 |
| RobertaForQuestionAnswering | 128 | 1.0 | 0.984 | 0.78 | 0.0 | 1.4221 | 1.4281 |
| BartForConditionalGeneration | 2 | 1.0048 | 0.9691 | 0.0 | 0.0 | 1.4184 | 1.3902 |
| BartForCausalLM | 4 | 1.0014 | 0.9703 | 0.7549 | 0.0 | 1.4114 | 1.4152 |
| RobertaForCausalLM | 64 | 1.0006 | 0.9589 | 0.754 | 0.0 | 1.3991 | 1.3708 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0082 | 0.9214 | 0.7317 | 0.0 | 1.3851 | 1.3826 |
| BertForMaskedLM | 64 | 1.0001 | 0.9579 | 0.741 | 0.0 | 1.295 | 1.285 |
| DebertaForMaskedLM | 4 | 0.9109 | 0.7182 | 0.791 | 0.0 | 1.277 | 1.1467 |
| PLBartForCausalLM | 32 | 1.0076 | 0.9316 | 0.7962 | 0.8381 | 1.2503 | 1.2439 |
| BlenderbotSmallForCausalLM | 64 | 1.0027 | 0.9252 | 0.7199 | 0.0 | 1.214 | 1.2216 |
| DistilBertForMaskedLM | 64 | 1.0012 | 0.9514 | 0.7098 | 0.464 | 1.2109 | 1.2132 |
| MBartForCausalLM | 32 | 1.0036 | 0.9457 | 0.7568 | 0.0 | 1.1686 | 1.1681 |
| TrOCRForCausalLM | 32 | 1.0033 | 0.9491 | 0.7603 | 0.0 | 1.1604 | 1.1608 |
| BigBird | 1 | 0.9871 | 0.9028 | 1.0378 | 0.834 | 1.1463 | 1.0256 |
| PegasusForCausalLM | 32 | 0.9993 | 0.9521 | 0.7506 | 0.8474 | 1.14 | 1.1399 |
| DebertaForQuestionAnswering | 8 | 0.9835 | 0.8609 | 0.7227 | 0.0 | 1.1358 | 1.2078 |
| AllenaiLongformerBase | 1 | 0.9387 | 0.7199 | 0.8541 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
| BigBird | 1 | pass | pass | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AllenaiLongformerBase | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| MBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------+-----------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForQuestionAnswering | 8 | 5.5509 | 11.4091 | 35.6433 | nan | 100.9589 | 41.964 |
| DebertaForMaskedLM | 4 | 5.2629 | 11.411 | 35.9631 | nan | 100.2137 | 39.9969 |
| MobileBertForMaskedLM | 32 | 9.6476 | 32.5736 | 58.6878 | nan | 85.117 | 81.5045 |
| MobileBertForQuestionAnswering | 64 | 9.5963 | 32.7644 | 57.7499 | nan | 82.6893 | 79.6116 |
| XGLMForCausalLM | 8 | 3.0031 | 13.5059 | 28.1929 | nan | 81.6079 | 78.3857 |
| M2M100ForConditionalGeneration | 8 | 3.534 | 17.4328 | 23.4747 | 451.2527 | 71.0844 | 70.5961 |
| MBartForConditionalGeneration | 16 | 3.9109 | 17.4035 | 30.4818 | nan | 61.2256 | 59.6837 |
| PegasusForConditionalGeneration | 16 | 3.5423 | 17.3276 | 28.7618 | 443.4726 | 60.5842 | 55.559 |
| BartForConditionalGeneration | 2 | 3.685 | 17.207 | nan | nan | 59.6008 | 57.8341 |
| YituTechConvBert | 1 | 2.6984 | 11.1228 | 17.2562 | nan | 52.7018 | 49.513 |
| MegatronBertForCausalLM | 16 | 4.0451 | 14.5971 | 23.6376 | nan | 48.6048 | 47.2939 |
| MegatronBertForQuestionAnswering | 16 | 3.9064 | 14.5335 | 22.3992 | nan | 48.0185 | 46.3525 |
| BigBird | 1 | 8.5743 | 16.0382 | 30.2992 | 130.6301 | 45.614 | 30.0005 |
| MT5ForConditionalGeneration | 8 | 3.8176 | 13.1385 | 21.2695 | 175.0247 | 44.4971 | 42.2784 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.5149 | 11.4483 | 18.5728 | nan | 40.3107 | 38.6924 |
| PLBartForConditionalGeneration | 16 | 2.0182 | 8.9426 | 13.8708 | nan | 33.674 | 33.2809 |
| T5ForConditionalGeneration | 4 | 2.567 | 8.8282 | 13.7986 | 107.916 | 33.5876 | 32.6527 |
| T5Small | 1 | 2.7883 | 9.0334 | 13.6054 | 108.3692 | 33.106 | 32.177 |
| LayoutLMForSequenceClassification | 16 | 2.1682 | 7.761 | 11.9527 | nan | 31.1275 | 30.5897 |
| ElectraForCausalLM | 32 | 1.854 | 7.1987 | 11.2208 | nan | 30.516 | 28.4135 |
| PegasusForCausalLM | 32 | 1.4663 | 6.5667 | 10.4479 | 131.8655 | 26.875 | 24.8176 |
| LayoutLMForMaskedLM | 16 | 2.312 | 7.6467 | 11.5922 | nan | 25.7474 | 24.6541 |
| MBartForCausalLM | 32 | 1.3713 | 6.6326 | 9.7276 | nan | 25.2935 | 24.5115 |
| RobertaForCausalLM | 64 | 1.7844 | 7.2632 | 10.5112 | nan | 25.0474 | 24.5256 |
| BertForMaskedLM | 64 | 1.7845 | 7.081 | 10.7864 | nan | 24.8506 | 23.8476 |
| ElectraForQuestionAnswering | 64 | 1.8663 | 7.1957 | 10.7599 | nan | 24.5665 | 22.9424 |
| TrOCRForCausalLM | 32 | 1.4244 | 6.6433 | 10.1009 | nan | 24.1959 | 23.3682 |
| OPTForCausalLM | 32 | 1.4683 | 6.7404 | 11.3585 | 135.0538 | 23.9159 | 22.7504 |
| BartForCausalLM | 4 | 1.4614 | 6.6209 | 10.0027 | nan | 23.8581 | 22.728 |
| BertForQuestionAnswering | 128 | 1.8 | 7.0957 | 10.5926 | nan | 23.2765 | 22.855 |
| RobertaForQuestionAnswering | 128 | 1.8338 | 7.1811 | 10.9546 | nan | 22.6048 | 21.6113 |
| CamemBert | 1 | 1.8379 | 7.1804 | 10.1528 | nan | 21.6357 | 21.4157 |
| AlbertForMaskedLM | 4 | 1.6264 | 6.8015 | nan | nan | 21.5279 | 20.2271 |
| GPT2ForSequenceClassification | 4 | 1.7212 | 7.0176 | nan | 118.8624 | 20.8794 | 19.905 |
| AlbertForQuestionAnswering | 4 | 1.6298 | 6.861 | nan | nan | 20.536 | 19.2856 |
| BlenderbotSmallForCausalLM | 64 | 0.9946 | 4.3598 | 6.6193 | nan | 17.7714 | 16.8588 |
| Speech2Text2ForCausalLM | 128 | 0.8583 | 3.3864 | 5.6004 | 60.3247 | 16.3413 | 15.6323 |
| PLBartForCausalLM | 32 | 0.8156 | 3.4837 | 5.0119 | 74.5061 | 15.3272 | 14.9834 |
| DistilBertForMaskedLM | 64 | 0.7676 | 3.4798 | 5.7938 | 62.713 | 14.7041 | 14.3685 |
| DistilBertForQuestionAnswering | 64 | 0.7742 | 3.4831 | 5.7762 | 70.4555 | 14.3329 | 13.9555 |
| DistillGPT2 | 1 | 0.9204 | 3.4315 | 4.6372 | nan | 13.6493 | 13.9599 |
| AllenaiLongformerBase | 1 | 7.1245 | 15.6178 | 59.4162 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 0.9675 | 0.9164 | nan | 1.1872 | 1.0783 | 1.1637 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.0323 | 1.5286 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0218 | 1.0756 |
| AlbertForMaskedLM | 4 | 0.9998 | 0.7431 | nan | nan | 1.0074 | 1.5007 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 0.9844 | 1.025 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 0.9829 | 1.0613 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9691 | 1.1807 |
| T5ForConditionalGeneration | 4 | 0.9996 | 0.9527 | 0.3625 | 1.0964 | 0.9658 | 1.1446 |
| T5Small | 1 | 1.0 | 0.8935 | 0.3618 | 0.9973 | 0.9652 | 1.1096 |
| PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | 1.1 | 0.9327 | 0.9847 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9628 | 0.4377 | 1.1462 | 0.9159 | 1.0769 |
| LayoutLMForMaskedLM | 16 | 0.9999 | 0.9238 | 0.3662 | nan | 0.9124 | 0.9464 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9037 | 1.0411 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | 0.3996 | nan | 0.9006 | 0.9641 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0053 |
| MegatronBertForCausalLM | 16 | 0.9998 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0207 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3468 | 1.0551 | 0.89 | 0.9848 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3786 | nan | 0.8834 | 0.9285 |
| RobertaForCausalLM | 64 | 0.9991 | 0.8993 | 0.3788 | nan | 0.8829 | 0.9282 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | 0.3997 | nan | 0.8816 | 0.9425 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | 0.4002 | nan | 0.8755 | 1.0595 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | 0.4067 | 0.919 | 0.875 | 0.919 |
| OPTForCausalLM | 32 | 0.9996 | 0.8679 | 0.3724 | 1.0333 | 0.8727 | 0.9449 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | 0.4146 | nan | 0.8523 | 0.9882 |
| DistilBertForMaskedLM | 64 | 0.9999 | 0.8599 | 0.3635 | 1.0791 | 0.8215 | 0.8801 |
| BigBird | 1 | 1.0008 | 0.9533 | 0.4483 | 1.1342 | 0.8178 | 1.0597 |
| CamemBert | 1 | 0.9989 | 0.8143 | 0.416 | nan | 0.8065 | 0.9306 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9234 | 0.4336 | nan | 0.8055 | 0.9516 |
| DistillGPT2 | 1 | 0.9963 | 0.8033 | 0.4018 | nan | 0.8048 | 0.9949 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.3532 | 1.0437 | 0.8039 | 0.898 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3978 | 0.9947 | 0.7975 | 0.8675 |
| ElectraForCausalLM | 32 | 0.9974 | 0.848 | 0.3928 | nan | 0.7949 | 0.8607 |
| YituTechConvBert | 1 | 0.9718 | 0.8664 | 0.4315 | nan | 0.7909 | 0.9314 |
| BlenderbotSmallForCausalLM | 64 | 0.9996 | 0.8172 | 0.3687 | nan | 0.778 | 0.859 |
| M2M100ForConditionalGeneration | 8 | 1.0018 | 0.9401 | 0.4655 | 1.0279 | 0.7619 | 0.9892 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | 0.3466 | nan | 0.5931 | 0.7994 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | 0.3107 | nan | 0.4995 | 0.635 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9816 | 0.3623 | nan | 0.409 | 1.0248 |
| DebertaForQuestionAnswering | 8 | 0.9754 | 1.0737 | 0.3252 | nan | 0.3071 | 1.1931 |
| AllenaiLongformerBase | 1 | 0.9977 | 0.9476 | 0.3852 | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| BigBird | 1 | 187.6644 | 205.1841 | 179.6176 | 233.3983 | 169.5433 | 188.7012 |
| AlbertForMaskedLM | 4 | 266.964 | 301.5164 | nan | nan | 163.7461 | 164.3171 |
| AlbertForQuestionAnswering | 4 | 264.9085 | 299.0791 | nan | nan | 161.2566 | 161.9879 |
| BartForConditionalGeneration | 2 | 135.6302 | 140.6238 | nan | nan | 95.8847 | 97.8919 |
| BlenderbotSmallForConditionalGeneration | 64 | 115.6126 | 120.4467 | 150.6582 | nan | 80.0487 | 79.9002 |
| BartForCausalLM | 4 | 112.3509 | 116.0284 | 146.8259 | nan | 79.6447 | 79.2428 |
| RobertaForQuestionAnswering | 128 | 111.2 | 112.9751 | 142.8487 | nan | 78.3717 | 78.024 |
| BertForQuestionAnswering | 128 | 110.8113 | 112.4935 | 142.404 | nan | 77.0375 | 78.3569 |
| LayoutLMForMaskedLM | 16 | 112.0833 | 115.391 | 148.3155 | nan | 70.1767 | 71.0073 |
| MBartForConditionalGeneration | 16 | 114.7877 | 127.4449 | 116.3129 | nan | 67.0647 | 70.6291 |
| PegasusForConditionalGeneration | 16 | 119.9187 | 128.6557 | 115.046 | 182.5102 | 66.9586 | 77.6484 |
| DebertaForQuestionAnswering | 8 | 76.3551 | 87.1018 | 104.0592 | nan | 66.1598 | 62.0697 |
| T5ForConditionalGeneration | 4 | 101.8154 | 110.4905 | 135.6551 | 87.2791 | 63.6223 | 64.4705 |
| PegasusForCausalLM | 32 | 68.9452 | 72.0783 | 91.9867 | 81.3736 | 60.4264 | 60.4013 |
| TrOCRForCausalLM | 32 | 69.8089 | 73.4858 | 92.396 | nan | 60.3591 | 60.1918 |
| MBartForCausalLM | 32 | 69.6166 | 74.0875 | 92.5437 | nan | 60.1754 | 60.517 |
| BertForMaskedLM | 64 | 75.6831 | 78.9888 | 102.1711 | nan | 58.5214 | 58.7758 |
| RobertaForCausalLM | 64 | 80.3695 | 84.0892 | 106.977 | nan | 57.5876 | 58.8021 |
| ElectraForQuestionAnswering | 64 | 115.4395 | 117.1516 | 149.2721 | nan | 56.4106 | 57.9649 |
| LayoutLMForSequenceClassification | 16 | 97.2309 | 99.245 | 126.4202 | nan | 53.9048 | 55.2666 |
| MobileBertForQuestionAnswering | 64 | 217.619 | 229.2579 | 119.8433 | nan | 53.7946 | 135.5854 |
| XGLMForCausalLM | 8 | 100.3168 | 126.6184 | 94.0604 | nan | 52.9548 | 58.8329 |
| DebertaForMaskedLM | 4 | 68.2336 | 86.3616 | 79.1363 | nan | 50.0515 | 56.2138 |
| ElectraForCausalLM | 32 | 87.5979 | 92.6875 | 122.2111 | nan | 49.8263 | 49.7447 |
| M2M100ForConditionalGeneration | 8 | 101.5907 | 131.7258 | 77.3277 | 157.8869 | 48.8665 | 64.8224 |
| BlenderbotSmallForCausalLM | 64 | 58.5989 | 63.5229 | 81.6708 | nan | 48.4749 | 48.1104 |
| MegatronBertForCausalLM | 16 | 80.3345 | 96.6602 | 84.468 | nan | 47.369 | 49.7928 |
| MobileBertForMaskedLM | 32 | 199.1971 | 218.5416 | 106.0787 | nan | 43.5386 | 108.6081 |
| MegatronBertForQuestionAnswering | 16 | 78.6982 | 97.6488 | 76.675 | nan | 43.5378 | 47.6163 |
| GPT2ForSequenceClassification | 4 | 92.1763 | 93.363 | nan | 183.2359 | 39.3505 | 39.8508 |
| T5Small | 1 | 61.6181 | 82.356 | 53.5356 | 73.1401 | 39.1234 | 45.4711 |
| DistilBertForMaskedLM | 64 | 45.3949 | 47.6605 | 63.8108 | 97.7453 | 37.3816 | 37.292 |
| OPTForCausalLM | 32 | 53.7941 | 58.5438 | 69.973 | 159.7347 | 35.5924 | 36.0922 |
| PLBartForCausalLM | 32 | 39.0616 | 42.5041 | 49.417 | 46.2015 | 31.7223 | 31.7841 |
| PLBartForConditionalGeneration | 16 | 55.4083 | 68.5015 | 53.3969 | nan | 30.8411 | 34.5694 |
| MT5ForConditionalGeneration | 8 | 87.3198 | 111.7102 | 58.2897 | 101.3926 | 26.6963 | 43.399 |
| DistilBertForQuestionAnswering | 64 | 30.4897 | 31.5527 | 41.1213 | 85.404 | 21.1599 | 21.8061 |
| Speech2Text2ForCausalLM | 128 | 30.3355 | 32.6945 | 42.0915 | 39.1145 | 20.5729 | 21.3091 |
| YituTechConvBert | 1 | 62.3444 | 77.2299 | 27.3558 | nan | 13.8567 | 41.1748 |
| CamemBert | 1 | 45.0132 | 46.7437 | 21.911 | nan | 11.2469 | 22.666 |
| DistillGPT2 | 1 | 19.7998 | 24.2174 | 16.2129 | nan | 8.034 | 11.9778 |
| AllenaiLongformerBase | 1 | 97.0497 | 119.1548 | 98.8809 | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
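
The summary at the top of this comment reports a geometric mean speedup per suite and states that models failing accuracy checks are removed first. A minimal sketch of that roll-up follows. It assumes a `0.0` (or `nan`) speedup marks a model that did not pass for that backend. The helper name `geomean_speedup` and the four-row excerpt are illustrative only and are not taken from the benchmark scripts.

```python
import math

# Illustrative excerpt: inductor speedups for four huggingface models,
# copied from the "Performance speedup" table above.
speedups = {
    "YituTechConvBert": 4.7655,
    "CamemBert": 3.6131,
    "BigBird": 1.1463,
    "AllenaiLongformerBase": 0.0,  # assumed to mark a failed run
}


def geomean_speedup(values):
    """Geometric mean over positive, non-NaN speedups (hypothetical helper).

    Entries equal to 0.0 or NaN are treated as failed models and skipped.
    Returns NaN when no usable entry remains.
    """
    kept = [v for v in values if not math.isnan(v) and v > 0.0]
    if not kept:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow
    # from multiplying many values together.
    return math.exp(sum(math.log(v) for v in kept) / len(kept))


result = geomean_speedup(speedups.values())
print(f"geomean speedup over passing models: {result:.4f}x")
```

Dropping failed entries instead of counting them as `0.0` keeps the mean defined; a single zero would otherwise force the geometric mean to zero.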

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| xcit_large_24_p8_224 | 5 | 1.0019 | 0.0 | 0.0 | 0.0 | 2.1299 | 1.8479 |
| regnety_002 | 128 | 0.9789 | 0.9501 | 1.133 | 0.8608 | 2.1203 | 1.4366 |
| ghostnet_100 | 128 | 1.0044 | 0.9793 | 0.8848 | 1.0146 | 2.1057 | 1.7877 |
| lcnet_050 | 128 | 0.9699 | 0.9498 | 0.8461 | 1.0328 | 2.0413 | 1.5644 |
| tnt_s_patch16_224 | 128 | 0.9999 | 0.9984 | 0.0 | 0.0 | 1.921 | 1.8915 |
| twins_pcpvt_base | 64 | 1.0049 | 0.9172 | 0.9443 | 0.0 | 1.6974 | 1.6298 |
| res2net101_26w_4s | 64 | 1.0018 | 1.0071 | 0.9594 | 0.0 | 1.6079 | 1.3135 |
| coat_lite_mini | 128 | 1.0001 | 0.9889 | 0.842 | 1.152 | 1.6042 | 1.5691 |
| hrnet_w18 | 128 | 1.0029 | 1.022 | 0.8623 | 0.0 | 1.5941 | 1.4736 |
| dla102 | 128 | 1.0001 | 0.9962 | 0.839 | 1.3133 | 1.5843 | 1.5507 |
| volo_d1_224 | 64 | 0.9998 | 0.994 | 0.8449 | 0.0 | 1.5565 | 1.5177 |
| nfnet_l0 | 128 | 0.9992 | 0.8102 | 0.7105 | 0.8485 | 1.5493 | 1.4689 |
| gmlp_s16_224 | 128 | 0.9998 | 0.9961 | 0.787 | 1.0101 | 1.5307 | 1.5026 |
| resnest101e | 64 | 1.0022 | 0.9909 | 0.811 | 0.0 | 1.5223 | 1.4598 |
| adv_inception_v3 | 128 | 0.9999 | 0.996 | 0.8517 | 1.1431 | 1.5058 | 1.4728 |
| gluon_inception_v3 | 128 | 1.0 | 0.9962 | 0.8538 | 1.1354 | 1.5051 | 1.4731 |
| dm_nfnet_f0 | 128 | 0.9987 | 0.9944 | 0.881 | 0.9234 | 1.503 | 1.4303 |
| inception_v3 | 128 | 1.0 | 0.9961 | 0.8533 | 1.1438 | 1.5021 | 1.4669 |
| gmixer_24_224 | 128 | 1.0001 | 0.8804 | 0.7216 | 0.9232 | 1.4787 | 1.4648 |
| res2net50_14w_8s | 128 | 1.0 | 0.9938 | 0.8099 | 0.9962 | 1.4757 | 1.417 |
| mobilenetv3_large_100 | 128 | 0.9552 | 0.9515 | 0.7829 | 0.9758 | 1.4584 | 1.4338 |
| swin_base_patch4_window7_224 | 64 | 0.9997 | 0.9604 | 0.0 | 0.0 | 1.4454 | 1.4213 |
| selecsls42b | 128 | 0.9997 | 0.9952 | 0.8418 | 1.2864 | 1.4421 | 1.4054 |
| fbnetv3_b | 128 | 0.9538 | 0.9405 | 0.7895 | 0.0 | 1.4417 | 1.4044 |
| mnasnet_100 | 128 | 0.9541 | 0.9434 | 0.7891 | 1.2009 | 1.4326 | 1.4576 |
| cait_m36_384 | 4 | 1.0 | 1.0101 | 0.0 | 0.0 | 1.417 | 1.3655 |
| res2next50 | 128 | 0.9994 | 0.9962 | 0.8317 | 1.146 | 1.4114 | 1.3462 |
| mobilenetv2_100 | 128 | 0.9529 | 0.9365 | 0.723 | 1.119 | 1.403 | 1.4335 |
| crossvit_9_240 | 128 | 1.0001 | 0.9896 | 0.8348 | 0.92 | 1.3926 | 1.3689 |
| ese_vovnet19b_dw | 128 | 0.9705 | 0.9648 | 0.7674 | 1.1287 | 1.3771 | 1.3792 |
| mobilevit_s | 64 | 0.9738 | 0.8154 | 0.6558 | 0.0 | 1.373 | 1.361 |
| spnasnet_100 | 128 | 0.9468 | 0.9388 | 0.7741 | 1.1045 | 1.3696 | 1.3885 |
| jx_nest_base | 32 | 0.9998 | 0.9934 | 0.8023 | 0.0 | 1.3606 | 1.3257 |
| fbnetc_100 | 128 | 0.9538 | 0.9417 | 0.7935 | 1.1628 | 1.3518 | 1.3773 |
| convit_base | 64 | 1.0001 | 0.9948 | 0.833 | 1.236 | 1.3376 | 1.3288 |
| tf_efficientnet_b0 | 128 | 0.9657 | 0.8084 | 0.6672 | 0.9535 | 1.3375 | 1.3567 |
| pnasnet5large | 16 | 1.0058 | 1.0269 | 0.8504 | 0.0 | 1.3292 | 1.2818 |
| poolformer_m36 | 64 | 0.9996 | 0.9963 | 0.8015 | 0.0 | 1.3265 | 1.2939 |
| botnet26t_256 | 128 | 0.9799 | 0.9753 | 0.8121 | 1.2792 | 1.3239 | 1.3309 |
| pit_b_224 | 64 | 0.9998 | 0.9953 | 0.8222 | 0.9717 | 1.3135 | 1.3091 |
| cspdarknet53 | 64 | 0.9425 | 0.9349 | 0.7567 | 1.1387 | 1.3091 | 1.3176 |
| resmlp_12_224 | 128 | 1.0004 | 0.9991 | 0.7824 | 1.4845 | 1.2818 | 1.264 |
| rexnet_100 | 128 | 0.9653 | 0.8512 | 0.6901 | 0.0 | 1.2798 | 1.2784 |
| eca_botnext26ts_256 | 128 | 0.98 | 0.8102 | 0.6694 | 1.0727 | 1.2734 | 1.2701 |
| tinynet_a | 128 | 0.9577 | 0.8037 | 0.6527 | 0.7859 | 1.2703 | 1.2808 |
| mixer_b16_224 | 128 | 1.0002 | 0.9955 | 0.8018 | 0.9001 | 1.2602 | 1.2488 |
| beit_base_patch16_224 | 64 | 0.9999 | 0.9787 | 0.0 | 0.0 | 1.247 | 1.2293 |
| deit_base_distilled_patch16_224 | 64 | 1.0 | 0.9918 | 0.7975 | 0.9772 | 1.2389 | 1.2183 |
| visformer_small | 128 | 1.0002 | 1.0017 | 0.8372 | 0.0 | 1.2296 | 1.1771 |
| sebotnet33ts_256 | 64 | 0.9675 | 0.8368 | 0.6805 | 0.9707 | 1.1954 | 1.2024 |
| tf_mixnet_l | 128 | 0.9808 | 0.9094 | 0.795 | 0.0 | 1.1774 | 1.1742 |
| mixnet_l | 128 | 0.9788 | 0.9057 | 0.792 | 0.0 | 1.1618 | 1.1567 |
| dpn107 | 32 | 0.9603 | 0.9341 | 0.7548 | 0.0 | 1.1584 | 1.1838 |
| gluon_xception65 | 32 | 0.9997 | 0.9884 | 0.7533 | 0.0 | 1.1578 | 1.1242 |
| vit_base_patch16_224 | 64 | 0.9999 | 0.9939 | 0.8352 | 0.913 | 1.1574 | 1.1416 |
| swsl_resnext101_32x16d | 32 | 0.9999 | 0.981 | 0.8104 | 0.0 | 1.134 | 1.0549 |
| repvgg_a2 | 128 | 0.9441 | 0.9348 | 0.7989 | 1.0714 | 1.1038 | 1.1218 |
| gernet_l | 128 | 0.9472 | 0.9384 | 0.7686 | 1.0621 | 1.0684 | 1.0766 |
| convmixer_768_32 | 32 | 0.9999 | 0.9968 | 0.9229 | 0.0 | 1.056 | 1.0505 |
| convnext_base | 64 | 0.9992 | 0.995 | 0.8014 | 0.0 | 0.6557 | 0.6462 |
| eca_halonext26ts | 128 | 0.9812 | 0.8157 | 0.6795 | 0.0 | 0.0 | 1.1782 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| convit_base | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 6.6226 | 29.5342 | 56.788 | nan | 147.0405 | 140.208 |
| twins_pcpvt_base | 64 | 2.8159 | 14.864 | 26.8139 | nan | 132.4825 | 127.3528 |
| xcit_large_24_p8_224 | 5 | 3.3294 | nan | nan | nan | 92.7316 | 88.1757 |
| pnasnet5large | 16 | 4.8743 | 22.1936 | 40.9045 | nan | 92.1478 | 88.5482 |
| cait_m36_384 | 4 | 3.4811 | 19.5215 | nan | nan | 87.2319 | 81.322 |
| swin_base_patch4_window7_224 | 64 | 2.9347 | 13.4279 | nan | nan | 81.4185 | 80.6863 |
| resnest101e | 64 | 3.3619 | 16.2182 | 27.3915 | nan | 77.5477 | 74.8574 |
| convnext_base | 64 | 1.4663 | 6.8641 | 11.6786 | nan | 73.3614 | 71.7117 |
| mobilevit_s | 64 | 1.8736 | 7.3732 | 15.5441 | nan | 69.2862 | 69.2486 |
| jx_nest_base | 32 | 1.8194 | 9.3206 | 15.1878 | nan | 64.972 | 63.9493 |
| res2net101_26w_4s | 64 | 3.3114 | 16.3894 | 27.8099 | nan | 63.1827 | 60.0412 |
| coat_lite_mini | 128 | 1.2253 | 5.3013 | 8.4123 | 114.4568 | 60.5961 | 59.4849 |
| res2net50_14w_8s | 128 | 2.8591 | 14.3611 | 24.3085 | 334.9766 | 57.0735 | 54.7352 |
| poolformer_m36 | 64 | 1.7434 | 7.4866 | 12.4398 | nan | 55.1388 | 52.2338 |
| sebotnet33ts_256 | 64 | 1.7096 | 6.1096 | 13.2391 | 154.1498 | 47.8718 | 46.336 |
| gmlp_s16_224 | 128 | 1.3183 | 7.2229 | 12.0116 | 196.4853 | 46.873 | 44.1163 |
| dpn107 | 32 | 4.3447 | 13.6929 | 39.173 | nan | 46.4329 | 43.4249 |
| fbnetv3_b | 128 | 3.3715 | 11.8028 | 28.9303 | nan | 46.0977 | 42.4718 |
| crossvit_9_240 | 128 | 1.655 | 8.6252 | 13.8339 | 197.6771 | 45.6034 | 42.6079 |
| gluon_xception65 | 32 | 2.1138 | 11.0893 | 17.6537 | nan | 45.531 | 42.1045 |
| volo_d1_224 | 64 | 1.3538 | 7.4282 | 12.2682 | nan | 44.3608 | 43.3606 |
| tnt_s_patch16_224 | 128 | 1.8119 | 10.5617 | nan | nan | 42.7888 | 40.7197 |
| eca_botnext26ts_256 | 128 | 1.4289 | 5.1458 | 11.1034 | 120.3494 | 40.849 | 38.0736 |
| adv_inception_v3 | 128 | 1.6614 | 8.5569 | 13.7471 | 186.6205 | 39.3643 | 36.867 |
| inception_v3 | 128 | 1.6358 | 8.35 | 13.3789 | 190.489 | 39.2419 | 37.2695 |
| dla102 | 128 | 1.9066 | 9.7327 | 15.408 | 246.5233 | 39.0456 | 36.1594 |
| gluon_inception_v3 | 128 | 1.6587 | 8.5925 | 13.3194 | 187.8101 | 38.4054 | 37.4456 |
| tf_mixnet_l | 128 | 5.9571 | 12.8353 | 26.7905 | nan | 38.3598 | 35.6208 |
| ghostnet_100 | 128 | 3.0161 | 9.9407 | 14.2955 | 194.7899 | 37.8446 | 36.691 |
| mixnet_l | 128 | 5.5491 | 12.5347 | 27.1026 | nan | 37.586 | 34.7323 |
| gmixer_24_224 | 128 | 1.5683 | 8.103 | 13.499 | 189.2535 | 37.1241 | 35.4159 |
| swsl_resnext101_32x16d | 32 | 1.8057 | 9.1741 | 14.491 | nan | 36.7626 | 34.639 |
| botnet26t_256 | 128 | 1.3921 | 4.4472 | 9.6942 | 93.6987 | 35.313 | 33.629 |
| dm_nfnet_f0 | 128 | 2.3588 | 7.4732 | 10.9687 | 165.0736 | 33.779 | 31.695 |
| res2next50 | 128 | 1.7463 | 8.0344 | 12.9588 | 198.0037 | 32.5728 | 30.741 |
| convit_base | 64 | 1.2757 | 6.0972 | 9.8892 | 144.4587 | 32.0862 | 30.119 |
| rexnet_100 | 128 | 1.9463 | 7.292 | 17.0046 | nan | 31.3947 | 29.7666 |
| tinynet_a | 128 | 2.1319 | 8.1025 | 19.6506 | 194.4801 | 31.0965 | 29.4123 |
| tf_efficientnet_b0 | 128 | 1.8747 | 6.8235 | 16.0637 | 180.8194 | 28.2751 | 25.1444 |
| cspdarknet53 | 64 | 2.3654 | 7.659 | 18.9484 | 144.7331 | 26.9277 | 25.7886 |
| mixer_b16_224 | 128 | 0.8263 | 4.1401 | 6.281 | 85.9849 | 26.8747 | 26.0792 |
| fbnetc_100 | 128 | 2.0837 | 7.0818 | 17.0904 | 135.6763 | 26.2548 | 24.0996 |
| convmixer_768_32 | 32 | 1.2947 | 6.4309 | 9.8772 | nan | 25.9289 | 23.9154 |
| visformer_small | 128 | 0.9681 | 3.9718 | 6.5622 | nan | 25.7643 | 23.7158 |
| spnasnet_100 | 128 | 2.0788 | 6.4628 | 17.5019 | 133.4356 | 25.4794 | 24.5295 |
| deit_base_distilled_patch16_224 | 64 | 0.9716 | 4.9945 | 7.4345 | 85.7386 | 25.2464 | 24.5457 |
| pit_b_224 | 64 | 1.1736 | 5.3992 | 8.4197 | 112.4694 | 24.9685 | 23.8269 |
| vit_base_patch16_224 | 64 | 0.9802 | 4.7447 | 7.302 | 87.2402 | 24.8308 | 23.3658 |
| nfnet_l0 | 128 | 1.9309 | 7.3398 | 10.8799 | 147.2274 | 24.5178 | 23.1607 |
| mobilenetv3_large_100 | 128 | 1.6081 | 5.7686 | 13.2313 | 144.19 | 24.0542 | 23.2879 |
| resmlp_12_224 | 128 | 0.721 | 3.2805 | 4.8068 | 53.107 | 23.8389 | 23.1724 |
| beit_base_patch16_224 | 64 | 1.2758 | 5.4147 | nan | nan | 23.426 | 21.578 |
| repvgg_a2 | 128 | 2.0496 | 6.1414 | 15.4392 | 191.4397 | 22.4521 | 21.7761 |
| mobilenetv2_100 | 128 | 1.6499 | 5.6889 | 12.8638 | 118.2202 | 22.3114 | 20.994 |
| regnety_002 | 128 | 1.7273 | 5.9193 | 13.3364 | 115.1358 | 21.3673 | 19.9154 |
| mnasnet_100 | 128 | 1.6271 | 5.2204 | 13.137 | 108.7575 | 21.3516 | 19.9214 |
| gernet_l | 128 | 2.0604 | 6.3674 | 15.3227 | 115.505 | 21.3151 | 19.7465 |
| selecsls42b | 128 | 0.8747 | 3.8421 | 5.8118 | 89.5115 | 18.6223 | 17.7957 |
| lcnet_050 | 128 | 1.1695 | 3.5118 | 7.3775 | 80.9954 | 15.4122 | 14.9949 |
| ese_vovnet19b_dw | 128 | 1.1112 | 3.2247 | 6.6998 | 67.2905 | 14.5279 | 13.5333 |
| eca_halonext26ts | 128 | 1.5223 | 5.2082 | 10.8948 | nan | nan | 56.0342 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| tinynet_a | 128 | 0.9889 | 0.7884 | 0.2764 | 0.4726 | 1.3706 | 1.5063 |
| gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 |
| gmlp_s16_224 | 128 | 0.9938 | 0.9715 | 0.3562 | 1.3557 | 1.2842 | 1.2997 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2664 | 0.548 | 1.1886 | 1.3558 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.1741 | 1.3111 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3632 | nan | 1.1604 | 1.2933 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.285 | nan | 1.1474 | 1.3179 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2672 | 0.476 | 1.1067 | 1.2643 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1022 | 1.1167 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 1.0592 | 1.1461 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 1.0587 | 1.152 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0576 | 1.1456 | | convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0441 | 1.1492 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 1.0332 | 1.1293 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2684 | 0.3766 | 1.0332 | 1.1821 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.0227 | 1.1355 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 0.9889 | 1.0322 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.9862 | 1.0421 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9746 | 0.9788 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9621 | 1.0521 | | dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9309 | 0.9555 | 1.031 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.9489 | 1.0707 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9397 | 1.076 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2877 | nan | 0.9363 | 1.0878 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9931 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3303 | 0.7796 | 0.9307 | 1.0268 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.9288 | 0.9735 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.9181 | 1.0684 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 
1.1764 | 0.9165 | 1.1168 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3677 | nan | 0.9112 | 0.981 | | dpn107 | 32 | 0.997 | 0.9097 | 0.353 | nan | 0.9072 | 0.9966 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8977 | 0.973 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8975 | 0.9763 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.8973 | 0.9876 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.8969 | 1.0032 | | mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.8927 | 0.963 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3529 | 0.8765 | 0.8926 | 0.9897 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 1.222 | 0.8877 | 0.8929 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8872 | 0.8923 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8795 | 0.9819 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.877 | 0.9738 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.8719 | 0.9671 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.871 | 0.9804 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2716 | nan | 0.8701 | 1.0089 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8161 | 0.8619 | 0.9858 | | cspdarknet53 | 64 | 0.9913 | 0.8405 | 0.3241 | 0.8382 | 0.8607 | 1.0102 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3361 | 0.8188 | 0.8449 | 0.9432 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.8371 | 1.0078 | | convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.806 | 0.9865 | | resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.7981 | 0.8121 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5513 
| 0.745 | 0.8293 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.7194 | 1.0197 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.7141 | 0.9624 | | jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6644 | 0.8514 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.6295 | 0.7419 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3408 | 0.679 | 0.5534 | 0.8298 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.267 | nan | nan | 1.2904 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 297.8088 | 298.2404 | 321.9984 | nan | 281.7324 | 282.9304 | | tnt_s_patch16_224 | 128 | 364.4464 | 365.146 | nan | nan | 189.6325 | 192.7252 | | hrnet_w18 | 128 | 309.9 | 290.9906 | 346.6951 | nan | 187.8348 | 204.1808 | | convnext_base | 64 | 121.5283 | 121.9944 | 151.7569 | nan | 185.6819 | 187.7159 | | pnasnet5large | 16 | 217.4701 | 219.3681 | 259.5399 | nan | 169.3958 | 174.139 | | tf_mixnet_l | 128 | 195.534 | 210.902 | 240.953 | nan | 162.9251 | 163.3113 | | mixnet_l | 128 | 187.1629 | 202.611 | 231.7355 | nan | 157.7135 | 158.5255 | | convit_base | 64 | 181.4439 | 182.4354 | 217.7961 | 146.8016 | 135.8213 | 136.5346 | | pit_b_224 | 64 | 155.2666 | 155.7445 | 188.5455 | 159.5319 | 118.0113 | 118.4023 | | cait_m36_384 | 4 | 166.1951 | 164.7583 | nan | nan | 117.5656 | 121.563 | | dla102 | 128 | 179.1171 | 179.3887 | 213.3111 | 136.4544 | 113.1141 | 115.4216 | | poolformer_m36 | 64 | 148.9069 | 149.5592 | 185.8008 | nan | 
112.2725 | 115.0318 | | beit_base_patch16_224 | 64 | 135.207 | 138.0861 | nan | nan | 108.6438 | 109.9743 | | resnest101e | 64 | 167.6791 | 164.727 | 200.0415 | nan | 107.6926 | 114.9245 | | inception_v3 | 128 | 161.228 | 161.7714 | 188.7027 | 140.8556 | 107.2963 | 109.7305 | | adv_inception_v3 | 128 | 161.2642 | 162.1174 | 189.5478 | 141.4972 | 107.2883 | 109.5537 | | gluon_inception_v3 | 128 | 161.6442 | 162.1483 | 189.0445 | 142.4681 | 107.2764 | 109.5591 | | vit_base_patch16_224 | 64 | 120.7961 | 121.4782 | 144.7481 | 132.1598 | 104.3651 | 105.7614 | | swsl_resnext101_32x16d | 32 | 118.1403 | 119.8185 | 146.1128 | nan | 104.1725 | 111.3394 | | swin_base_patch4_window7_224 | 64 | 147.6939 | 153.7083 | nan | nan | 102.2266 | 103.8756 | | res2net50_14w_8s | 128 | 145.9229 | 147.3056 | 180.5367 | 146.6579 | 99.7575 | 106.7407 | | res2next50 | 128 | 138.8632 | 139.1881 | 166.4971 | 120.5956 | 98.0075 | 102.6048 | | mixer_b16_224 | 128 | 118.7145 | 119.1769 | 148.2558 | 131.8891 | 94.2499 | 94.9761 | | dpn107 | 32 | 120.1273 | 115.7593 | 143.2149 | nan | 93.1316 | 91.5798 | | gmlp_s16_224 | 128 | 136.5303 | 136.732 | 173.398 | 135.0804 | 89.2199 | 90.6361 | | jx_nest_base | 32 | 119.0509 | 119.9655 | 148.4673 | nan | 87.6157 | 89.6547 | | dm_nfnet_f0 | 128 | 131.8899 | 131.7587 | 149.1318 | 142.2863 | 87.4121 | 91.9688 | | volo_d1_224 | 64 | 134.6085 | 135.5924 | 159.2969 | nan | 86.6529 | 88.7833 | | eca_botnext26ts_256 | 128 | 112.2357 | 135.7781 | 164.5747 | 102.6308 | 86.3411 | 86.6101 | | fbnetv3_b | 128 | 121.2353 | 122.949 | 152.8495 | nan | 84.8251 | 83.903 | | gluon_xception65 | 32 | 97.9663 | 98.8841 | 130.1001 | nan | 84.5613 | 87.121 | | gmixer_24_224 | 128 | 119.9382 | 136.181 | 166.5128 | 130.2972 | 81.2719 | 81.8882 | | visformer_small | 128 | 98.285 | 98.178 | 117.0475 | nan | 79.8377 | 83.3363 | | crossvit_9_240 | 128 | 109.3424 | 110.608 | 131.3274 | 119.0061 | 78.604 | 79.8679 | | botnet26t_256 | 128 | 106.0603 | 106.5583 | 128.011 | 81.2619 | 
78.4657 | 78.1263 | | res2net101_26w_4s | 64 | 122.8013 | 123.2288 | 127.0794 | nan | 77.8024 | 101.6518 | | twins_pcpvt_base | 64 | 126.7887 | 137.8224 | 138.9156 | nan | 76.7195 | 82.5777 | | deit_base_distilled_patch16_224 | 64 | 94.4566 | 95.1195 | 118.2294 | 96.4269 | 76.1539 | 77.4953 | | coat_lite_mini | 128 | 115.9788 | 117.2467 | 138.0034 | 100.7785 | 72.4322 | 73.9398 | | gernet_l | 128 | 79.9185 | 80.5277 | 98.7402 | 71.3023 | 70.9703 | 70.3282 | | cspdarknet53 | 64 | 96.1234 | 96.8981 | 120.03 | 79.518 | 69.2835 | 68.8108 | | rexnet_100 | 128 | 91.2662 | 103.2229 | 127.5117 | nan | 68.8251 | 68.8532 | | repvgg_a2 | 128 | 79.7035 | 80.4905 | 94.3232 | 70.2202 | 68.2672 | 67.1899 | | nfnet_l0 | 128 | 106.0156 | 130.9001 | 148.7688 | 124.8233 | 68.1891 | 72.087 | | sebotnet33ts_256 | 64 | 83.255 | 96.1117 | 118.5967 | 82.9216 | 67.3119 | 66.9266 | | tf_efficientnet_b0 | 128 | 90.6343 | 108.294 | 131.5146 | 92.0005 | 65.4973 | 64.5025 | | mobilevit_s | 64 | 90.0936 | 107.6137 | 133.706 | nan | 63.9223 | 64.4811 | | xcit_large_24_p8_224 | 5 | 128.2885 | nan | nan | nan | 62.6783 | 72.7072 | | fbnetc_100 | 128 | 88.0572 | 89.184 | 105.8808 | 72.2741 | 62.0964 | 61.005 | | tinynet_a | 128 | 75.448 | 91.8528 | 110.9989 | 93.7472 | 58.0418 | 58.2263 | | resmlp_12_224 | 128 | 68.2466 | 68.3992 | 87.3133 | 46.0413 | 53.2898 | 54.0599 | | spnasnet_100 | 128 | 76.6577 | 77.334 | 93.9757 | 65.759 | 53.1446 | 52.3476 | | ese_vovnet19b_dw | 128 | 67.9632 | 68.5696 | 86.1111 | 58.4197 | 47.9737 | 47.8076 | | mnasnet_100 | 128 | 70.3077 | 70.9438 | 85.0226 | 55.7882 | 46.7635 | 45.9917 | | ghostnet_100 | 128 | 95.5551 | 98.2052 | 108.2074 | 94.997 | 46.1392 | 54.7277 | | mobilenetv2_100 | 128 | 67.7774 | 68.8122 | 89.4174 | 57.697 | 45.9385 | 44.9514 | | selecsls42b | 128 | 62.8837 | 63.1926 | 74.7679 | 48.9187 | 43.6383 | 44.7509 | | mobilenetv3_large_100 | 128 | 65.9979 | 67.0558 | 80.7117 | 64.6866 | 43.312 | 43.9542 | | regnety_002 | 128 | 54.7963 | 59.2726 | 47.1703 
| 62.1244 | 25.4957 | 37.7173 | | lcnet_050 | 128 | 35.5333 | 36.2729 | 38.8231 | 32.3857 | 16.3823 | 22.2099 | | eca_halonext26ts | 128 | 116.0678 | 139.649 | 167.9163 | nan | nan | 96.7098 | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

../test-dynamo-runner-logs-12/huggingface_amp.png : ![](https://i.imgur.com/gWCqThE.png)

../test-dynamo-runner-logs-12/torchbench_amp.png : ![](https://i.imgur.com/i8uC4cE.png)

../test-dynamo-runner-logs-12/timm_models_amp.png : ![](https://i.imgur.com/78F5tIw.png)

williamwen42 commented 2 years ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) Experimental setup does not have an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
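As a rough illustration of "normalizing against the performance of native PyTorch", the sketch below assumes speedup is the eager latency divided by the backend latency. The helper name `speedup` is hypothetical and is not taken from the benchmark runner; the two latencies are the hrnet_w18 eager and inductor values from an "Absolute latency (ms)" table in this thread.

```python
def speedup(eager_ms: float, compiled_ms: float) -> float:
    """Speedup of a compiled backend relative to eager (assumed definition).

    Values above 1.0 mean the backend is faster than native PyTorch.
    """
    if compiled_ms <= 0:
        raise ValueError("compiled latency must be positive")
    return eager_ms / compiled_ms


# hrnet_w18: eager 309.9 ms, inductor 187.8348 ms (values from this thread).
print(f"{speedup(309.9, 187.8348):.4f}x")
```

Whether the dashboard derives its speedup column from exactly these latency columns is not stated here, so treat the arithmetic as an approximation of the idea rather than the runner's actual code.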

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 98%, 41/42  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 95%, 40/42  | 93%, 57/61  |
|     aot_cudagraphs     | 85%, 46/54 | 81%, 34/42  | 89%, 54/61  |
|    nvprims_nvfuser     | 59%, 32/54 |  10%, 4/42  | 52%, 32/61  |
|        inductor        | 81%, 44/54 | 90%, 38/42  | 90%, 55/61  |
| inductor_no_cudagraphs | 85%, 46/54 | 90%, 38/42  | 90%, 55/61  |
+------------------------+------------+-------------+-------------+
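Each passrate cell pairs a rounded percentage with the raw pass count. A minimal sketch of that formatting, assuming the percentage is simply passed/total rounded to a whole percent (the function name `format_passrate` is hypothetical):

```python
def format_passrate(passed: int, total: int) -> str:
    """Format a pass count the way the passrate cells read, e.g. '81%, 44/54'."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{passed / total:.0%}, {passed}/{total}"


# inductor on torchbench in the table above: 44 of 54 models pass.
print(format_passrate(44, 54))
```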

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.22x    |    1.12x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.04x    |    1.08x    |
|        inductor        |   1.84x    |    1.74x    |    1.41x    |
| inductor_no_cudagraphs |   1.38x    |    1.53x    |    1.36x    |
+------------------------+------------+-------------+-------------+
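The per-suite numbers above are geometric means of per-model speedups. The sketch below shows one way to compute such a mean with the standard library. It assumes that entries with a speedup of 0.0 (which the Warnings section describes as typically signifying a failed performance test) are dropped before averaging; the function name `geomean_speedup` is hypothetical.

```python
from statistics import geometric_mean


def geomean_speedup(speedups: list[float]) -> float:
    """Geometric mean over models that actually ran (speedup > 0)."""
    valid = [s for s in speedups if s > 0]
    if not valid:
        raise ValueError("no successful runs to average")
    return geometric_mean(valid)


# Two successful runs and one failed run (0.0), using illustrative values.
print(geomean_speedup([2.0, 8.0, 0.0]))
```

A geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown cancel out to 1.0x, which an arithmetic mean would not do.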

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.06    |    2.84     |    2.33     |
|       aot_eager        |    6.61    |    10.24    |    8.69     |
|     aot_cudagraphs     |    9.51    |    16.50    |    16.36    |
|    nvprims_nvfuser     |   66.11    |   133.86    |   151.35    |
|        inductor        |   33.97    |    38.49    |    44.16    |
| inductor_no_cudagraphs |   34.21    |    33.58    |    41.73    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.84x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.41x    |    0.38x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.01x    |    0.86x    |
|        inductor        |   0.83x    |    0.85x    |    0.94x    |
| inductor_no_cudagraphs |   0.96x    |    1.01x    |    1.05x    |
+------------------------+------------+-------------+-------------+
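The report does not spell out how the compression ratio is computed. One reading consistent with "higher is better" is eager peak memory divided by the backend's peak memory; the sketch below uses that assumed definition, with a hypothetical function name and made-up inputs.

```python
def compression_ratio(eager_peak_gb: float, backend_peak_gb: float) -> float:
    """Assumed definition: eager peak memory / backend peak memory.

    A ratio above 1.0 means the backend used less peak memory than eager.
    """
    if backend_peak_gb <= 0:
        raise ValueError("backend peak memory must be positive")
    return eager_peak_gb / backend_peak_gb


# Hypothetical inputs: eager peaks at 10.0 GB, the backend at 12.5 GB.
print(f"{compression_ratio(10.0, 12.5):.2f}x")
```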

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the two most recent reports that actually run the compiler.

Current report name: day_320_16_11_22_performance_amp_406
Previous report name: day_319_15_11_22_performance_amp_653

Passrate diff
~~~
+------------------------+-------------+------------+------------+
|        compiler        |    suite    | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
|        inductor        | torchbench  | 81%, 44/54 | 83%, 45/54 |
|        inductor        | huggingface | 90%, 38/42 | 90%, 38/42 |
|        inductor        | timm_models | 90%, 55/61 | 92%, 56/61 |
| inductor_no_cudagraphs | torchbench  | 85%, 46/54 | 89%, 48/54 |
| inductor_no_cudagraphs | huggingface | 90%, 38/42 | 90%, 38/42 |
| inductor_no_cudagraphs | timm_models | 90%, 55/61 | 92%, 56/61 |
+------------------------+-------------+------------+------------+
~~~
Geometric mean speedup diff
~~~
+------------------------+-------------+------------+-----------+
|        compiler        |    suite    | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
|        inductor        | torchbench  |   1.84x    |   1.87x   |
|        inductor        | huggingface |   1.74x    |   1.73x   |
|        inductor        | timm_models |   1.41x    |   1.40x   |
| inductor_no_cudagraphs | torchbench  |   1.38x    |   1.37x   |
| inductor_no_cudagraphs | huggingface |   1.53x    |   1.52x   |
| inductor_no_cudagraphs | timm_models |   1.36x    |   1.35x   |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+--------------------------------+---------------+------------------------+
|    suite    |              name              |   inductor    | inductor_no_cudagraphs |
+-------------+--------------------------------+---------------+------------------------+
| torchbench  |         hf_Longformer          |  fail_to_run  |      fail_to_run       |
| torchbench  |        vision_maskrcnn         |  fail_to_run  |      fail_to_run       |
| torchbench  |              moco              |  fail_to_run  |      fail_to_run       |
| torchbench  |           tacotron2            |  fail_to_run  |          pass          |
| torchbench  |           hf_BigBird           |  fail_to_run  |      fail_to_run       |
| torchbench  |       timm_efficientdet        |  fail_to_run  |      fail_to_run       |
| torchbench  |              dlrm              |  fail_to_run  |      fail_to_run       |
| torchbench  |      functorch_dp_cifar10      | fail_accuracy |     fail_accuracy      |
| torchbench  |       mobilenet_v3_large       | fail_accuracy |     fail_accuracy      |
| torchbench  |          tts_angular           |    0.0000     |         0.0000         |
| huggingface | MBartForConditionalGeneration  |  fail_to_run  |      fail_to_run       |
| huggingface | PLBartForConditionalGeneration |  fail_to_run  |      fail_to_run       |
| huggingface |            BigBird             |  fail_to_run  |      fail_to_run       |
| huggingface |     AllenaiLongformerBase      |  fail_to_run  |      fail_to_run       |
| timm_models |          convit_base           |  fail_to_run  |      fail_to_run       |
| timm_models |        eca_halonext26ts        |  fail_to_run  |     fail_accuracy      |
| timm_models |        gluon_xception65        | fail_accuracy |     fail_accuracy      |
| timm_models |         poolformer_m36         | fail_accuracy |     fail_accuracy      |
| timm_models |           fbnetv3_b            | fail_accuracy |     fail_accuracy      |
| timm_models |          spnasnet_100          | fail_accuracy |     fail_accuracy      |
+-------------+--------------------------------+---------------+------------------------+
~~~
Performance speedup warnings
~~~
+-------------+-----------------------+----------+------------------------+
|    suite    |         name          | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  |     hf_GPT2_large     |   0.0    |         1.8633         |
| torchbench  |       tacotron2       |   0.0    |         0.8824         |
| torchbench  |         dlrm          |   0.0    |          0.0           |
| torchbench  |      hf_BigBird       |   0.0    |          0.0           |
| torchbench  |     hf_Longformer     |   0.0    |          0.0           |
| torchbench  |         moco          |   0.0    |          0.0           |
| huggingface |        BigBird        |   0.0    |          0.0           |
| huggingface | AllenaiLongformerBase |   0.0    |          0.0           |
| timm_models |     convnext_base     |  0.6631  |         0.6452         |
| timm_models |   eca_halonext26ts    |   0.0    |          0.0           |
+-------------+-----------------------+----------+------------------------+
~~~
Compilation latency (sec) warnings
~~~
+-------------+-------------------+----------+------------------------+
|    suite    |       name        | inductor | inductor_no_cudagraphs |
+-------------+-------------------+----------+------------------------+
| torchbench  |      yolov3       | 404.1995 |        416.489         |
| torchbench  | timm_efficientdet | 146.2678 |        144.8974        |
| torchbench  |    hf_T5_large    | 145.3088 |        139.5987        |
| timm_models |     hrnet_w18     | 150.2292 |        136.4794        |
| timm_models | twins_pcpvt_base  | 130.834  |        129.5663        |
+-------------+-------------------+----------+------------------------+
~~~
Peak Memory Compression Ratio warnings
~~~
+-------------+----------------------------------+----------+------------------------+
|    suite    |               name               | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench  |        speech_transformer        |  0.8824  |         0.8866         |
| torchbench  |  timm_vision_transformer_large   |  0.879   |         1.0245         |
| torchbench  |           BERT_pytorch           |  0.8778  |         1.0948         |
| torchbench  |           timm_resnest           |  0.8759  |         0.9953         |
| torchbench  |           densenet121            |  0.8753  |         1.0051         |
| torchbench  |          squeezenet1_1           |  0.8735  |         1.0608         |
| torchbench  |             hf_Bert              |  0.8728  |         0.942          |
| torchbench  |        shufflenet_v2_x1_0        |  0.8692  |         0.9802         |
| torchbench  |             resnet50             |  0.8659  |         0.885          |
| torchbench  |           hf_T5_large            |  0.8541  |         0.8541         |
| torchbench  |          hf_DistilBert           |  0.8348  |         0.9049         |
| torchbench  |           fastNLP_Bert           |  0.8013  |         1.0681         |
| torchbench  |             alexnet              |  0.7973  |         1.0079         |
| torchbench  |             hf_Bart              |  0.7933  |         0.9724         |
| torchbench  |        mobilenet_v3_large        |  0.791   |         0.8143         |
| torchbench  |           timm_vovnet            |  0.7799  |         0.8875         |
| torchbench  |         pytorch_stargan          |  0.7783  |         0.8847         |
| torchbench  |         resnext50_32x4d          |  0.7644  |         0.7753         |
| torchbench  |              vgg16               |  0.7633  |         1.0588         |
| torchbench  |            mnasnet1_0            |  0.7541  |         0.7741         |
| torchbench  |               drq                |  0.752   |         0.9256         |
| torchbench  |        soft_actor_critic         |  0.7295  |         1.0368         |
| torchbench  |         LearningToPaint          |  0.7295  |         0.925          |
| torchbench  |     timm_vision_transformer      |  0.7133  |         0.7227         |
| torchbench  |             resnet18             |  0.6102  |         0.6257         |
| torchbench  |           hf_Reformer            |  0.5851  |         1.0014         |
| torchbench  |          lennard_jones           |  0.564   |         0.9991         |
| torchbench  |      nvidia_deeprecommender      |  0.5596  |         0.5596         |
| torchbench  |       functorch_dp_cifar10       |  0.4481  |         0.4691         |
| torchbench  |          pytorch_struct          |  0.4235  |         0.4353         |
| torchbench  |              dcgan               |  0.2123  |         0.2137         |
| torchbench  |            tacotron2             |   nan    |         0.4112         |
| huggingface | MegatronBertForQuestionAnswering |  0.893   |         1.0053         |
| huggingface |     MegatronBertForCausalLM      |  0.8919  |         1.0207         |
| huggingface |  DistilBertForQuestionAnswering  |   0.89   |         0.9848         |
| huggingface |         BertForMaskedLM          |  0.8834  |         0.9285         |
| huggingface |        RobertaForCausalLM        |  0.8828  |         0.9282         |
| huggingface |         TrOCRForCausalLM         |  0.8816  |         0.9425         |
| huggingface |  MBartForConditionalGeneration   |  0.8755  |         1.0595         |
| huggingface |   MT5ForConditionalGeneration    |  0.875   |         0.919          |
| huggingface |          OPTForCausalLM          |  0.8727  |         0.9449         |
| huggingface |  PLBartForConditionalGeneration  |  0.8523  |         0.9876         |
| huggingface |      DistilBertForMaskedLM       |  0.8215  |         0.8801         |
| huggingface |            CamemBert             |  0.8065  |         0.9306         |
| huggingface |         XGLMForCausalLM          |  0.8055  |         0.9516         |
| huggingface |           DistillGPT2            |  0.8048  |         0.9949         |
| huggingface |     Speech2Text2ForCausalLM      |  0.8039  |         0.898          |
| huggingface |        PLBartForCausalLM         |  0.7975  |         0.8675         |
| huggingface |        ElectraForCausalLM        |  0.7949  |         0.8607         |
| huggingface |         YituTechConvBert         |  0.7909  |         0.9314         |
| huggingface |    BlenderbotSmallForCausalLM    |  0.778   |         0.859          |
| huggingface |  M2M100ForConditionalGeneration  |  0.752   |         0.9892         |
| huggingface |      MobileBertForMaskedLM       |  0.5931  |         0.7994         |
| huggingface |  MobileBertForQuestionAnswering  |  0.4995  |         0.635          |
| huggingface |        DebertaForMaskedLM        |  0.409   |         1.026          |
| huggingface |   DebertaForQuestionAnswering    |  0.3071  |         1.1616         |
| timm_models |        res2net101_26w_4s         |  0.8977  |         0.973          |
| timm_models |           inception_v3           |  0.8975  |         1.0248         |
| timm_models |        gluon_inception_v3        |  0.8975  |         1.0248         |
| timm_models |         adv_inception_v3         |  0.8975  |         1.0248         |
| timm_models |         gluon_xception65         |  0.8975  |         0.9763         |
| timm_models |            fbnetc_100            |  0.8973  |         0.9876         |
| timm_models |            hrnet_w18             |  0.8969  |         1.0032         |
| timm_models |          mixer_b16_224           |  0.8927  |         0.963          |
| timm_models |           selecsls42b            |  0.8926  |         0.9897         |
| timm_models |       vit_base_patch16_224       |  0.8877  |         0.8929         |
| timm_models | deit_base_distilled_patch16_224  |  0.8872  |         0.8923         |
| timm_models |           spnasnet_100           |  0.8795  |         0.9819         |
| timm_models |         res2net50_14w_8s         |  0.877   |         0.9738         |
| timm_models |            res2next50            |  0.8719  |         0.9671         |
| timm_models |           mnasnet_100            |  0.871   |         0.9804         |
| timm_models |             mixnet_l             |  0.8701  |         1.0089         |
| timm_models |             gernet_l             |  0.8619  |         0.9858         |
| timm_models |           cspdarknet53           |  0.8607  |         1.0102         |
| timm_models |          botnet26t_256           |  0.8503  |         0.9434         |
| timm_models |            lcnet_050             |  0.8449  |         0.9432         |
| timm_models |           regnety_002            |  0.8371  |         1.0078         |
| timm_models |          convnext_base           |  0.806   |         0.9865         |
| timm_models |          resmlp_12_224           |  0.7981  |         0.8121         |
| timm_models |         sebotnet33ts_256         |  0.745   |         0.8294         |
| timm_models |          coat_lite_mini          |  0.7194  |         1.0197         |
| timm_models |          crossvit_9_240          |  0.7141  |         0.9624         |
| timm_models |           jx_nest_base           |  0.6644  |         0.8514         |
| timm_models |   swin_base_patch4_window7_224   |  0.6295  |         0.7419         |
| timm_models |            repvgg_a2             |  0.5534  |         0.8298         |
+-------------+----------------------------------+----------+------------------------+
~~~
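The four flagging rules listed under Warnings can be sketched as a small predicate. The thresholds (0.95x, 120 sec, 0.9) come from the text above; the function name, the flag labels, and the choice to treat any accuracy status starting with "pass" (including `pass_due_to_skip`) as passing are assumptions for illustration only. Metrics left as `None` are simply not checked.

```python
from typing import List, Optional

# Thresholds as stated in the Warnings section.
SPEEDUP_MIN = 0.95
COMPILE_LATENCY_MAX_SEC = 120.0
COMPRESSION_RATIO_MIN = 0.9


def warning_flags(
    accuracy: Optional[str] = None,
    speedup: Optional[float] = None,
    compile_sec: Optional[float] = None,
    compression: Optional[float] = None,
) -> List[str]:
    """Return the names of the warning rules a model trips (hypothetical helper)."""
    flags = []
    # Assumption: "pass" and "pass_due_to_skip" both count as passing.
    if accuracy is not None and not accuracy.startswith("pass"):
        flags.append("accuracy")
    if speedup is not None and speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_sec is not None and compile_sec > COMPILE_LATENCY_MAX_SEC:
        flags.append("compilation_latency")
    if compression is not None and compression < COMPRESSION_RATIO_MIN:
        flags.append("compression_ratio")
    return flags


# convnext_base under inductor: speedup 0.6631, compression ratio 0.806.
print(warning_flags(accuracy="pass", speedup=0.6631, compression=0.806))
# yolov3 under inductor: compilation latency 404.1995 sec.
print(warning_flags(compile_sec=404.1995))
```

Note that a `nan` metric compares as false against both thresholds here, so it would not be flagged by this sketch; how the real report handles `nan` entries is not stated.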

Recent Regressions

For each relevant compiler, we compare the two most recent reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): day_320_16_11_22_performance_amp_406
Previous report name (compiler: inductor, suite: torchbench): day_319_15_11_22_performance_amp_653
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): day_320_16_11_22_performance_amp_406
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): day_319_15_11_22_performance_amp_653

No regressions found.

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): day_320_16_11_22_performance_amp_406
Previous report name (compiler: inductor, suite: huggingface): day_319_15_11_22_performance_amp_653
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): day_320_16_11_22_performance_amp_406
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): day_319_15_11_22_performance_amp_653

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): day_320_16_11_22_performance_amp_406
Previous report name (compiler: inductor, suite: timm_models): day_319_15_11_22_performance_amp_653
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): day_320_16_11_22_performance_amp_406
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): day_319_15_11_22_performance_amp_653

No regressions found.
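The regression check described in this section amounts to a set difference: models flagged in the current report that were not flagged in the previous one. A minimal sketch, assuming each report has already been reduced to a set of flagged model names (the function name and the example names are hypothetical inputs, not results from these reports):

```python
from typing import List, Set


def find_regressions(prev_flagged: Set[str], cur_flagged: Set[str]) -> List[str]:
    """Models flagged now that were not flagged in the previous report."""
    return sorted(cur_flagged - prev_flagged)


# Hypothetical flagged sets for one compiler and suite.
previous = {"model_a", "model_b"}
current = {"model_a", "model_b", "model_c"}
regressions = find_regressions(previous, current)
print(regressions if regressions else "No regressions found.")
```

Models that were already flagged in the previous report stay out of the result, which matches the "previously unflagged" wording above.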

torchbench suite with amp precision

Performance speedup
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| densenet121 | 4 | 1.0021 | 0.9269 | 2.4759 | 0.7336 | 6.1007 | 1.3179 |
| functorch_dp_cifar10 | 64 | 1.0025 | 0.959 | 2.3644 | 0.0 | 5.0593 | 0.9792 |
| timm_efficientdet | 1 | 0.9846 | 0.8224 | 2.1111 | 0.0 | 4.754 | 1.5319 |
| resnext50_32x4d | 8 | 1.0029 | 0.9629 | 1.9044 | 0.7558 | 3.5498 | 1.2678 |
| timm_vision_transformer | 8 | 1.0015 | 0.8456 | 1.8027 | 0.59 | 3.4415 | 1.532 |
| BERT_pytorch | 16 | 1.0065 | 0.8313 | 1.5678 | 0.8309 | 3.366 | 2.332 |
| mobilenet_v3_large | 32 | 1.0033 | 1.0061 | 1.6121 | 0.7691 | 3.0827 | 1.3913 |
| drq | 1 | 1.0088 | 0.8228 | 1.9929 | 0.608 | 3.0015 | 1.1596 |
| dcgan | 32 | 0.9819 | 0.9163 | 1.6644 | 0.7106 | 2.8668 | 1.0467 |
| resnet18 | 16 | 1.0017 | 0.997 | 1.584 | 0.7957 | 2.8116 | 1.2074 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.9956 | 0.976 | 1.7732 | 0.0 | 2.7857 | 1.5668 |
| hf_T5_large | 2 | 1.0196 | 0.8562 | 0.0 | 0.0 | 2.6305 | 2.1346 |
| mnasnet1_0 | 32 | 1.0 | 1.021 | 1.2678 | 0.7709 | 2.6232 | 1.3497 |
| squeezenet1_1 | 32 | 0.9942 | 0.9626 | 1.4509 | 0.7253 | 2.4487 | 1.3039 |
| hf_Albert | 8 | 1.0025 | 0.9621 | 0.7743 | 0.0 | 2.3629 | 2.2746 |
| hf_GPT2 | 4 | 1.0238 | 0.9834 | 0.8156 | 0.2905 | 2.128 | 1.9203 |
| pytorch_struct | 200 | 0.9858 | 0.7499 | 1.0158 | 0.5997 | 2.1278 | 1.28 |
| timm_efficientnet | 32 | 0.9617 | 0.819 | 1.0779 | 0.6806 | 2.1064 | 1.2819 |
| hf_Bert | 4 | 1.0358 | 0.8393 | 0.9547 | 0.0 | 2.0757 | 1.8356 |
| lennard_jones | 1000 | 0.9695 | 0.7698 | 1.3011 | 0.4693 | 2.0722 | 1.0623 |
| resnet152 | 32 | 1.0018 | 1.0101 | 1.2666 | 0.0 | 2.0638 | 1.3011 |
| timm_resnest | 32 | 1.0068 | 1.0167 | 0.8369 | 0.9652 | 1.9156 | 1.6651 |
| hf_T5 | 8 | 0.9997 | 0.919 | 0.0 | 1.3547 | 1.8668 | 1.8751 |
| resnet50 | 32 | 1.0015 | 1.0246 | 1.0439 | 0.811 | 1.8012 | 1.3458 |
| LearningToPaint | 96 | 1.003 | 1.0147 | 1.1631 | 0.8377 | 1.7935 | 1.3141 |
| hf_Bart | 4 | 1.0128 | 0.8329 | 0.9446 | 0.0 | 1.758 | 1.8321 |
| soft_actor_critic | 256 | 1.0176 | 0.7414 | 1.3388 | 0.5477 | 1.746 | 1.0551 |
| shufflenet_v2_x1_0 | 128 | 1.0003 | 1.0223 | 0.9819 | 0.8605 | 1.703 | 1.4324 |
| mobilenet_v2 | 96 | 1.0001 | 1.0065 | 0.7606 | 1.0345 | 1.5589 | 1.5181 |
| speech_transformer | 32 | 0.9559 | 0.8244 | 1.7561 | 0.0 | 1.5304 | 1.5474 |
| attention_is_all_you_need_pytorch | 256 | 1.0068 | 0.9027 | 0.8406 | 0.0 | 1.5285 | 1.58 |
| timm_nfnet | 128 | 0.9991 | 1.0 | 0.8727 | 0.92 | 1.5078 | 1.4307 |
| fastNLP_Bert | 6 | 0.9992 | 0.8893 | 0.7649 | 0.0 | 1.5043 | 1.4513 |
| hf_DistilBert | 8 | 1.0017 | 0.9746 | 0.742 | 0.3688 | 1.492 | 1.4593 |
| pytorch_stargan | 16 | 0.9951 | 1.0961 | 1.0396 | 0.0 | 1.4619 | 1.5082 |
| pytorch_unet | 1 | 0.9996 | 0.9921 | 0.8639 | 1.0838 | 1.3621 | 1.331 |
| timm_regnet | 32 | 0.9786 | 0.9422 | 0.9011 | 0.7826 | 1.3385 | 1.2223 |
| timm_vovnet | 32 | 0.9205 | 0.8797 | 0.8693 | 0.7984 | 1.2996 | 1.1491 |
| vgg16 | 64 | 0.9996 | 0.9972 | 0.8566 | 0.9734 | 1.2708 | 1.2639 |
| Background_Matting | 4 | 0.9999 | 1.0155 | 0.8959 | 1.0571 | 1.2373 | 1.2197 |
| Super_SloMo | 6 | 0.9993 | 0.995 | 0.8851 | 0.0 | 1.2277 | 1.1941 |
| alexnet | 128 | 0.999 | 0.9977 | 0.815 | 0.928 | 1.2089 | 1.2102 |
| hf_Reformer | 4 | 0.9987 | 1.0002 | 0.9928 | 0.6513 | 1.1761 | 1.1801 |
| timm_vision_transformer_large | 8 | 0.9999 | 0.9903 | 0.0 | 0.0 | 1.0903 | 1.0719 |
| yolov3 | 16 | 0.9997 | 0.9906 | 0.8035 | 0.0 | 1.0881 | 1.0689 |
| tts_angular | 64 | 0.975 | 0.9437 | 0.9749 | 0.9511 | 1.0167 | 1.0065 |
| demucs | 4 | 1.0014 | 1.0 | 1.0002 | 0.998 | 1.0017 | 1.0006 |
| nvidia_deeprecommender | 256 | 0.9989 | 0.996 | 0.697 | 1.0074 | 0.9892 | 1.0305 |
| hf_GPT2_large | 4 | 1.0002 | 0.9907 | 0.0 | 0.0 | 0.0 | 1.8633 |
| tacotron2 | 64 | 0.988 | 0.7645 | 0.9786 | 0.5994 | 0.0 | 0.8824 |
| dlrm | 2048 | 1.01 | 1.1541 | 0.0 | 1.1273 | 0.0 | 0.0 |
| hf_BigBird | 2 | 0.9843 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
| hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass | pass | pass |
| yolov3 | 2 | pass | pass | pass | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | fail_to_run | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass |
| hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass |
| resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass | pass | pass |
| hf_T5 | 2 | pass | pass | pass | pass | pass | pass |
| hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass |
| hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| dlrm | 2 | pass | pass | fail_to_run | pass | fail_to_run | fail_to_run |
| functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
| yolov3 | 16 | 3.1344 | 8.4555 | 11.8143 | nan | 404.1995 | 416.489 |
| timm_efficientdet | 1 | 20.2634 | 39.3318 | 77.1986 | nan | 146.2678 | 144.8974 |
| hf_T5_large | 2 | 14.8888 | 39.8919 | nan | nan | 145.3088 | 139.5987 |
| timm_vision_transformer_large | 8 | 3.0569 | 15.4787 | nan | nan | 72.4614 | 69.0113 |
| resnet152 | 32 | 2.7633 | 14.3372 | 22.2274 | nan | 53.5223 | 52.8041 |
| densenet121 | 4 | 2.4205 | 12.1053 | 19.2315 | 234.7833 | 52.0325 | 51.0502 |
| attention_is_all_you_need_pytorch | 256 | 1.4406 | 7.2285 | 11.5793 | nan | 40.3385 | 39.5388 |
| timm_resnest | 32 | 0.6749 | 2.5525 | 3.8456 | 66.1754 | 39.2066 | 38.0779 |
| speech_transformer | 32 | 2.0109 | 8.9765 | 34.3297 | nan | 36.4502 | 34.9328 |
| hf_Bart | 4 | 2.0912 | 9.0258 | 14.3815 | nan | 36.1823 | 35.5936 |
| timm_vision_transformer | 8 | 1.031 | 4.6098 | 6.8122 | 84.3136 | 35.9455 | 35.3652 |
| BERT_pytorch | 16 | 1.8364 | 7.6958 | 11.5181 | 134.15 | 35.846 | 35.7492 |
| fastNLP_Bert | 6 | 1.9116 | 7.2674 | 11.4915 | nan | 33.1912 | 30.7348 |
| timm_nfnet | 128 | 2.2018 | 7.4185 | 11.3308 | 159.2067 | 32.7195 | 32.5884 |
| hf_T5 | 8 | 2.7481 | 9.1221 | nan | 107.4401 | 32.4549 | 31.0626 |
| timm_regnet | 32 | 2.4918 | 8.6327 | 19.8749 | 145.7126 | 28.8942 | 28.5276 |
| pytorch_stargan | 16 | 0.4649 | 2.1492 | 2.9664 | nan | 28.1889 | 26.0531 |
| timm_efficientnet | 32 | 1.9295 | 7.3003 | 15.603 | 151.8019 | 27.4539 | 27.061 |
| mobilenet_v3_large | 32 | 1.0471 | 4.797 | 7.372 | 119.58 | 26.0092 | 25.9043 |
| hf_Bert | 4 | 1.8867 | 7.2902 | 10.3758 | nan | 24.4481 | 23.5085 |
| hf_Albert | 8 | 1.6492 | 6.7058 | 10.2512 | nan | 23.2417 | 22.1951 |
| functorch_dp_cifar10 | 64 | 0.3445 | 1.4309 | 2.1635 | nan | 22.6126 | 22.7953 |
| pytorch_struct | 200 | 0.2883 | 0.8641 | 1.6177 | 7.6025 | 22.4876 | 22.2684 |
| mnasnet1_0 | 32 | 0.9474 | 4.3783 | 6.6271 | 88.0587 | 21.6773 | 21.1308 |
| hf_GPT2 | 4 | 1.8656 | 6.5011 | 9.3689 | 114.2712 | 21.0139 | 20.0732 |
| resnet50 | 32 | 1.0144 | 4.9032 | 6.8316 | 99.4446 | 20.7345 | 20.4962 |
| shufflenet_v2_x1_0 | 128 | 1.1795 | 5.4288 | 7.7034 | 101.6202 | 20.552 | 20.3635 |
| resnext50_32x4d | 8 | 1.0804 | 4.6139 | 6.8963 | 84.0624 | 20.3942 | 19.7537 |
| timm_vovnet | 32 | 1.6063 | 4.5008 | 10.0066 | 72.0347 | 20.2442 | 19.9906 |
| mobilenet_v2 | 96 | 0.9566 | 4.9474 | 7.0675 | 116.743 | 19.8935 | 19.3603 |
| Background_Matting | 4 | 0.9599 | 4.4259 | 6.5969 | 96.3123 | 19.0115 | 17.7799 |
| hf_Reformer | 4 | 1.6744 | 3.0553 | 5.483 | 17.8538 | 18.9639 | 16.267 |
| Super_SloMo | 6 | 0.9908 | 4.0544 | 5.7403 | nan | 17.5075 | 16.591 |
| hf_DistilBert | 8 | 0.8338 | 3.5538 | 5.7907 | 64.2335 | 15.5308 | 14.8387 |
| resnet18 | 16 | 0.4733 | 1.8125 | 2.6291 | 38.0577 | 11.5643 | 11.5422 |
| dcgan | 32 | 0.1827 | 0.4312 | 0.679 | 5.0555 | 10.388 | 9.9073 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.4759 | 2.0248 | 2.852 | nan | 9.1949 | 9.0932 |
| pytorch_unet | 1 | 0.4486 | 1.9193 | 2.7622 | 38.7551 | 8.4807 | 8.2167 |
| LearningToPaint | 96 | 0.4988 | 1.9185 | 2.8943 | 47.227 | 8.2431 | 7.8678 |
| squeezenet1_1 | 32 | 0.2749 | 0.9414 | 1.4095 | 6.9055 | 4.7654 | 4.5126 |
| vgg16 | 64 | 0.209 | 0.6473 | 1.102 | 5.6309 | 4.2742 | 3.9492 |
| drq | 1 | 0.3217 | 0.6423 | 1.0229 | 6.1416 | 4.2633 | 3.6368 |
| nvidia_deeprecommender | 256 | 0.2211 | 0.5266 | 0.8912 | 5.6896 | 3.745 | 3.4994 |
| soft_actor_critic | 256 | 0.2103 | 0.3601 | 0.5803 | 3.2728 | 3.5436 | 3.0174 |
| alexnet | 128 | 0.1783 | 0.4468 | 0.7337 | 5.1929 | 3.325 | 3.3008 |
| lennard_jones | 1000 | 0.1589 | 0.367 | 0.5531 | 2.942 | 2.3328 | 1.9799 |
| tts_angular | 64 | 0.1937 | 0.2399 | 0.3659 | 1.5238 | 1.9197 | 1.7273 |
| demucs | 4 | 0.3371 | 0.3585 | 0.3553 | 0.3639 | 0.2731 | 0.2673 |
| hf_GPT2_large | 4 | 5.7771 | 20.2502 | nan | nan | nan | 58.0332 |
| tacotron2 | 64 | 6.9867 | 20.1316 | 34.6561 | 91.2874 | nan | 45.901 |
| dlrm | 2048 | 0.4851 | 0.8588 | nan | 4.3981 | nan | nan |
| hf_BigBird | 2 | 4.0095 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
| timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2718 | 0.4638 | 1.2042 | 1.2318 |
| mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 1.0606 | 1.1512 |
| Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3843 | nan | 1.0541 | 1.3039 |
| timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3556 | 0.4815 | 1.0334 | 1.1302 |
| hf_Albert | 8 | 1.0001 | 0.936 | 0.3267 | nan | 1.0313 | 1.4693 |
| attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3514 | nan | 1.005 | 1.1086 |
| timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3079 | nan | 0.9991 | 1.0312 |
| Background_Matting | 4 | 1.0142 | 0.9624 | 0.3723 | 0.9771 | 0.9916 | 1.0426 |
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 |
| hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.38 | 1.118 | 0.9649 | 1.1241 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8754 | 0.4232 | nan | 0.9506 | 1.0224 |
| timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.8027 | 0.9345 | 1.0307 |
| hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.014 | 0.9304 | 1.2458 |
| resnet152 | 32 | 0.9937 | 0.8956 | 0.3631 | nan | 0.9125 | 0.9398 |
| pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3572 | 0.8496 | 0.9111 | 1.0853 |
| yolov3 | 16 | 0.9908 | 0.8381 | 0.3537 | nan | 0.9063 | 1.0466 |
| speech_transformer | 32 | 0.9991 | 0.9812 | 0.3341 | nan | 0.8824 | 0.8866 |
| timm_vision_transformer_large | 8 | 0.9974 | 0.8358 | nan | nan | 0.879 | 1.0245 |
| BERT_pytorch | 16 | 1.0003 | 0.8822 | 0.3998 | 1.1039 | 0.8778 | 1.0948 |
| timm_resnest | 32 | 0.9868 | 0.8711 | 0.3482 | 0.8451 | 0.8759 | 0.9953 |
| densenet121 | 4 | 0.9857 | 0.8678 | 0.3673 | 0.8452 | 0.8753 | 1.0051 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.8735 | 1.0608 |
| hf_Bert | 4 | 1.0 | 0.8759 | 0.3903 | nan | 0.8728 | 0.942 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8692 | 0.9802 |
| resnet50 | 32 | 0.9907 | 0.8629 | 0.3561 | 0.7806 | 0.8659 | 0.885 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 |
| hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3414 | 1.0617 | 0.8348 | 0.9049 |
| fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8013 | 1.0681 |
| alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.775 | 0.7973 | 1.0079 |
| hf_Bart | 4 | 1.0002 | 0.8307 | 0.3635 | nan | 0.7933 | 0.9724 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3448 | 0.7921 | 0.791 | 0.8143 |
| timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3407 | 0.7755 | 0.7799 | 0.8875 |
| pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4252 | nan | 0.7783 | 0.8847 |
| resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3888 | 0.81 | 0.7644 | 0.7753 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.7341 | 0.7633 | 1.0588 |
| mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3408 | 0.8226 | 0.7541 | 0.7741 |
| drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 |
| soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | 0.383 | 0.6701 | 0.7295 | 0.925 |
| timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3918 | 1.0881 | 0.7133 | 0.7227 |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.3943 | 0.7314 | 0.6102 | 0.6257 |
| hf_Reformer | 4 | 0.9996 | 0.9996 | 0.6037 | 0.9999 | 0.5851 | 1.0014 |
| lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5124 | 0.5596 | 0.5596 | 0.5596 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | 0.4481 | 0.4691 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.7838 | 0.2123 | 0.2137 |
| hf_GPT2_large | 4 | 0.9956 | 0.8732 | nan | nan | nan | 1.1499 |
| tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3906 | nan | 0.4112 |
| dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | nan | nan |
| hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Absolute latency (ms)
~~~
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
| timm_vision_transformer_large | 8 | 183.9264 | 185.8603 | nan | nan | 168.7355 | 171.7368 |
| Background_Matting | 4 | 141.7648 | 131.3625 | 148.8745 | 126.0196 | 107.7583 | 109.2646 |
| hf_T5 | 8 | 174.4926 | 189.4402 | nan | 128.4126 | 93.3232 | 92.963 |
| hf_T5_large | 2 | 218.1989 | 260.3696 | nan | nan | 89.1603 | 110.8583 |
| timm_nfnet | 128 | 131.8874 | 131.5784 | 149.7607 | 142.2406 | 87.2286 | 91.5348 |
| hf_Reformer | 4 | 82.3598 | 82.1781 | 82.8209 | 126.1949 | 69.8371 | 69.6854 |
| Super_SloMo | 6 | 79.0805 | 79.3464 | 89.5435 | nan | 64.5105 | 66.2058 |
| yolov3 | 16 | 68.667 | 69.0193 | 85.3037 | nan | 62.9919 | 64.2431 |
| demucs | 4 | 57.9343 | 57.1161 | 57.2196 | 57.1238 | 57.0935 | 57.2062 |
| timm_regnet | 32 | 73.5289 | 81.4089 | 81.1558 | 91.6902 | 55.1698 | 60.0619 |
| vgg16 | 64 | 66.2422 | 66.2093 | 77.0694 | 67.8099 | 52.002 | 52.3533 |
| resnet152 | 32 | 91.0037 | 97.7275 | 73.4911 | nan | 45.7896 | 73.8826 |
| speech_transformer | 32 | 65.2427 | 75.1714 | 34.8839 | nan | 41.5773 | 40.3135 |
| fastNLP_Bert | 6 | 55.9758 | 62.4977 | 72.653 | nan | 37.2314 | 38.5491 |
| timm_efficientdet | 1 | 163.1827 | 214.6085 | 76.5472 | nan | 36.1618 | 110.5349 |
| attention_is_all_you_need_pytorch | 256 | 52.8984 | 59.2412 | 63.2279 | nan | 34.8035 | 37.186 |
| hf_Bart | 4 | 55.5883 | 67.7852 | 65.7889 | nan | 33.957 | 36.148 |
| mobilenet_v2 | 96 | 48.8565 | 49.4278 | 64.2011 | 47.2261 | 31.3401 | 32.1664 |
| hf_Albert | 8 | 68.2827 | 72.0985 | 88.2802 | nan | 29.3207 | 29.982 |
| pytorch_unet | 1 | 39.9271 | 40.1581 | 46.2402 | 36.8037 | 29.3201 | 29.9666 |
| hf_GPT2 | 4 | 52.4292 | 49.6814 | 60.1295 | 168.5094 | 25.4594 | 25.8753 |
| timm_vovnet | 32 | 34.752 | 38.1958 | 37.1268 | 40.731 | 24.8979 | 28.7185 |
| shufflenet_v2_x1_0 | 128 | 42.876 | 42.1499 | 41.6597 | 49.9317 | 24.2456 | 29.1135 |
| timm_efficientnet | 32 | 48.7395 | 61.4532 | 43.3363 | 69.7235 | 22.4523 | 37.767 |
| hf_Bert | 4 | 40.6596 | 58.173 | 44.0914 | nan | 21.2743 | 23.4495 |
| hf_DistilBert | 8 | 30.9806 | 31.8895 | 41.8606 | 84.3181 | 20.8157 | 21.2662 |
| resnet50 | 32 | 33.7115 | 35.1441 | 32.3154 | 41.5451 | 19.3801 | 27.46 |
| BERT_pytorch | 16 | 55.6925 | 66.4554 | 35.0948 | 66.3584 | 16.8192 | 24.9592 |
| timm_resnest | 32 | 25.0597 | 24.8603 | 29.4839 | 25.4079 | 12.8525 | 15.7415 |
| densenet121 | 4 | 72.9717 | 81.5106 | 29.9783 | 100.9377 | 12.6771 | 59.61 |
| mobilenet_v3_large | 32 | 34.9903 | 34.941 | 24.01 | 47.3614 | 11.9799 | 26.5817 |
| mnasnet1_0 | 32 | 28.9991 | 28.4173 | 23.1117 | 38.0174 | 11.4931 | 22.367 |
| pytorch_stargan | 16 | 16.102 | 15.896 | 15.4703 | nan | 10.9192 | 11.5913 |
| nvidia_deeprecommender | 256 | 10.3666 | 10.4037 | 14.8759 | 10.2899 | 10.4632 | 10.05 |
| timm_vision_transformer | 8 | 33.921 | 34.6595 | 16.5079 | 50.2954 | 9.9535 | 20.4789 |
| resnext50_32x4d | 8 | 33.0804 | 30.4899 | 15.5983 | 43.0998 | 8.4924 | 23.3554 |
| LearningToPaint | 96 | 15.4426 | 14.8511 | 12.7876 | 18.0053 | 8.4605 | 11.4183 |
| alexnet | 128 | 9.7884 | 9.8124 | 12.0045 | 10.5796 | 8.0901 | 8.1139 |
| tts_angular | 64 | 6.9398 | 6.5844 | 6.4183 | 6.8377 | 6.7018 | 7.2245 |
| pytorch_CycleGAN_and_pix2pix | 1 | 18.098 | 18.5743 | 10.1651 | nan | 6.6768 | 11.9156 |
| squeezenet1_1 | 32 | 15.1004 | 15.4611 | 10.1004 | 20.9457 | 6.215 | 11.7538 |
| resnet18 | 16 | 12.9642 | 13.1091 | 8.0182 | 16.4295 | 4.7231 | 11.7877 |
| functorch_dp_cifar10 | 64 | 14.21 | 15.0108 | 6.0075 | nan | 2.9591 | 15.0933 |
| pytorch_struct | 200 | 4.6757 | 6.1055 | 4.513 | 7.7912 | 2.277 | 3.7498 |
| drq | 1 | 3.8879 | 4.7963 | 1.9564 | 6.6998 | 1.3729 | 3.6489 |
| dcgan | 32 | 3.1322 | 3.4376 | 1.8964 | 4.4751 | 1.107 | 2.9838 |
| soft_actor_critic | 256 | 1.3741 | 1.8739 | 1.0807 | 2.8127 | 0.8557 | 1.4163 |
| lennard_jones | 1000 | 1.4503 | 2.1461 | 1.1727 | 3.1974 | 0.749 | 1.4673 |
| tacotron2 | 64 | 3526.5577 | 4226.6164 | 3367.1203 | 5061.5669 | nan | 3532.7074 |
| hf_GPT2_large | 4 | 209.2206 | 211.7662 | nan | nan | nan | 112.3685 |
| dlrm | 2048 | 501.5169 | 490.557 | nan | 499.3797 | nan | nan |
| hf_BigBird | 2 | 195.5097 | nan | nan | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+
~~~
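
For readers who want to summarize one backend column of the per-model speedup tables above, here is a minimal Python sketch of a geometric mean over per-model speedups. It assumes, as a reading of the tables rather than something the dashboard states, that a 0.0 entry marks a model with no result for that backend and should be left out. The helper name `geomean_speedup` and the four sample rows (inductor column of the torchbench speedup table) are illustrative only; this is not the script that produced the dashboard.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of the positive entries; None if there are none.

    Entries <= 0.0 are skipped (assumption: 0.0 means "no result").
    """
    valid = [s for s in speedups if s > 0.0]
    if not valid:
        return None
    # Average in log space to avoid overflow/underflow on long lists.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


# A handful of inductor speedups copied from the torchbench table above.
# dlrm is 0.0 there, so it is excluded by the filter.
inductor_sample = {
    "densenet121": 6.1007,
    "resnet18": 2.8116,
    "demucs": 1.0017,
    "dlrm": 0.0,
}

print(f"{geomean_speedup(inductor_sample.values()):.4f}")
```

Applying the same helper to a full column gives a per-backend summary, though it need not match the summary tables exactly, since those also drop models that fail the accuracy check.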

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0223 | 0.8377 | 2.3103 | 0.0 | 4.8405 | 1.6583 |
| MobileBertForMaskedLM | 32 | 1.0172 | 0.8422 | 2.0319 | 0.0 | 4.1581 | 1.8028 |
| CamemBert | 1 | 1.0447 | 0.8521 | 1.8013 | 0.0 | 3.7763 | 1.7973 |
| MobileBertForQuestionAnswering | 64 | 1.0168 | 0.8377 | 1.5134 | 0.0 | 3.6592 | 1.7789 |
| MT5ForConditionalGeneration | 8 | 1.0153 | 0.8552 | 1.5607 | 0.8664 | 3.4685 | 2.5255 |
| DistillGPT2 | 1 | 1.0365 | 0.8788 | 1.4926 | 0.0 | 2.704 | 2.0011 |
| GPT2ForSequenceClassification | 4 | 1.0029 | 0.9693 | 0.0 | 0.5045 | 2.3192 | 2.2924 |
| M2M100ForConditionalGeneration | 8 | 1.0065 | 0.9218 | 1.2466 | 0.7002 | 2.2067 | 1.7105 |
| ElectraForQuestionAnswering | 64 | 1.0004 | 0.9797 | 0.7678 | 0.0 | 2.0342 | 1.9779 |
| MegatronBertForQuestionAnswering | 16 | 1.0356 | 0.8521 | 1.0639 | 0.0 | 1.95 | 1.8031 |
| PLBartForConditionalGeneration | 16 | 1.0125 | 0.8352 | 1.0355 | 0.0 | 1.8827 | 1.6882 |
| MegatronBertForCausalLM | 16 | 1.0334 | 0.8527 | 0.9918 | 0.0 | 1.8022 | 1.7497 |
| LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9803 | 0.7756 | 0.0 | 1.7954 | 1.7491 |
| ElectraForCausalLM | 32 | 0.9998 | 0.9298 | 0.7149 | 0.0 | 1.7505 | 1.7562 |
| XGLMForCausalLM | 8 | 1.0122 | 0.8251 | 0.934 | 0.0 | 1.7391 | 1.7801 |
| T5Small | 1 | 1.0264 | 0.9043 | 1.1552 | 0.8555 | 1.7388 | 1.5015 |
| AlbertForQuestionAnswering | 4 | 0.9999 | 0.8859 | 0.0 | 0.0 | 1.6477 | 1.6393 |
| AlbertForMaskedLM | 4 | 1.0002 | 0.885 | 0.0 | 0.0 | 1.6361 | 1.6283 |
| MBartForConditionalGeneration | 16 | 1.0151 | 0.8351 | 0.9222 | 0.0 | 1.6334 | 1.5862 |
| PegasusForConditionalGeneration | 16 | 1.0127 | 0.8279 | 0.9093 | 0.6363 | 1.6253 | 1.529 |
| LayoutLMForMaskedLM | 16 | 1.0008 | 0.9707 | 0.7557 | 0.0 | 1.606 | 1.5814 |
| T5ForConditionalGeneration | 4 | 1.0079 | 0.9015 | 0.758 | 1.1634 | 1.6022 | 1.5676 |
| OPTForCausalLM | 32 | 1.0068 | 0.9306 | 0.7722 | 0.3392 | 1.5325 | 1.5097 |
| Speech2Text2ForCausalLM | 128 | 1.0069 | 0.9343 | 0.7224 | 0.8106 | 1.4927 | 1.4985 |
| RobertaForQuestionAnswering | 128 | 1.0003 | 0.9849 | 0.7793 | 0.0 | 1.4461 | 1.4066 |
| DistilBertForQuestionAnswering | 64 | 1.0007 | 0.9477 | 0.7432 | 0.3628 | 1.442 | 1.3996 |
| BertForQuestionAnswering | 128 | 1.0 | 0.9745 | 0.7777 | 0.0 | 1.4387 | 1.4119 |
| BartForConditionalGeneration | 2 | 1.0045 | 0.9697 | 0.0 | 0.0 | 1.4202 | 1.3891 |
| BartForCausalLM | 4 | 1.0011 | 0.9698 | 0.758 | 0.0 | 1.4151 | 1.4143 |
| RobertaForCausalLM | 64 | 1.0004 | 0.9603 | 0.7542 | 0.0 | 1.4004 | 1.3807 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0076 | 0.8829 | 0.7443 | 0.0 | 1.379 | 1.3854 |
| DebertaForMaskedLM | 4 | 0.9208 | 0.7366 | 0.8007 | 0.0 | 1.2999 | 1.1375 |
| BertForMaskedLM | 64 | 1.0005 | 0.9564 | 0.7403 | 0.0 | 1.2988 | 1.2848 |
| PLBartForCausalLM | 32 | 1.0067 | 0.9416 | 0.7926 | 0.8407 | 1.2218 | 1.2467 |
| BlenderbotSmallForCausalLM | 64 | 1.0018 | 0.9261 | 0.718 | 0.0 | 1.2135 | 1.2264 |
| DistilBertForMaskedLM | 64 | 1.0002 | 0.9392 | 0.7091 | 0.4614 | 1.2126 | 1.2118 |
| MBartForCausalLM | 32 | 1.0036 | 0.9427 | 0.7569 | 0.0 | 1.1666 | 1.1628 |
| TrOCRForCausalLM | 32 | 1.0017 | 0.9485 | 0.7578 | 0.0 | 1.1621 | 1.1628 |
| DebertaForQuestionAnswering | 8 | 0.9861 | 0.8674 | 0.7219 | 0.0 | 1.1368 | 1.211 |
| PegasusForCausalLM | 32 | 0.9991 | 0.9505 | 0.7532 | 0.8471 | 1.1354 | 1.1366 |
| BigBird | 1 | 0.978 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForMaskedLM | 4 | 5.3801 | 11.2863 | 35.4169 | nan | 105.4539 | 39.865 |
| DebertaForQuestionAnswering | 8 | 5.2502 | 11.0802 | 36.1816 | nan | 103.2721 | 39.8822 |
| MobileBertForMaskedLM | 32 | 10.1116 | 35.1002 | 58.9281 | nan | 84.8629 | 81.364 |
| MobileBertForQuestionAnswering | 64 | 10.3878 | 35.1481 | 58.0209 | nan | 82.8535 | 79.2318 |
| XGLMForCausalLM | 8 | 3.179 | 13.6261 | 28.1182 | nan | 81.5125 | 79.9322 |
| M2M100ForConditionalGeneration | 8 | 4.2771 | 15.8666 | 30.3578 | 424.5895 | 74.5191 | 70.9121 |
| MBartForConditionalGeneration | 16 | 4.0745 | 17.4895 | 30.0541 | nan | 60.9653 | 59.3313 |
| PegasusForConditionalGeneration | 16 | 3.8294 | 17.1739 | 27.4618 | 456.3515 | 60.6207 | 55.9846 |
| BartForConditionalGeneration | 2 | 4.0219 | 17.3592 | nan | nan | 59.9891 | 57.8295 |
| YituTechConvBert | 1 | 2.8404 | 11.0009 | 16.5533 | nan | 52.7953 | 48.6038 |
| MegatronBertForCausalLM | 16 | 4.1548 | 14.6527 | 22.9506 | nan | 48.7612 | 46.4789 |
| MegatronBertForQuestionAnswering | 16 | 3.9894 | 14.5425 | 22.8854 | nan | 47.1127 | 45.9904 |
| MT5ForConditionalGeneration | 8 | 4.0593 | 13.2671 | 21.6436 | 182.287 | 44.9256 | 42.6721 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.4997 | 11.5858 | 18.7286 | nan | 40.9287 | 39.0688 |
| T5Small | 1 | 2.6591 | 9.1279 | 13.151 | 109.2223 | 33.6601 | 32.7428 |
| T5ForConditionalGeneration | 4 | 2.6646 | 9.0338 | 13.4935 | 112.8446 | 33.5673 | 32.4119 |
| PLBartForConditionalGeneration | 16 | 2.1074 | 8.7542 | 13.4868 | nan | 33.5127 | 33.565 |
| LayoutLMForSequenceClassification | 16 | 2.3135 | 7.7566 | 11.8572 | nan | 31.3464 | 29.3416 |
| ElectraForCausalLM | 32 | 2.0451 | 7.4345 | 11.4587 | nan | 30.7162 | 28.5186 |
| PegasusForCausalLM | 32 | 1.5579 | 6.588 | 10.2721 | 137.4212 | 26.5697 | 24.9646 |
| LayoutLMForMaskedLM | 16 | 2.4908 | 7.7934 | 12.034 | nan | 26.472 | 24.7561 |
| MBartForCausalLM | 32 | 1.4904 | 6.6255 | 10.1448 | nan | 25.2115 | 23.8657 |
| RobertaForCausalLM | 64 | 1.8812 | 7.3097 | 10.4825 | nan | 24.9204 | 24.3817 |
| BertForMaskedLM | 64 | 1.8909 | 7.1967 | 11.0071 | nan | 24.4951 | 23.6636 |
| ElectraForQuestionAnswering | 64 | 2.001 | 7.3002 | 10.797 | nan | 24.4523 | 23.0111 |
| OPTForCausalLM | 32 | 1.5718 | 7.2784 | 11.4358 | 131.0921 | 24.0511 | 22.5163 |
| TrOCRForCausalLM | 32 | 1.4793 | 6.6125 | 9.8507 | nan | 23.9797 | 23.0131 |
| BartForCausalLM | 4 | 1.5506 | 6.6132 | 9.8652 | nan | 23.7612 | 22.66 |
| BertForQuestionAnswering | 128 | 1.8734 | 7.2512 | 11.0258 | nan | 23.5406 | 22.859 |
| RobertaForQuestionAnswering | 128 | 1.9098 | 7.1241 | 10.5937 | nan | 22.7278 | 21.4792 |
| CamemBert | 1 | 1.9359 | 7.5018 | 10.3727 | nan | 21.8414 | 20.8427 |
| AlbertForMaskedLM | 4 | 1.7175 | 7.3031 | nan | nan | 21.1479 | 20.3657 |
| AlbertForQuestionAnswering | 4 | 1.8347 | 7.0632 | nan | nan | 20.5988 | 19.4607 |
| GPT2ForSequenceClassification | 4 | 1.8037 | 6.5065 | nan | 110.2534 | 19.9998 | 19.5073 |
| BlenderbotSmallForCausalLM | 64 | 1.0406 | 4.4795 | 6.8495 | nan | 17.6594 | 16.8046 |
| Speech2Text2ForCausalLM | 128 | 0.9075 | 3.4969 | 5.4033 | 64.0122 | 16.2453 | 14.7561 |
| PLBartForCausalLM | 32 | 0.8422 | 3.4917 | 4.9935 | 75.2534 | 15.1231 | 15.0981 |
| DistilBertForMaskedLM | 64 | 0.8394 | 3.5858 | 6.2482 | 62.5082 | 14.807 | 14.1055 |
| DistilBertForQuestionAnswering | 64 | 0.8397 | 3.7957 | 5.8508 | 68.829 | 14.2802 | 13.5993 |
| DistillGPT2 | 1 | 0.9719 | 3.3796 | 4.7856 | nan | 14.0105 | 13.6432 |
| BigBird | 1 | 4.0268 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | nan | 1.1872 | 1.0783 | 1.1717 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.0323 | 1.5286 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0218 | 1.0756 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0074 | 1.5007 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 0.9844 | 1.025 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 0.9829 | 1.0613 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9691 | 1.1807 |
| T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9658 | 1.1446 |
| T5Small | 1 | 1.0 | 0.8935 | 0.3618 | 0.9973 | 0.9652 | 1.1096 |
| PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | 1.1 | 0.9327 | 0.9847 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.4377 | 1.1462 | 0.9159 | 1.0769 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9124 | 0.9464 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9037 | 1.0411 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | 0.3996 | nan | 0.9006 | 0.9641 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0053 |
| MegatronBertForCausalLM | 16 | 1.0001 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0207 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3468 | 1.0551 | 0.89 | 0.9848 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3787 | nan | 0.8834 | 0.9285 |
| RobertaForCausalLM | 64 | 0.9999 | 0.8994 | 0.3788 | nan | 0.8828 | 0.9282 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | 0.3997 | nan | 0.8816 | 0.9425 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | 0.4002 | nan | 0.8755 | 1.0595 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | 0.4067 | 0.919 | 0.875 | 0.919 |
| OPTForCausalLM | 32 | 1.0003 | 0.8678 | 0.3725 | 1.0333 | 0.8727 | 0.9449 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | 0.4146 | nan | 0.8523 | 0.9876 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.86 | 0.3635 | 1.0792 | 0.8215 | 0.8801 |
| CamemBert | 1 | 0.999 | 0.8143 | 0.4159 | nan | 0.8065 | 0.9306 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9234 | 0.4336 | nan | 0.8055 | 0.9516 |
| DistillGPT2 | 1 | 0.9975 | 0.8033 | 0.4021 | nan | 0.8048 | 0.9949 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.3532 | 1.0437 | 0.8039 | 0.898 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3978 | 0.9947 | 0.7975 | 0.8675 |
| ElectraForCausalLM | 32 | 0.9977 | 0.848 | 0.3928 | nan | 0.7949 | 0.8607 |
| YituTechConvBert | 1 | 0.9718 | 0.8664 | 0.4317 | nan | 0.7909 | 0.9314 |
| BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.778 | 0.859 |
| M2M100ForConditionalGeneration | 8 | 0.9892 | 0.9674 | 0.4275 | 1.0461 | 0.752 | 0.9892 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | 0.3466 | nan | 0.5931 | 0.7994 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | 0.3107 | nan | 0.4995 | 0.635 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9825 | 0.3622 | nan | 0.409 | 1.026 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3251 | nan | 0.3071 | 1.1616 |
| BigBird | 1 | 0.9748 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForMaskedLM | 4 | 266.4648 | 301.2449 | nan | nan | 163.2613 | 163.9857 |
| AlbertForQuestionAnswering | 4 | 264.314 | 298.5267 | nan | nan | 160.8391 | 161.5659 |
| BartForConditionalGeneration | 2 | 135.7444 | 140.5032 | nan | nan | 95.6537 | 97.8556 |
| BlenderbotSmallForConditionalGeneration | 64 | 109.2364 | 127.0885 | 151.5615 | nan | 79.9387 | 79.588 |
| BartForCausalLM | 4 | 111.9369 | 115.5414 | 147.9002 | nan | 79.15 | 79.0943 |
| BertForQuestionAnswering | 128 | 110.4708 | 113.2358 | 142.0924 | nan | 76.9385 | 78.3261 |
| RobertaForQuestionAnswering | 128 | 110.9423 | 112.6007 | 142.3053 | nan | 76.8231 | 78.8463 |
| LayoutLMForMaskedLM | 16 | 111.9368 | 115.4 | 148.1047 | nan | 70.2275 | 70.8414 |
| MBartForConditionalGeneration | 16 | 103.2824 | 126.8209 | 114.4297 | nan | 66.9643 | 70.8351 |
| PegasusForConditionalGeneration | 16 | 104.1201 | 126.843 | 112.8051 | 164.4206 | 66.8 | 72.9854 |
| DebertaForQuestionAnswering | 8 | 76.1169 | 86.5159 | 103.9189 | nan | 66.1531 | 61.7785 |
| T5ForConditionalGeneration | 4 | 100.9954 | 112.8121 | 134.1462 | 86.6883 | 63.5187 | 64.378 |
| PegasusForCausalLM | 32 | 68.7242
| 72.7706 | 91.5254 | 81.8106 | 60.5768 | 60.3738 | | MBartForCausalLM | 32 | 69.6191 | 74.0819 | 92.28 | nan | 59.9933 | 59.9371 | | TrOCRForCausalLM | 32 | 69.6037 | 75.1835 | 91.9451 | nan | 59.9421 | 59.9351 | | BertForMaskedLM | 64 | 75.4725 | 78.9032 | 101.889 | nan | 58.1885 | 58.7378 | | RobertaForCausalLM | 64 | 80.2354 | 83.6752 | 106.5029 | nan | 57.4648 | 58.2262 | | ElectraForQuestionAnswering | 64 | 114.7386 | 116.8161 | 149.0575 | nan | 56.3347 | 57.8761 | | LayoutLMForSequenceClassification | 16 | 97.1061 | 99.1705 | 125.3791 | nan | 54.1191 | 55.5783 | | MobileBertForQuestionAnswering | 64 | 190.5361 | 246.6218 | 118.0948 | nan | 53.3437 | 105.1289 | | XGLMForCausalLM | 8 | 87.3977 | 107.6528 | 93.7352 | nan | 52.8369 | 63.9088 | | M2M100ForConditionalGeneration | 8 | 124.6816 | 120.6299 | 88.5735 | 154.6503 | 50.6169 | 76.6523 | | DebertaForMaskedLM | 4 | 75.1184 | 97.4156 | 78.3828 | nan | 50.3674 | 56.7563 | | ElectraForCausalLM | 32 | 87.5247 | 93.7665 | 122.1338 | nan | 49.8239 | 49.7113 | | BlenderbotSmallForCausalLM | 64 | 58.6216 | 63.6498 | 81.5604 | nan | 48.3584 | 48.0312 | | MegatronBertForCausalLM | 16 | 87.7817 | 96.1121 | 83.9011 | nan | 47.167 | 57.5141 | | MobileBertForMaskedLM | 32 | 214.0348 | 241.6149 | 110.1628 | nan | 43.5724 | 101.4571 | | MegatronBertForQuestionAnswering | 16 | 79.9413 | 97.1358 | 76.7106 | nan | 43.4894 | 47.403 | | GPT2ForSequenceClassification | 4 | 91.9111 | 93.5004 | nan | 179.6119 | 39.0465 | 39.8145 | | T5Small | 1 | 63.1919 | 73.9268 | 53.1865 | 71.6808 | 38.9087 | 48.7533 | | DistilBertForMaskedLM | 64 | 45.0861 | 48.1106 | 63.7348 | 98.0482 | 37.2482 | 37.3007 | | OPTForCausalLM | 32 | 53.6738 | 58.4399 | 69.8753 | 159.2738 | 35.5267 | 35.821 | | PLBartForCausalLM | 32 | 39.0895 | 41.7897 | 49.4286 | 46.4865 | 31.6408 | 31.7126 | | PLBartForConditionalGeneration | 16 | 55.6642 | 66.8187 | 53.2622 | nan | 30.5678 | 34.4809 | | MT5ForConditionalGeneration | 8 | 104.1116 | 122.9308 | 57.7193 | 
102.1241 | 26.588 | 37.1221 | | DistilBertForQuestionAnswering | 64 | 30.5677 | 33.1067 | 41.1162 | 84.0993 | 21.0854 | 21.7901 | | Speech2Text2ForCausalLM | 128 | 30.3003 | 32.4641 | 42.2807 | 37.5361 | 20.5193 | 20.4287 | | YituTechConvBert | 1 | 62.0851 | 74.0989 | 27.1879 | nan | 13.8072 | 39.9998 | | CamemBert | 1 | 37.0307 | 46.364 | 21.9437 | nan | 11.158 | 22.7364 | | DistillGPT2 | 1 | 20.2655 | 23.6782 | 15.9943 | nan | 8.0009 | 10.8269 | | BigBird | 1 | 192.3145 | nan | nan | nan | nan | nan | | AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

timm_models suite with amp precision

Performance speedup ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | regnety_002 | 128 | 0.9781 | 0.9404 | 1.1136 | 0.8617 | 2.1425 | 1.4351 | | ghostnet_100 | 128 | 1.0033 | 0.9796 | 0.8937 | 0.9925 | 2.1277 | 1.7897 | | xcit_large_24_p8_224 | 5 | 1.0008 | 0.0 | 0.0 | 0.0 | 2.1168 | 1.8655 | | lcnet_050 | 128 | 0.9658 | 0.947 | 0.8468 | 1.0335 | 2.0285 | 1.6218 | | tnt_s_patch16_224 | 128 | 0.9999 | 0.9969 | 0.0 | 0.0 | 1.9232 | 1.8934 | | twins_pcpvt_base | 64 | 1.0062 | 0.93 | 0.9617 | 0.0 | 1.756 | 1.64 | | hrnet_w18 | 128 | 1.0034 | 1.0277 | 0.8658 | 0.0 | 1.6901 | 1.4398 | | res2net101_26w_4s | 64 | 1.0038 | 1.0123 | 0.9467 | 0.0 | 1.6128 | 1.3283 | | coat_lite_mini | 128 | 1.0 | 0.9885 | 0.8421 | 1.1522 | 1.5891 | 1.5719 | | dla102 | 128 | 1.0 | 0.9958 | 0.8306 | 1.3151 | 1.5816 | 1.5483 | | nfnet_l0 | 128 | 0.999 | 0.8101 | 0.7108 | 0.8479 | 1.558 | 1.4681 | | volo_d1_224 | 64 | 0.9999 | 0.9938 | 0.839 | 0.0 | 1.5526 | 1.5209 | | resnest101e | 64 | 1.0036 | 0.991 | 0.8138 | 0.0 | 1.5479 | 1.5026 | | gmlp_s16_224 | 128 | 0.9999 | 0.9956 | 0.7866 | 1.0145 | 1.5229 | 1.5014 | | gluon_inception_v3 | 128 | 1.0 | 0.9962 | 0.8543 | 1.1415 | 1.5057 | 1.4717 | | adv_inception_v3 | 128 | 0.9999 | 0.9964 | 0.8533 | 1.1424 | 1.5034 | 1.464 | | inception_v3 | 128 | 0.9998 | 0.9965 | 0.8532 | 1.1417 | 1.5005 | 1.4662 | | dm_nfnet_f0 | 128 | 0.9984 | 0.9993 | 0.8805 | 0.9227 | 1.5002 | 1.4296 | | gmixer_24_224 | 128 | 0.9999 | 0.8807 | 0.7214 | 0.9232 | 1.4936 | 1.4814 | | res2net50_14w_8s | 128 | 1.0001 | 0.9927 | 0.8097 | 0.9912 | 1.4852 | 1.4124 | | swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9588 | 0.0 | 0.0 | 1.4813 | 1.4135 | | 
mobilenetv3_large_100 | 128 | 0.9531 | 0.9449 | 0.7832 | 0.9312 | 1.4485 | 1.4297 | | selecsls42b | 128 | 0.9999 | 0.9956 | 0.8424 | 1.2844 | 1.443 | 1.4108 | | res2next50 | 128 | 0.9994 | 0.9953 | 0.8336 | 1.1382 | 1.4175 | 1.3462 | | mnasnet_100 | 128 | 0.9535 | 0.9431 | 0.7895 | 1.1803 | 1.416 | 1.4608 | | cait_m36_384 | 4 | 1.0005 | 1.0096 | 0.0 | 0.0 | 1.4152 | 1.3657 | | fbnetv3_b | 128 | 0.9526 | 0.9397 | 0.7747 | 0.0 | 1.4041 | 1.3937 | | mobilenetv2_100 | 128 | 0.951 | 0.9421 | 0.7223 | 1.1218 | 1.4007 | 1.4335 | | crossvit_9_240 | 128 | 1.0001 | 0.9942 | 0.8382 | 0.9173 | 1.3954 | 1.3682 | | convit_base | 64 | 1.0 | 0.9968 | 0.8322 | 1.2379 | 1.3906 | 1.3175 | | ese_vovnet19b_dw | 128 | 0.9704 | 0.9642 | 0.7679 | 1.1266 | 1.3718 | 1.3793 | | mobilevit_s | 64 | 0.9732 | 0.8144 | 0.6562 | 0.0 | 1.3608 | 1.3593 | | jx_nest_base | 32 | 1.0 | 0.9925 | 0.7963 | 0.0 | 1.3602 | 1.3268 | | fbnetc_100 | 128 | 0.9523 | 0.9398 | 0.7932 | 1.1204 | 1.3521 | 1.3732 | | spnasnet_100 | 128 | 0.9461 | 0.936 | 0.778 | 1.0918 | 1.3507 | 1.3272 | | resmlp_12_224 | 128 | 1.0 | 0.9986 | 0.7831 | 1.4885 | 1.3303 | 1.2978 | | poolformer_m36 | 64 | 0.9998 | 0.9983 | 0.8072 | 0.0 | 1.326 | 1.2952 | | tf_efficientnet_b0 | 128 | 0.9652 | 0.8074 | 0.6667 | 0.9502 | 1.3246 | 1.3554 | | botnet26t_256 | 128 | 0.9783 | 0.9733 | 0.8124 | 1.2779 | 1.3236 | 1.3302 | | pit_b_224 | 64 | 0.9998 | 0.9953 | 0.8207 | 0.9715 | 1.3156 | 1.3091 | | pnasnet5large | 16 | 1.0051 | 1.0406 | 0.8454 | 0.0 | 1.3115 | 1.2719 | | cspdarknet53 | 64 | 0.9431 | 0.9343 | 0.7569 | 1.0914 | 1.3027 | 1.3242 | | rexnet_100 | 128 | 0.9656 | 0.8497 | 0.6913 | 0.0 | 1.2723 | 1.2774 | | tinynet_a | 128 | 0.9723 | 0.8029 | 0.6588 | 0.7806 | 1.2714 | 1.3288 | | eca_botnext26ts_256 | 128 | 0.9801 | 0.8115 | 0.6714 | 1.072 | 1.2712 | 1.2678 | | mixer_b16_224 | 128 | 0.9999 | 0.9976 | 0.8028 | 0.9024 | 1.2593 | 1.2499 | | beit_base_patch16_224 | 64 | 1.0 | 0.9785 | 0.0 | 0.0 | 1.2465 | 1.2307 | | 
deit_base_distilled_patch16_224 | 64 | 0.9997 | 0.9913 | 0.7969 | 0.9754 | 1.2391 | 1.222 | | visformer_small | 128 | 0.9996 | 0.999 | 0.8425 | 0.0 | 1.231 | 1.1753 | | dpn107 | 32 | 0.9569 | 0.9281 | 0.7566 | 0.0 | 1.2072 | 1.183 | | sebotnet33ts_256 | 64 | 0.9657 | 0.8369 | 0.6797 | 0.9712 | 1.2037 | 1.1982 | | tf_mixnet_l | 128 | 0.9785 | 0.9092 | 0.7936 | 0.0 | 1.1794 | 1.1732 | | mixnet_l | 128 | 0.9797 | 0.9055 | 0.7949 | 0.0 | 1.1618 | 1.1555 | | gluon_xception65 | 32 | 0.9996 | 0.99 | 0.7474 | 0.0 | 1.159 | 1.1246 | | vit_base_patch16_224 | 64 | 1.0 | 0.9936 | 0.8311 | 0.9109 | 1.1576 | 1.1465 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.9815 | 0.8092 | 0.0 | 1.1355 | 1.0556 | | repvgg_a2 | 128 | 0.9426 | 0.9346 | 0.7987 | 1.0684 | 1.1034 | 1.1196 | | gernet_l | 128 | 0.947 | 0.9378 | 0.7679 | 1.063 | 1.0641 | 1.0776 | | convmixer_768_32 | 32 | 0.9999 | 0.9982 | 0.9233 | 0.0 | 1.056 | 1.0506 | | convnext_base | 64 | 0.9995 | 0.9953 | 0.8004 | 0.0 | 0.6631 | 0.6452 | | eca_halonext26ts | 128 | 0.9813 | 0.8163 | 0.679 | 0.0 | 0.0 | 0.0 | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +---------------------------------+----+-------------+---------------+----------------+-----------------+---------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+----+-------------+---------------+----------------+-----------------+---------------+------------------------+ | adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | rexnet_100 | 2 | pass | pass | pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass | | swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass | | tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass | | tf_mixnet_l | 2 | pass | pass | pass | pass | pass | 
pass | | tinynet_a | 2 | pass | pass | pass | pass | pass | pass | | visformer_small | 2 | pass | pass | pass | pass | pass | pass | | vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass | | coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass | | jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass | | res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass | | resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass | | tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass | | volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass | | beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass | | xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass | | cait_m36_384 | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass | | sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass | | selecsls42b | 2 | pass | pass | pass | pass | pass | pass | | resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 2 | 
pass | pass | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass | | fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass | | res2next50 | 2 | pass | pass | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | pass | pass | | pit_b_224 | 2 | pass | pass | pass | pass | pass | pass | | res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass | | repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | pass | pass | | pnasnet5large | 2 | pass | pass | pass | pass | pass | pass | | regnety_002 | 2 | pass | pass | pass | pass | pass | pass | | nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass | | convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_accuracy | | gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy | | poolformer_m36 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | 
+---------------------------------+----+-------------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 6.931 | 30.4857 | 57.3067 | nan | 150.2292 | 136.4794 | | twins_pcpvt_base | 64 | 2.9951 | 15.3979 | 26.8133 | nan | 130.834 | 129.5663 | | pnasnet5large | 16 | 5.6391 | 23.8783 | 41.1797 | nan | 92.764 | 87.4451 | | xcit_large_24_p8_224 | 5 | 3.5596 | nan | nan | nan | 92.0008 | 88.4883 | | cait_m36_384 | 4 | 3.815 | 19.5988 | nan | nan | 86.6496 | 82.3462 | | swin_base_patch4_window7_224 | 64 | 3.2903 | 13.3388 | nan | nan | 82.786 | 80.3175 | | resnest101e | 64 | 3.7295 | 16.5156 | 27.453 | nan | 79.7768 | 72.6928 | | convnext_base | 64 | 1.5651 | 6.9611 | 11.4667 | nan | 77.0414 | 72.2102 | | mobilevit_s | 64 | 2.0202 | 7.6097 | 15.5645 | nan | 71.0608 | 67.871 | | jx_nest_base | 32 | 2.0348 | 9.2647 | 16.102 | nan | 65.5891 | 63.0269 | | res2net101_26w_4s | 64 | 3.5565 | 16.9651 | 28.2784 | nan | 64.5817 | 60.2342 | | coat_lite_mini | 128 | 1.3165 | 5.4821 | 8.4032 | 113.7861 | 61.5139 | 59.5614 | | res2net50_14w_8s | 128 | 3.1746 | 14.6573 | 24.9225 | 338.0982 | 57.6172 | 53.9023 | | poolformer_m36 | 64 | 1.9082 | 7.4302 | 12.2258 | nan | 56.0511 | 52.1659 | | sebotnet33ts_256 | 64 | 1.9509 | 6.2361 | 13.7414 | 150.4399 | 48.1373 | 46.034 | | gmlp_s16_224 | 128 | 1.4987 | 7.4523 | 12.3432 | 197.9731 | 47.2443 | 44.1018 | | dpn107 | 32 | 4.306 | 13.9007 | 39.7645 | nan | 47.0541 | 43.9241 | | crossvit_9_240 | 128 | 1.872 | 8.655 | 13.5658 | 211.5802 | 45.8403 | 43.3586 | | fbnetv3_b | 128 | 3.531 
| 11.7421 | 28.2677 | nan | 45.6063 | 42.7345 | | gluon_xception65 | 32 | 2.3146 | 11.0104 | 18.8315 | nan | 45.2269 | 42.718 | | volo_d1_224 | 64 | 1.4525 | 7.6563 | 12.9226 | nan | 45.068 | 42.3737 | | tnt_s_patch16_224 | 128 | 2.0252 | 11.3096 | nan | nan | 43.6365 | 40.114 | | gluon_inception_v3 | 128 | 1.8479 | 8.4126 | 13.8175 | 190.1402 | 39.904 | 36.691 | | eca_botnext26ts_256 | 128 | 1.5477 | 5.0427 | 10.481 | 124.6177 | 39.792 | 39.2614 | | inception_v3 | 128 | 1.8263 | 8.4564 | 13.5193 | 192.7768 | 39.4032 | 36.5655 | | dla102 | 128 | 2.1101 | 9.6008 | 15.9518 | 256.3065 | 39.3235 | 36.4951 | | ghostnet_100 | 128 | 3.3877 | 9.9212 | 14.8002 | 199.2491 | 39.1227 | 36.6041 | | adv_inception_v3 | 128 | 1.8288 | 8.4327 | 13.5214 | 189.3149 | 38.6505 | 37.1302 | | gmixer_24_224 | 128 | 1.6172 | 8.3054 | 13.7966 | 188.9793 | 38.152 | 35.4031 | | tf_mixnet_l | 128 | 6.2003 | 12.9642 | 27.2885 | nan | 37.9262 | 36.0038 | | swsl_resnext101_32x16d | 32 | 2.196 | 9.2607 | 14.7766 | nan | 37.2668 | 34.7827 | | mixnet_l | 128 | 5.7372 | 12.878 | 26.4945 | nan | 37.1291 | 35.036 | | botnet26t_256 | 128 | 1.5761 | 4.4983 | 9.279 | 94.729 | 35.0677 | 34.1297 | | dm_nfnet_f0 | 128 | 2.3046 | 7.4564 | 11.0241 | 165.0416 | 33.9742 | 32.2667 | | res2next50 | 128 | 1.7858 | 8.2631 | 13.0833 | 205.2612 | 32.7063 | 30.2447 | | convit_base | 64 | 1.3665 | 6.2292 | 9.8919 | 148.4027 | 31.9169 | 30.7947 | | tinynet_a | 128 | 2.3455 | 8.1442 | 19.9872 | 202.1464 | 31.7491 | 30.1005 | | rexnet_100 | 128 | 2.1214 | 7.4928 | 17.1434 | nan | 31.5534 | 29.7933 | | tf_efficientnet_b0 | 128 | 2.0551 | 7.0695 | 16.2992 | 184.1684 | 27.8237 | 25.4269 | | cspdarknet53 | 64 | 2.6122 | 7.5394 | 18.6644 | 152.8653 | 27.1407 | 25.0663 | | spnasnet_100 | 128 | 2.3143 | 6.7729 | 17.1801 | 136.5093 | 26.5412 | 24.7965 | | mixer_b16_224 | 128 | 0.8987 | 3.7968 | 5.9729 | 87.4798 | 26.3356 | 25.4143 | | fbnetc_100 | 128 | 2.3512 | 7.072 | 17.4926 | 139.6215 | 25.8561 | 24.325 | | convmixer_768_32 | 
32 | 1.3946 | 6.5936 | 9.9769 | nan | 25.7018 | 24.5606 | | pit_b_224 | 64 | 1.248 | 5.4003 | 8.8056 | 109.6848 | 25.1364 | 23.8347 | | deit_base_distilled_patch16_224 | 64 | 1.0536 | 5.3764 | 7.4986 | 88.2815 | 25.1177 | 25.2995 | | visformer_small | 128 | 1.0378 | 4.2325 | 6.5274 | nan | 25.1067 | 23.9944 | | vit_base_patch16_224 | 64 | 1.1538 | 4.7124 | 8.0766 | 90.9786 | 24.8839 | 23.7608 | | nfnet_l0 | 128 | 2.0544 | 7.5252 | 10.9787 | 150.1953 | 24.7692 | 22.9685 | | resmlp_12_224 | 128 | 0.7995 | 3.1912 | 4.872 | 50.0284 | 24.6682 | 22.7338 | | mobilenetv3_large_100 | 128 | 1.8934 | 5.835 | 13.4829 | 146.7477 | 23.9546 | 23.1319 | | beit_base_patch16_224 | 64 | 1.4003 | 5.8776 | nan | nan | 23.4187 | 21.9121 | | mobilenetv2_100 | 128 | 1.9196 | 5.658 | 12.9948 | 117.1492 | 22.5906 | 21.5039 | | repvgg_a2 | 128 | 2.1604 | 6.1376 | 15.4625 | 200.9345 | 22.2589 | 21.2079 | | mnasnet_100 | 128 | 1.8818 | 5.5318 | 13.3177 | 109.261 | 21.8075 | 19.8951 | | regnety_002 | 128 | 1.7886 | 5.8636 | 13.114 | 118.7427 | 21.8 | 20.2519 | | gernet_l | 128 | 2.1647 | 6.2133 | 15.5146 | 115.6391 | 20.999 | 19.9016 | | selecsls42b | 128 | 0.9436 | 3.8595 | 5.9259 | 91.239 | 18.5606 | 17.3978 | | lcnet_050 | 128 | 1.1515 | 3.4232 | 7.5048 | 83.467 | 15.2332 | 14.6341 | | ese_vovnet19b_dw | 128 | 1.1361 | 3.1755 | 6.8034 | 68.4644 | 14.4607 | 13.635 | | eca_halonext26ts | 128 | 1.6025 | 5.1343 | 11.0416 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | tinynet_a | 128 | 
0.9889 | 0.7884 | 0.2766 | 0.4726 | 1.3706 | 1.5063 | | gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 | | gmlp_s16_224 | 128 | 0.9937 | 0.9715 | 0.3561 | 1.3557 | 1.2842 | 1.2997 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2666 | 0.548 | 1.1886 | 1.3558 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.1741 | 1.3111 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3633 | nan | 1.1605 | 1.2933 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.1474 | 1.3179 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2672 | 0.476 | 1.1068 | 1.2643 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1021 | 1.1167 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 1.0592 | 1.1461 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 1.0587 | 1.152 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0576 | 1.1456 | | convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0441 | 1.1492 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 1.0332 | 1.1293 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2681 | 0.3766 | 1.0332 | 1.1822 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.0227 | 1.1355 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 0.9889 | 1.0322 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.9862 | 1.0421 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9746 | 0.9788 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9622 | 1.0521 | | dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9309 | 0.9555 | 1.031 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.9489 | 1.0707 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9397 | 1.076 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2877 | nan | 0.9363 | 1.0878 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9931 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.9307 | 1.0268 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.9288 | 
0.9735 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.9181 | 1.0684 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9165 | 1.1168 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | nan | 0.9112 | 0.981 | | dpn107 | 32 | 0.997 | 0.9097 | 0.3529 | nan | 0.9069 | 0.9966 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8977 | 0.973 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8975 | 0.9763 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.8973 | 0.9876 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.8969 | 1.0032 | | mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.8927 | 0.963 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8765 | 0.8926 | 0.9897 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 1.222 | 0.8877 | 0.8929 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8872 | 0.8923 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8795 | 0.9819 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.877 | 0.9738 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.8719 | 0.9671 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.871 | 0.9804 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2717 | nan | 0.8701 | 1.0089 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8161 | 0.8619 | 0.9858 | | cspdarknet53 | 64 | 0.9915 | 0.8405 | 0.3241 | 0.8382 | 0.8607 | 1.0102 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | 0.8188 | 0.8449 | 0.9432 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.8371 | 1.0078 | | convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.806 | 0.9865 | | 
resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.7981 | 0.8121 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5513 | 0.745 | 0.8294 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.7194 | 1.0197 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.7141 | 0.9624 | | jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6644 | 0.8514 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.6295 | 0.7419 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5534 | 0.8298 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.2673 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 296.486 | 296.8557 | 321.1375 | nan | 280.7942 | 282.1234 | | tnt_s_patch16_224 | 128 | 363.6214 | 364.7147 | nan | nan | 189.0574 | 191.9649 | | hrnet_w18 | 128 | 297.7562 | 289.8731 | 344.5862 | nan | 188.7442 | 221.1564 | | convnext_base | 64 | 121.4143 | 121.6429 | 151.4207 | nan | 183.0963 | 187.5732 | | pnasnet5large | 16 | 229.4869 | 221.4024 | 257.83 | nan | 168.9324 | 173.7203 | | tf_mixnet_l | 128 | 195.1447 | 210.0817 | 240.6266 | nan | 162.2243 | 162.9431 | | mixnet_l | 128 | 186.6718 | 201.9991 | 230.0394 | nan | 157.43 | 158.2623 | | convit_base | 64 | 181.2822 | 181.7059 | 217.6003 | 146.3074 | 130.2669 | 137.5216 | | pit_b_224 | 64 | 154.8196 | 155.4562 | 188.2924 | 159.2155 | 117.5385 | 118.1064 | | cait_m36_384 | 4 | 165.9859 | 164.6621 | nan | nan | 117.3606 | 121.8195 | | dla102 | 128 
| 178.2148 | 179.1855 | 214.8749 | 135.5781 | 112.827 | 115.1884 | | poolformer_m36 | 64 | 148.8974 | 149.0187 | 183.9342 | nan | 112.0757 | 114.8132 | | beit_base_patch16_224 | 64 | 134.9152 | 137.7806 | nan | nan | 108.2304 | 109.6662 | | resnest101e | 64 | 167.9436 | 165.3773 | 199.6766 | nan | 108.0546 | 113.5857 | | adv_inception_v3 | 128 | 160.9935 | 161.5713 | 188.5554 | 140.888 | 107.1038 | 109.8873 | | inception_v3 | 128 | 160.6292 | 161.0654 | 188.1764 | 140.9007 | 107.0841 | 109.5081 | | gluon_inception_v3 | 128 | 160.974 | 161.5206 | 188.646 | 141.1272 | 107.0629 | 109.3358 | | vit_base_patch16_224 | 64 | 120.4637 | 121.1913 | 144.9533 | 132.1297 | 104.0355 | 104.985 | | swsl_resnext101_32x16d | 32 | 117.7744 | 120.0279 | 145.97 | nan | 103.9766 | 111.3971 | | res2net50_14w_8s | 128 | 145.4328 | 146.8044 | 179.7227 | 147.1104 | 99.6889 | 104.0853 | | swin_base_patch4_window7_224 | 64 | 147.0818 | 153.2668 | nan | nan | 99.4041 | 104.0568 | | res2next50 | 128 | 138.6325 | 138.6725 | 166.1529 | 121.6646 | 97.7916 | 102.4728 | | mixer_b16_224 | 128 | 118.3458 | 118.6056 | 147.521 | 131.0367 | 94.006 | 94.6202 | | dpn107 | 32 | 114.1541 | 115.883 | 142.7404 | nan | 93.7816 | 91.7772 | | gmlp_s16_224 | 128 | 136.292 | 136.5303 | 173.1511 | 134.0163 | 89.4771 | 90.6161 | | jx_nest_base | 32 | 118.8976 | 119.7334 | 149.3025 | nan | 87.3509 | 89.5918 | | dm_nfnet_f0 | 128 | 131.6929 | 131.5566 | 148.9204 | 142.1526 | 87.1907 | 91.5997 | | volo_d1_224 | 64 | 134.5478 | 134.9864 | 160.1128 | nan | 86.5848 | 88.2192 | | eca_botnext26ts_256 | 128 | 112.1036 | 135.4767 | 163.5368 | 102.4249 | 86.3948 | 86.6135 | | gluon_xception65 | 32 | 97.8576 | 98.5746 | 130.6599 | nan | 84.3137 | 86.6696 | | fbnetv3_b | 128 | 120.8277 | 122.5649 | 148.5787 | nan | 83.0267 | 84.5648 | | gmixer_24_224 | 128 | 119.7908 | 136.0844 | 166.2186 | 129.8953 | 80.2727 | 80.8411 | | visformer_small | 128 | 98.1431 | 97.9784 | 116.7458 | nan | 79.8902 | 83.4777 | | botnet26t_256 | 128 | 
106.0373 | 106.5229 | 127.725 | 81.1549 | 78.4519 | 77.8833 | | crossvit_9_240 | 128 | 109.2776 | 109.8997 | 130.2487 | 119.1501 | 78.2862 | 79.7036 | | res2net101_26w_4s | 64 | 121.7017 | 129.0133 | 126.692 | nan | 77.7325 | 95.1401 | | twins_pcpvt_base | 64 | 125.2206 | 143.6159 | 138.8939 | nan | 76.4556 | 81.9545 | | deit_base_distilled_patch16_224 | 64 | 94.1628 | 94.926 | 117.9446 | 96.3779 | 75.9289 | 76.9224 | | coat_lite_mini | 128 | 115.747 | 117.2487 | 137.6673 | 100.5884 | 72.9963 | 73.6657 | | gernet_l | 128 | 79.6333 | 80.5474 | 98.5914 | 71.0486 | 70.9857 | 70.0816 | | cspdarknet53 | 64 | 95.9161 | 96.6293 | 119.4499 | 82.8709 | 69.3552 | 68.1578 | | rexnet_100 | 128 | 90.8942 | 103.0498 | 127.0586 | nan | 68.8717 | 68.7081 | | repvgg_a2 | 128 | 79.6315 | 80.322 | 94.1804 | 70.3765 | 68.2501 | 67.1488 | | nfnet_l0 | 128 | 106.2833 | 131.0196 | 148.4146 | 124.5304 | 68.098 | 72.2292 | | sebotnet33ts_256 | 64 | 83.2625 | 96.032 | 118.2151 | 82.8439 | 66.8136 | 66.9768 | | tf_efficientnet_b0 | 128 | 90.5795 | 108.2595 | 131.1144 | 92.0013 | 65.9276 | 64.3744 | | mobilevit_s | 64 | 89.9782 | 107.4419 | 133.4305 | nan | 64.2514 | 64.3669 | | xcit_large_24_p8_224 | 5 | 128.6823 | nan | nan | nan | 62.0273 | 73.0838 | | fbnetc_100 | 128 | 87.9137 | 88.9833 | 105.5735 | 74.7723 | 61.9827 | 60.9264 | | tinynet_a | 128 | 75.7975 | 90.8109 | 110.6837 | 99.1569 | 58.0368 | 60.6362 | | spnasnet_100 | 128 | 76.555 | 77.3926 | 93.1575 | 66.2718 | 53.6136 | 54.5766 | | resmlp_12_224 | 128 | 68.1068 | 68.3201 | 87.2123 | 45.765 | 51.2553 | 52.5835 | | ese_vovnet19b_dw | 128 | 67.7937 | 68.2858 | 85.9112 | 58.4775 | 47.9989 | 47.7794 | | mnasnet_100 | 128 | 69.989 | 70.7781 | 84.6785 | 56.6127 | 47.2192 | 45.7042 | | ghostnet_100 | 128 | 95.9114 | 97.6094 | 107.551 | 101.2856 | 46.0617 | 54.3073 | | mobilenetv2_100 | 128 | 67.4917 | 68.2319 | 88.9909 | 57.3132 | 45.9347 | 44.8565 | | selecsls42b | 128 | 62.7781 | 62.9914 | 74.5785 | 48.8287 | 43.545 | 44.503 | | 
mobilenetv3_large_100 | 128 | 66.0257 | 66.6214 | 80.2942 | 68.1801 | 43.415 | 44.0017 | | regnety_002 | 128 | 57.9244 | 60.1452 | 46.9374 | 65.124 | 25.5235 | 37.5032 | | lcnet_050 | 128 | 34.0898 | 34.6161 | 38.6199 | 33.3388 | 16.3715 | 20.7263 | | eca_halonext26ts | 128 | 115.8373 | 139.2614 | 167.7038 | nan | nan | nan | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

williamwen42 commented 2 years ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report the mean compilation latency and the peak memory footprint reduction ratio. Caveats: 1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint. 2) Experiments do not cover dynamic shapes. 3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
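
As a rough illustration of the filtering step above, the sketch below computes a geometric mean speedup over only the models that pass accuracy. The function name `geomean_speedup`, the dictionary layout, and the choice to also drop non-positive speedups are assumptions made for this example, not the dashboard's actual code.

```python
import math


def geomean_speedup(speedups, accuracy):
    """Geometric mean of speedups over models whose accuracy status passes.

    Assumptions (hypothetical, for illustration only):
    - `speedups` maps model name -> speedup relative to eager.
    - `accuracy` maps model name -> status string such as "pass",
      "pass_due_to_skip", "fail_accuracy" or "fail_to_run".
    - Non-positive speedups (e.g. 0.0 for a failed run) are dropped,
      since the logarithm is undefined for them.
    """
    kept = [
        value
        for name, value in speedups.items()
        if accuracy.get(name, "").startswith("pass") and value > 0
    ]
    if not kept:
        return float("nan")
    return math.exp(sum(math.log(v) for v in kept) / len(kept))


# Inductor values taken from the torchbench amp tables in this report.
speedups = {"densenet121": 6.1007, "functorch_dp_cifar10": 5.0593, "resnet18": 2.8116}
accuracy = {"densenet121": "pass", "functorch_dp_cifar10": "fail_accuracy", "resnet18": "pass"}
print(f"{geomean_speedup(speedups, accuracy):.2f}x")
```

Here `functorch_dp_cifar10` is excluded because it fails accuracy, so only the two passing models contribute to the mean.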

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 98%, 41/42  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 95%, 40/42  | 93%, 57/61  |
|     aot_cudagraphs     | 85%, 46/54 | 81%, 34/42  | 89%, 54/61  |
|    nvprims_nvfuser     | 59%, 32/54 |  10%, 4/42  | 52%, 32/61  |
|        inductor        | 81%, 44/54 | 90%, 38/42  | 90%, 55/61  |
| inductor_no_cudagraphs | 85%, 46/54 | 90%, 38/42  | 90%, 55/61  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.22x    |    1.12x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.04x    |    1.08x    |
|        inductor        |   1.84x    |    1.74x    |    1.41x    |
| inductor_no_cudagraphs |   1.38x    |    1.53x    |    1.36x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.06    |    2.84     |    2.33     |
|       aot_eager        |    6.61    |    10.24    |    8.69     |
|     aot_cudagraphs     |    9.51    |    16.50    |    16.36    |
|    nvprims_nvfuser     |   66.11    |   133.86    |   151.35    |
|        inductor        |   33.97    |    38.49    |    44.16    |
| inductor_no_cudagraphs |   34.21    |    33.58    |    41.73    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.84x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.41x    |    0.38x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.01x    |    0.86x    |
|        inductor        |   0.83x    |    0.85x    |    0.94x    |
| inductor_no_cudagraphs |   0.96x    |    1.01x    |    1.05x    |
+------------------------+------------+-------------+-------------+
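
The report does not spell out how the compression ratio is computed. A plausible reading, consistent with "higher is better", is eager peak memory divided by the compiled backend's peak memory. The sketch below assumes that definition; the function name and the sample byte counts are hypothetical.

```python
def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Assumed definition: eager peak memory / compiled peak memory.

    A value above 1.0 means the compiled run used less peak memory than
    eager; a value below 1.0 means it used more.
    """
    if compiled_peak_bytes <= 0:
        return float("nan")  # no valid measurement for the compiled run
    return eager_peak_bytes / compiled_peak_bytes


# Hypothetical measurements: the compiled run peaks 25% above eager.
ratio = compression_ratio(10.0e9, 12.5e9)
print(f"{ratio:.2f}x")
```

Under this reading, the 0.41x torchbench figure for aot_cudagraphs would mean its peak memory is roughly 2.4 times that of eager.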

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the 2 most recent reports that actually run the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name: /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450

Passrate diff

~~~
+------------------------+-------------+------------+------------+
|        compiler        |    suite    | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
|        inductor        | torchbench  | 81%, 44/54 | 81%, 44/54 |
|        inductor        | huggingface | 87%, 39/45 | 87%, 39/45 |
|        inductor        | timm_models | 89%, 54/61 | 89%, 54/61 |
| inductor_no_cudagraphs | torchbench  | 85%, 46/54 | 87%, 47/54 |
| inductor_no_cudagraphs | huggingface | 91%, 41/45 | 91%, 41/45 |
| inductor_no_cudagraphs | timm_models | 89%, 54/61 | 89%, 54/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff

~~~
+------------------------+-------------+------------+-----------+
|        compiler        |    suite    | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
|        inductor        | torchbench  |   1.66x    |   1.67x   |
|        inductor        | huggingface |   1.62x    |   1.62x   |
|        inductor        | timm_models |   1.17x    |   1.19x   |
| inductor_no_cudagraphs | torchbench  |   1.29x    |   1.28x   |
| inductor_no_cudagraphs | huggingface |   1.54x    |   1.53x   |
| inductor_no_cudagraphs | timm_models |   1.15x    |   1.16x   |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings

~~~
+-------------+--------------------------------+---------------+------------------------+
|    suite    |              name              |   inductor    | inductor_no_cudagraphs |
+-------------+--------------------------------+---------------+------------------------+
| torchbench  |         hf_Longformer          |  fail_to_run  |      fail_to_run       |
| torchbench  |        vision_maskrcnn         |  fail_to_run  |      fail_to_run       |
| torchbench  |              moco              |  fail_to_run  |      fail_to_run       |
| torchbench  |           tacotron2            |  fail_to_run  |          pass          |
| torchbench  |           hf_BigBird           |  fail_to_run  |      fail_to_run       |
| torchbench  |       timm_efficientdet        |  fail_to_run  |      fail_to_run       |
| torchbench  |              dlrm              |  fail_to_run  |      fail_to_run       |
| torchbench  |      functorch_dp_cifar10      | fail_accuracy |     fail_accuracy      |
| torchbench  |       mobilenet_v3_large       | fail_accuracy |     fail_accuracy      |
| torchbench  |          tts_angular           |    0.0000     |         0.0000         |
| huggingface | MBartForConditionalGeneration  |  fail_to_run  |      fail_to_run       |
| huggingface | PLBartForConditionalGeneration |  fail_to_run  |      fail_to_run       |
| huggingface |            BigBird             |  fail_to_run  |      fail_to_run       |
| huggingface |     AllenaiLongformerBase      |  fail_to_run  |      fail_to_run       |
| timm_models |          convit_base           |  fail_to_run  |      fail_to_run       |
| timm_models |        eca_halonext26ts        |  fail_to_run  |     fail_accuracy      |
| timm_models |        gluon_xception65        | fail_accuracy |     fail_accuracy      |
| timm_models |         poolformer_m36         | fail_accuracy |     fail_accuracy      |
| timm_models |           fbnetv3_b            | fail_accuracy |     fail_accuracy      |
| timm_models |          spnasnet_100          | fail_accuracy |     fail_accuracy      |
+-------------+--------------------------------+---------------+------------------------+
~~~

Performance speedup warnings

~~~
+-------------+-----------------------+----------+------------------------+
|    suite    |         name          | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  |     hf_GPT2_large     |   0.0    |         1.8633         |
| torchbench  |       tacotron2       |   0.0    |         0.8824         |
| torchbench  |         dlrm          |   0.0    |          0.0           |
| torchbench  |      hf_BigBird       |   0.0    |          0.0           |
| torchbench  |     hf_Longformer     |   0.0    |          0.0           |
| torchbench  |         moco          |   0.0    |          0.0           |
| huggingface |        BigBird        |   0.0    |          0.0           |
| huggingface | AllenaiLongformerBase |   0.0    |          0.0           |
| timm_models |     convnext_base     |  0.6631  |         0.6452         |
| timm_models |   eca_halonext26ts    |   0.0    |          0.0           |
+-------------+-----------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings

~~~
+-------------+-------------------+----------+------------------------+
|    suite    |       name        | inductor | inductor_no_cudagraphs |
+-------------+-------------------+----------+------------------------+
| torchbench  |      yolov3       | 404.1995 |        416.489         |
| torchbench  | timm_efficientdet | 146.2678 |        144.8974        |
| torchbench  |    hf_T5_large    | 145.3088 |        139.5987        |
| timm_models |     hrnet_w18     | 150.2292 |        136.4794        |
| timm_models | twins_pcpvt_base  | 130.834  |        129.5663        |
+-------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings

~~~
+-------------+----------------------------------+----------+------------------------+
|    suite    |               name               | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench  |        speech_transformer        |  0.8824  |         0.8866         |
| torchbench  |  timm_vision_transformer_large   |  0.879   |         1.0245         |
| torchbench  |           BERT_pytorch           |  0.8778  |         1.0948         |
| torchbench  |           timm_resnest           |  0.8759  |         0.9953         |
| torchbench  |           densenet121            |  0.8753  |         1.0051         |
| torchbench  |          squeezenet1_1           |  0.8735  |         1.0608         |
| torchbench  |             hf_Bert              |  0.8728  |         0.942          |
| torchbench  |        shufflenet_v2_x1_0        |  0.8692  |         0.9802         |
| torchbench  |             resnet50             |  0.8659  |         0.885          |
| torchbench  |           hf_T5_large            |  0.8541  |         0.8541         |
| torchbench  |          hf_DistilBert           |  0.8348  |         0.9049         |
| torchbench  |           fastNLP_Bert           |  0.8013  |         1.0681         |
| torchbench  |             alexnet              |  0.7973  |         1.0079         |
| torchbench  |             hf_Bart              |  0.7933  |         0.9724         |
| torchbench  |        mobilenet_v3_large        |  0.791   |         0.8143         |
| torchbench  |           timm_vovnet            |  0.7799  |         0.8875         |
| torchbench  |         pytorch_stargan          |  0.7783  |         0.8847         |
| torchbench  |         resnext50_32x4d          |  0.7644  |         0.7753         |
| torchbench  |              vgg16               |  0.7633  |         1.0588         |
| torchbench  |            mnasnet1_0            |  0.7541  |         0.7741         |
| torchbench  |               drq                |  0.752   |         0.9256         |
| torchbench  |        soft_actor_critic         |  0.7295  |         1.0368         |
| torchbench  |         LearningToPaint          |  0.7295  |         0.925          |
| torchbench  |     timm_vision_transformer      |  0.7133  |         0.7227         |
| torchbench  |             resnet18             |  0.6102  |         0.6257         |
| torchbench  |           hf_Reformer            |  0.5851  |         1.0014         |
| torchbench  |          lennard_jones           |  0.564   |         0.9991         |
| torchbench  |      nvidia_deeprecommender      |  0.5596  |         0.5596         |
| torchbench  |       functorch_dp_cifar10       |  0.4481  |         0.4691         |
| torchbench  |          pytorch_struct          |  0.4235  |         0.4353         |
| torchbench  |              dcgan               |  0.2123  |         0.2137         |
| torchbench  |            tacotron2             |   nan    |         0.4112         |
| huggingface | MegatronBertForQuestionAnswering |  0.893   |         1.0053         |
| huggingface |     MegatronBertForCausalLM      |  0.8919  |         1.0207         |
| huggingface |  DistilBertForQuestionAnswering  |   0.89   |         0.9848         |
| huggingface |         BertForMaskedLM          |  0.8834  |         0.9285         |
| huggingface |        RobertaForCausalLM        |  0.8828  |         0.9282         |
| huggingface |         TrOCRForCausalLM         |  0.8816  |         0.9425         |
| huggingface |  MBartForConditionalGeneration   |  0.8755  |         1.0595         |
| huggingface |   MT5ForConditionalGeneration    |  0.875   |         0.919          |
| huggingface |          OPTForCausalLM          |  0.8727  |         0.9449         |
| huggingface |  PLBartForConditionalGeneration  |  0.8523  |         0.9876         |
| huggingface |      DistilBertForMaskedLM       |  0.8215  |         0.8801         |
| huggingface |            CamemBert             |  0.8065  |         0.9306         |
| huggingface |         XGLMForCausalLM          |  0.8055  |         0.9516         |
| huggingface |           DistillGPT2            |  0.8048  |         0.9949         |
| huggingface |     Speech2Text2ForCausalLM      |  0.8039  |         0.898          |
| huggingface |        PLBartForCausalLM         |  0.7975  |         0.8675         |
| huggingface |        ElectraForCausalLM        |  0.7949  |         0.8607         |
| huggingface |         YituTechConvBert         |  0.7909  |         0.9314         |
| huggingface |    BlenderbotSmallForCausalLM    |  0.778   |         0.859          |
| huggingface |  M2M100ForConditionalGeneration  |  0.752   |         0.9892         |
| huggingface |      MobileBertForMaskedLM       |  0.5931  |         0.7994         |
| huggingface |  MobileBertForQuestionAnswering  |  0.4995  |         0.635          |
| huggingface |        DebertaForMaskedLM        |  0.409   |         1.026          |
| huggingface |   DebertaForQuestionAnswering    |  0.3071  |         1.1616         |
| timm_models |        res2net101_26w_4s         |  0.8977  |         0.973          |
| timm_models |           inception_v3           |  0.8975  |         1.0248         |
| timm_models |        gluon_inception_v3        |  0.8975  |         1.0248         |
| timm_models |         adv_inception_v3         |  0.8975  |         1.0248         |
| timm_models |         gluon_xception65         |  0.8975  |         0.9763         |
| timm_models |            fbnetc_100            |  0.8973  |         0.9876         |
| timm_models |            hrnet_w18             |  0.8969  |         1.0032         |
| timm_models |          mixer_b16_224           |  0.8927  |         0.963          |
| timm_models |           selecsls42b            |  0.8926  |         0.9897         |
| timm_models |       vit_base_patch16_224       |  0.8877  |         0.8929         |
| timm_models | deit_base_distilled_patch16_224  |  0.8872  |         0.8923         |
| timm_models |           spnasnet_100           |  0.8795  |         0.9819         |
| timm_models |         res2net50_14w_8s         |  0.877   |         0.9738         |
| timm_models |            res2next50            |  0.8719  |         0.9671         |
| timm_models |           mnasnet_100            |  0.871   |         0.9804         |
| timm_models |             mixnet_l             |  0.8701  |         1.0089         |
| timm_models |             gernet_l             |  0.8619  |         0.9858         |
| timm_models |           cspdarknet53           |  0.8607  |         1.0102         |
| timm_models |          botnet26t_256           |  0.8503  |         0.9434         |
| timm_models |            lcnet_050             |  0.8449  |         0.9432         |
| timm_models |           regnety_002            |  0.8371  |         1.0078         |
| timm_models |          convnext_base           |  0.806   |         0.9865         |
| timm_models |          resmlp_12_224           |  0.7981  |         0.8121         |
| timm_models |         sebotnet33ts_256         |  0.745   |         0.8294         |
| timm_models |          coat_lite_mini          |  0.7194  |         1.0197         |
| timm_models |          crossvit_9_240          |  0.7141  |         0.9624         |
| timm_models |           jx_nest_base           |  0.6644  |         0.8514         |
| timm_models |   swin_base_patch4_window7_224   |  0.6295  |         0.7419         |
| timm_models |            repvgg_a2             |  0.5534  |         0.8298         |
+-------------+----------------------------------+----------+------------------------+
~~~
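
The four flagging rules at the top of this section can be sketched as a small predicate. The function name, argument layout, and the treatment of `nan` (left unflagged, because comparisons with `nan` are false) are assumptions for illustration; the thresholds are the ones stated above.

```python
SPEEDUP_MIN = 0.95       # flag if speedup < 0.95x
LATENCY_MAX_SEC = 120.0  # flag if compilation latency > 120 sec
COMPRESSION_MIN = 0.9    # flag if compression ratio < 0.9


def flag_model(accuracy, speedup, compile_sec, compression):
    """Return the list of warning categories one model/compiler pair triggers.

    Hypothetical helper: `accuracy` is a status string ("pass",
    "fail_accuracy", ...); the other arguments are floats. A nan metric
    never satisfies `<` or `>`, so it is not flagged by that rule here.
    """
    flags = []
    if not accuracy.startswith("pass"):
        flags.append("accuracy")
    if speedup < SPEEDUP_MIN:
        flags.append("speedup")
    if compile_sec > LATENCY_MAX_SEC:
        flags.append("compilation_latency")
    if compression < COMPRESSION_MIN:
        flags.append("compression_ratio")
    return flags


# Values from the torchbench amp tables in this report.
print(flag_model("pass", 1.0881, 404.1995, 0.9063))  # yolov3, inductor
print(flag_model("pass", 0.8824, 45.901, 0.4112))    # tacotron2, inductor_no_cudagraphs
```

With these inputs, yolov3 is flagged only for compilation latency, while tacotron2 under inductor_no_cudagraphs is flagged for both speedup and compression ratio, matching the rows it appears in above.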

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450

Performance speedup regressions

~~~
+------------------------+-------------+-------------+------------+
|        compiler        |    name     | prev_status | cur_status |
+------------------------+-------------+-------------+------------+
| inductor_no_cudagraphs | timm_vovnet |   0.9567    |   0.9018   |
+------------------------+-------------+-------------+------------+
~~~

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_324_20_11_22_performance_amp_450

Accuracy regressions

~~~
+------------------------+------------------+-------------+---------------+
|        compiler        |       name       | prev_status |  cur_status   |
+------------------------+------------------+-------------+---------------+
|        inductor        | ese_vovnet19b_dw |    pass     | fail_accuracy |
| inductor_no_cudagraphs | ese_vovnet19b_dw |    pass     | fail_accuracy |
+------------------------+------------------+-------------+---------------+
~~~

Performance speedup regressions

~~~
+------------------------+-----------------------+-------------+------------+
|        compiler        |         name          | prev_status | cur_status |
+------------------------+-----------------------+-------------+------------+
| inductor_no_cudagraphs | mobilenetv3_large_100 |   1.0811    |   0.9466   |
+------------------------+-----------------------+-------------+------------+
~~~
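
A minimal sketch of the regression rule described in this section, for the speedup metric only: a model regresses if it was not flagged in the previous report and is flagged in the current one. The function name and dictionary inputs are hypothetical, and `model_already_slow` is an invented entry included only to show that an already-flagged model is not reported again.

```python
def speedup_regressions(prev, cur, threshold=0.95):
    """Find models whose speedup newly dropped below `threshold`.

    `prev` and `cur` map model name -> speedup for one compiler in the
    previous and current report (hypothetical layout). A model counts as
    a regression only if it was at or above the threshold before and is
    below it now.
    """
    regressions = []
    for name, cur_value in cur.items():
        prev_value = prev.get(name)
        if prev_value is None:
            continue  # model absent from the previous report: cannot compare
        if prev_value >= threshold and cur_value < threshold:
            regressions.append((name, prev_value, cur_value))
    return regressions


# timm_vovnet values are from the table above; model_already_slow is invented.
prev = {"timm_vovnet": 0.9567, "model_already_slow": 0.90}
cur = {"timm_vovnet": 0.9018, "model_already_slow": 0.91}
for name, before, after in speedup_regressions(prev, cur):
    print(f"{name}: {before} -> {after}")
```

The same pattern would apply to the accuracy status, compilation latency, and compression ratio, each with its own flagging condition.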

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0021 | 0.9269 | 2.4759 | 0.7336 | 6.1007 | 1.3179 | | functorch_dp_cifar10 | 64 | 1.0025 | 0.959 | 2.3644 | 0.0 | 5.0593 | 0.9792 | | timm_efficientdet | 1 | 0.9846 | 0.8224 | 2.1111 | 0.0 | 4.754 | 1.5319 | | resnext50_32x4d | 8 | 1.0029 | 0.9629 | 1.9044 | 0.7558 | 3.5498 | 1.2678 | | timm_vision_transformer | 8 | 1.0015 | 0.8456 | 1.8027 | 0.59 | 3.4415 | 1.532 | | BERT_pytorch | 16 | 1.0065 | 0.8313 | 1.5678 | 0.8309 | 3.366 | 2.332 | | mobilenet_v3_large | 32 | 1.0033 | 1.0061 | 1.6121 | 0.7691 | 3.0827 | 1.3913 | | drq | 1 | 1.0088 | 0.8228 | 1.9929 | 0.608 | 3.0015 | 1.1596 | | dcgan | 32 | 0.9819 | 0.9163 | 1.6644 | 0.7106 | 2.8668 | 1.0467 | | resnet18 | 16 | 1.0017 | 0.997 | 1.584 | 0.7957 | 2.8116 | 1.2074 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9956 | 0.976 | 1.7732 | 0.0 | 2.7857 | 1.5668 | | hf_T5_large | 2 | 1.0196 | 0.8562 | 0.0 | 0.0 | 2.6305 | 2.1346 | | mnasnet1_0 | 32 | 1.0 | 1.021 | 1.2678 | 0.7709 | 2.6232 | 1.3497 | | squeezenet1_1 | 32 | 0.9942 | 0.9626 | 1.4509 | 0.7253 | 2.4487 | 1.3039 | | hf_Albert | 8 | 1.0025 | 0.9621 | 0.7743 | 0.0 | 2.3629 | 2.2746 | | hf_GPT2 | 4 | 1.0238 | 0.9834 | 0.8156 | 0.2905 | 2.128 | 1.9203 | | pytorch_struct | 200 | 0.9858 | 0.7499 | 1.0158 | 0.5997 | 2.1278 | 1.28 | | timm_efficientnet | 32 | 0.9617 | 0.819 | 1.0779 | 0.6806 | 2.1064 | 1.2819 | | hf_Bert | 4 | 1.0358 | 0.8393 | 0.9547 | 0.0 | 2.0757 | 1.8356 | | lennard_jones | 1000 | 0.9695 | 0.7698 | 1.3011 | 0.4693 | 2.0722 | 1.0623 | | resnet152 | 32 | 1.0018 | 1.0101 | 1.2666 | 0.0 | 2.0638 | 1.3011 | | timm_resnest | 32 | 1.0068 | 
1.0167 | 0.8369 | 0.9652 | 1.9156 | 1.6651 | | hf_T5 | 8 | 0.9997 | 0.919 | 0.0 | 1.3547 | 1.8668 | 1.8751 | | resnet50 | 32 | 1.0015 | 1.0246 | 1.0439 | 0.811 | 1.8012 | 1.3458 | | LearningToPaint | 96 | 1.003 | 1.0147 | 1.1631 | 0.8377 | 1.7935 | 1.3141 | | hf_Bart | 4 | 1.0128 | 0.8329 | 0.9446 | 0.0 | 1.758 | 1.8321 | | soft_actor_critic | 256 | 1.0176 | 0.7414 | 1.3388 | 0.5477 | 1.746 | 1.0551 | | shufflenet_v2_x1_0 | 128 | 1.0003 | 1.0223 | 0.9819 | 0.8605 | 1.703 | 1.4324 | | mobilenet_v2 | 96 | 1.0001 | 1.0065 | 0.7606 | 1.0345 | 1.5589 | 1.5181 | | speech_transformer | 32 | 0.9559 | 0.8244 | 1.7561 | 0.0 | 1.5304 | 1.5474 | | attention_is_all_you_need_pytorch | 256 | 1.0068 | 0.9027 | 0.8406 | 0.0 | 1.5285 | 1.58 | | timm_nfnet | 128 | 0.9991 | 1.0 | 0.8727 | 0.92 | 1.5078 | 1.4307 | | fastNLP_Bert | 6 | 0.9992 | 0.8893 | 0.7649 | 0.0 | 1.5043 | 1.4513 | | hf_DistilBert | 8 | 1.0017 | 0.9746 | 0.742 | 0.3688 | 1.492 | 1.4593 | | pytorch_stargan | 16 | 0.9951 | 1.0961 | 1.0396 | 0.0 | 1.4619 | 1.5082 | | pytorch_unet | 1 | 0.9996 | 0.9921 | 0.8639 | 1.0838 | 1.3621 | 1.331 | | timm_regnet | 32 | 0.9786 | 0.9422 | 0.9011 | 0.7826 | 1.3385 | 1.2223 | | timm_vovnet | 32 | 0.9205 | 0.8797 | 0.8693 | 0.7984 | 1.2996 | 1.1491 | | vgg16 | 64 | 0.9996 | 0.9972 | 0.8566 | 0.9734 | 1.2708 | 1.2639 | | Background_Matting | 4 | 0.9999 | 1.0155 | 0.8959 | 1.0571 | 1.2373 | 1.2197 | | Super_SloMo | 6 | 0.9993 | 0.995 | 0.8851 | 0.0 | 1.2277 | 1.1941 | | alexnet | 128 | 0.999 | 0.9977 | 0.815 | 0.928 | 1.2089 | 1.2102 | | hf_Reformer | 4 | 0.9987 | 1.0002 | 0.9928 | 0.6513 | 1.1761 | 1.1801 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9903 | 0.0 | 0.0 | 1.0903 | 1.0719 | | yolov3 | 16 | 0.9997 | 0.9906 | 0.8035 | 0.0 | 1.0881 | 1.0689 | | tts_angular | 64 | 0.975 | 0.9437 | 0.9749 | 0.9511 | 1.0167 | 1.0065 | | demucs | 4 | 1.0014 | 1.0 | 1.0002 | 0.998 | 1.0017 | 1.0006 | | nvidia_deeprecommender | 256 | 0.9989 | 0.996 | 0.697 | 1.0074 | 0.9892 | 1.0305 | | 
hf_GPT2_large | 4 | 1.0002 | 0.9907 | 0.0 | 0.0 | 0.0 | 1.8633 | | tacotron2 | 64 | 0.988 | 0.7645 | 0.9786 | 0.5994 | 0.0 | 0.8824 | | dlrm | 2048 | 1.01 | 1.1541 | 0.0 | 1.1273 | 0.0 | 0.0 | | hf_BigBird | 2 | 0.9843 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | 
pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | 
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | dlrm | 2 | pass | pass | fail_to_run | pass | fail_to_run | fail_to_run | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 3.1344 | 8.4555 | 11.8143 | nan | 404.1995 | 416.489 | | timm_efficientdet | 1 | 20.2634 | 39.3318 | 77.1986 | nan | 146.2678 | 144.8974 | | hf_T5_large | 2 | 14.8888 | 39.8919 | nan | nan | 145.3088 | 139.5987 | | timm_vision_transformer_large | 8 | 3.0569 | 15.4787 | nan | nan | 72.4614 | 69.0113 | | resnet152 | 32 | 2.7633 | 14.3372 
| 22.2274 | nan | 53.5223 | 52.8041 | | densenet121 | 4 | 2.4205 | 12.1053 | 19.2315 | 234.7833 | 52.0325 | 51.0502 | | attention_is_all_you_need_pytorch | 256 | 1.4406 | 7.2285 | 11.5793 | nan | 40.3385 | 39.5388 | | timm_resnest | 32 | 0.6749 | 2.5525 | 3.8456 | 66.1754 | 39.2066 | 38.0779 | | speech_transformer | 32 | 2.0109 | 8.9765 | 34.3297 | nan | 36.4502 | 34.9328 | | hf_Bart | 4 | 2.0912 | 9.0258 | 14.3815 | nan | 36.1823 | 35.5936 | | timm_vision_transformer | 8 | 1.031 | 4.6098 | 6.8122 | 84.3136 | 35.9455 | 35.3652 | | BERT_pytorch | 16 | 1.8364 | 7.6958 | 11.5181 | 134.15 | 35.846 | 35.7492 | | fastNLP_Bert | 6 | 1.9116 | 7.2674 | 11.4915 | nan | 33.1912 | 30.7348 | | timm_nfnet | 128 | 2.2018 | 7.4185 | 11.3308 | 159.2067 | 32.7195 | 32.5884 | | hf_T5 | 8 | 2.7481 | 9.1221 | nan | 107.4401 | 32.4549 | 31.0626 | | timm_regnet | 32 | 2.4918 | 8.6327 | 19.8749 | 145.7126 | 28.8942 | 28.5276 | | pytorch_stargan | 16 | 0.4649 | 2.1492 | 2.9664 | nan | 28.1889 | 26.0531 | | timm_efficientnet | 32 | 1.9295 | 7.3003 | 15.603 | 151.8019 | 27.4539 | 27.061 | | mobilenet_v3_large | 32 | 1.0471 | 4.797 | 7.372 | 119.58 | 26.0092 | 25.9043 | | hf_Bert | 4 | 1.8867 | 7.2902 | 10.3758 | nan | 24.4481 | 23.5085 | | hf_Albert | 8 | 1.6492 | 6.7058 | 10.2512 | nan | 23.2417 | 22.1951 | | functorch_dp_cifar10 | 64 | 0.3445 | 1.4309 | 2.1635 | nan | 22.6126 | 22.7953 | | pytorch_struct | 200 | 0.2883 | 0.8641 | 1.6177 | 7.6025 | 22.4876 | 22.2684 | | mnasnet1_0 | 32 | 0.9474 | 4.3783 | 6.6271 | 88.0587 | 21.6773 | 21.1308 | | hf_GPT2 | 4 | 1.8656 | 6.5011 | 9.3689 | 114.2712 | 21.0139 | 20.0732 | | resnet50 | 32 | 1.0144 | 4.9032 | 6.8316 | 99.4446 | 20.7345 | 20.4962 | | shufflenet_v2_x1_0 | 128 | 1.1795 | 5.4288 | 7.7034 | 101.6202 | 20.552 | 20.3635 | | resnext50_32x4d | 8 | 1.0804 | 4.6139 | 6.8963 | 84.0624 | 20.3942 | 19.7537 | | timm_vovnet | 32 | 1.6063 | 4.5008 | 10.0066 | 72.0347 | 20.2442 | 19.9906 | | mobilenet_v2 | 96 | 0.9566 | 4.9474 | 7.0675 | 116.743 | 
19.8935 | 19.3603 | | Background_Matting | 4 | 0.9599 | 4.4259 | 6.5969 | 96.3123 | 19.0115 | 17.7799 | | hf_Reformer | 4 | 1.6744 | 3.0553 | 5.483 | 17.8538 | 18.9639 | 16.267 | | Super_SloMo | 6 | 0.9908 | 4.0544 | 5.7403 | nan | 17.5075 | 16.591 | | hf_DistilBert | 8 | 0.8338 | 3.5538 | 5.7907 | 64.2335 | 15.5308 | 14.8387 | | resnet18 | 16 | 0.4733 | 1.8125 | 2.6291 | 38.0577 | 11.5643 | 11.5422 | | dcgan | 32 | 0.1827 | 0.4312 | 0.679 | 5.0555 | 10.388 | 9.9073 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4759 | 2.0248 | 2.852 | nan | 9.1949 | 9.0932 | | pytorch_unet | 1 | 0.4486 | 1.9193 | 2.7622 | 38.7551 | 8.4807 | 8.2167 | | LearningToPaint | 96 | 0.4988 | 1.9185 | 2.8943 | 47.227 | 8.2431 | 7.8678 | | squeezenet1_1 | 32 | 0.2749 | 0.9414 | 1.4095 | 6.9055 | 4.7654 | 4.5126 | | vgg16 | 64 | 0.209 | 0.6473 | 1.102 | 5.6309 | 4.2742 | 3.9492 | | drq | 1 | 0.3217 | 0.6423 | 1.0229 | 6.1416 | 4.2633 | 3.6368 | | nvidia_deeprecommender | 256 | 0.2211 | 0.5266 | 0.8912 | 5.6896 | 3.745 | 3.4994 | | soft_actor_critic | 256 | 0.2103 | 0.3601 | 0.5803 | 3.2728 | 3.5436 | 3.0174 | | alexnet | 128 | 0.1783 | 0.4468 | 0.7337 | 5.1929 | 3.325 | 3.3008 | | lennard_jones | 1000 | 0.1589 | 0.367 | 0.5531 | 2.942 | 2.3328 | 1.9799 | | tts_angular | 64 | 0.1937 | 0.2399 | 0.3659 | 1.5238 | 1.9197 | 1.7273 | | demucs | 4 | 0.3371 | 0.3585 | 0.3553 | 0.3639 | 0.2731 | 0.2673 | | hf_GPT2_large | 4 | 5.7771 | 20.2502 | nan | nan | nan | 58.0332 | | tacotron2 | 64 | 6.9867 | 20.1316 | 34.6561 | 91.2874 | nan | 45.901 | | dlrm | 2048 | 0.4851 | 0.8588 | nan | 4.3981 | nan | nan | | hf_BigBird | 2 | 4.0095 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2718 | 0.4638 | 1.2042 | 1.2318 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 1.0606 | 1.1512 | | Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3843 | nan | 1.0541 | 1.3039 | | timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3556 | 0.4815 | 1.0334 | 1.1302 | | hf_Albert | 8 | 1.0001 | 0.936 | 0.3267 | nan | 1.0313 | 1.4693 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3514 | nan | 1.005 | 1.1086 | | timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3079 | nan | 0.9991 | 1.0312 | | Background_Matting | 4 | 1.0142 | 0.9624 | 0.3723 | 0.9771 | 0.9916 | 1.0426 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.38 | 1.118 | 0.9649 | 1.1241 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8754 | 0.4232 | nan | 0.9506 | 1.0224 | | timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.8027 | 0.9345 | 1.0307 | | hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.014 | 0.9304 | 1.2458 | | resnet152 | 32 | 0.9937 | 0.8956 | 0.3631 | nan | 0.9125 | 0.9398 | | pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3572 | 0.8496 | 0.9111 | 1.0853 | | yolov3 | 16 | 0.9908 | 0.8381 | 0.3537 | nan | 0.9063 | 1.0466 | | speech_transformer | 32 | 0.9991 | 0.9812 | 0.3341 | nan | 0.8824 | 0.8866 | | timm_vision_transformer_large | 8 | 0.9974 | 0.8358 | nan | nan | 0.879 | 1.0245 | | BERT_pytorch | 16 | 1.0003 | 0.8822 | 0.3998 | 1.1039 | 0.8778 | 1.0948 | | timm_resnest | 32 | 0.9868 | 0.8711 | 0.3482 | 0.8451 | 0.8759 | 0.9953 | | densenet121 | 4 | 0.9857 | 0.8678 | 
0.3673 | 0.8452 | 0.8753 | 1.0051 | | squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.8735 | 1.0608 | | hf_Bert | 4 | 1.0 | 0.8759 | 0.3903 | nan | 0.8728 | 0.942 | | shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8692 | 0.9802 | | resnet50 | 32 | 0.9907 | 0.8629 | 0.3561 | 0.7806 | 0.8659 | 0.885 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 | | hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3414 | 1.0617 | 0.8348 | 0.9049 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8013 | 1.0681 | | alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.775 | 0.7973 | 1.0079 | | hf_Bart | 4 | 1.0002 | 0.8307 | 0.3635 | nan | 0.7933 | 0.9724 | | mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3448 | 0.7921 | 0.791 | 0.8143 | | timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3407 | 0.7755 | 0.7799 | 0.8875 | | pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4252 | nan | 0.7783 | 0.8847 | | resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3888 | 0.81 | 0.7644 | 0.7753 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.7341 | 0.7633 | 1.0588 | | mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3408 | 0.8226 | 0.7541 | 0.7741 | | drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 | | LearningToPaint | 96 | 0.9252 | 0.7196 | 0.383 | 0.6701 | 0.7295 | 0.925 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3918 | 1.0881 | 0.7133 | 0.7227 | | resnet18 | 16 | 0.9779 | 0.7727 | 0.3943 | 0.7314 | 0.6102 | 0.6257 | | hf_Reformer | 4 | 0.9996 | 0.9996 | 0.6037 | 0.9999 | 0.5851 | 1.0014 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5124 | 0.5596 | 0.5596 | 0.5596 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | 0.4481 | 0.4691 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 | | dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.7838 | 0.2123 | 0.2137 | | hf_GPT2_large | 4 
| 0.9956 | 0.8732 | nan | nan | nan | 1.1499 | | tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3906 | nan | 0.4112 | | dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | nan | nan | | hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | timm_vision_transformer_large | 8 | 183.9264 | 185.8603 | nan | nan | 168.7355 | 171.7368 | | Background_Matting | 4 | 141.7648 | 131.3625 | 148.8745 | 126.0196 | 107.7583 | 109.2646 | | hf_T5 | 8 | 174.4926 | 189.4402 | nan | 128.4126 | 93.3232 | 92.963 | | hf_T5_large | 2 | 218.1989 | 260.3696 | nan | nan | 89.1603 | 110.8583 | | timm_nfnet | 128 | 131.8874 | 131.5784 | 149.7607 | 142.2406 | 87.2286 | 91.5348 | | hf_Reformer | 4 | 82.3598 | 82.1781 | 82.8209 | 126.1949 | 69.8371 | 69.6854 | | Super_SloMo | 6 | 79.0805 | 79.3464 | 89.5435 | nan | 64.5105 | 66.2058 | | yolov3 | 16 | 68.667 | 69.0193 | 85.3037 | nan | 62.9919 | 64.2431 | | demucs | 4 | 57.9343 | 57.1161 | 57.2196 | 57.1238 | 57.0935 | 57.2062 | | timm_regnet | 32 | 73.5289 | 81.4089 | 81.1558 | 91.6902 | 55.1698 | 60.0619 | | vgg16 | 64 | 66.2422 | 66.2093 | 77.0694 | 67.8099 | 52.002 | 52.3533 | | resnet152 | 32 | 91.0037 | 97.7275 | 73.4911 | nan | 45.7896 | 73.8826 | | speech_transformer | 32 | 65.2427 | 75.1714 | 34.8839 | nan | 41.5773 | 40.3135 | | fastNLP_Bert | 6 | 55.9758 | 62.4977 | 72.653 | nan | 37.2314 | 
38.5491 | | timm_efficientdet | 1 | 163.1827 | 214.6085 | 76.5472 | nan | 36.1618 | 110.5349 | | attention_is_all_you_need_pytorch | 256 | 52.8984 | 59.2412 | 63.2279 | nan | 34.8035 | 37.186 | | hf_Bart | 4 | 55.5883 | 67.7852 | 65.7889 | nan | 33.957 | 36.148 | | mobilenet_v2 | 96 | 48.8565 | 49.4278 | 64.2011 | 47.2261 | 31.3401 | 32.1664 | | hf_Albert | 8 | 68.2827 | 72.0985 | 88.2802 | nan | 29.3207 | 29.982 | | pytorch_unet | 1 | 39.9271 | 40.1581 | 46.2402 | 36.8037 | 29.3201 | 29.9666 | | hf_GPT2 | 4 | 52.4292 | 49.6814 | 60.1295 | 168.5094 | 25.4594 | 25.8753 | | timm_vovnet | 32 | 34.752 | 38.1958 | 37.1268 | 40.731 | 24.8979 | 28.7185 | | shufflenet_v2_x1_0 | 128 | 42.876 | 42.1499 | 41.6597 | 49.9317 | 24.2456 | 29.1135 | | timm_efficientnet | 32 | 48.7395 | 61.4532 | 43.3363 | 69.7235 | 22.4523 | 37.767 | | hf_Bert | 4 | 40.6596 | 58.173 | 44.0914 | nan | 21.2743 | 23.4495 | | hf_DistilBert | 8 | 30.9806 | 31.8895 | 41.8606 | 84.3181 | 20.8157 | 21.2662 | | resnet50 | 32 | 33.7115 | 35.1441 | 32.3154 | 41.5451 | 19.3801 | 27.46 | | BERT_pytorch | 16 | 55.6925 | 66.4554 | 35.0948 | 66.3584 | 16.8192 | 24.9592 | | timm_resnest | 32 | 25.0597 | 24.8603 | 29.4839 | 25.4079 | 12.8525 | 15.7415 | | densenet121 | 4 | 72.9717 | 81.5106 | 29.9783 | 100.9377 | 12.6771 | 59.61 | | mobilenet_v3_large | 32 | 34.9903 | 34.941 | 24.01 | 47.3614 | 11.9799 | 26.5817 | | mnasnet1_0 | 32 | 28.9991 | 28.4173 | 23.1117 | 38.0174 | 11.4931 | 22.367 | | pytorch_stargan | 16 | 16.102 | 15.896 | 15.4703 | nan | 10.9192 | 11.5913 | | nvidia_deeprecommender | 256 | 10.3666 | 10.4037 | 14.8759 | 10.2899 | 10.4632 | 10.05 | | timm_vision_transformer | 8 | 33.921 | 34.6595 | 16.5079 | 50.2954 | 9.9535 | 20.4789 | | resnext50_32x4d | 8 | 33.0804 | 30.4899 | 15.5983 | 43.0998 | 8.4924 | 23.3554 | | LearningToPaint | 96 | 15.4426 | 14.8511 | 12.7876 | 18.0053 | 8.4605 | 11.4183 | | alexnet | 128 | 9.7884 | 9.8124 | 12.0045 | 10.5796 | 8.0901 | 8.1139 | | tts_angular | 64 | 6.9398 | 
6.5844 | 6.4183 | 6.8377 | 6.7018 | 7.2245 | | pytorch_CycleGAN_and_pix2pix | 1 | 18.098 | 18.5743 | 10.1651 | nan | 6.6768 | 11.9156 | | squeezenet1_1 | 32 | 15.1004 | 15.4611 | 10.1004 | 20.9457 | 6.215 | 11.7538 | | resnet18 | 16 | 12.9642 | 13.1091 | 8.0182 | 16.4295 | 4.7231 | 11.7877 | | functorch_dp_cifar10 | 64 | 14.21 | 15.0108 | 6.0075 | nan | 2.9591 | 15.0933 | | pytorch_struct | 200 | 4.6757 | 6.1055 | 4.513 | 7.7912 | 2.277 | 3.7498 | | drq | 1 | 3.8879 | 4.7963 | 1.9564 | 6.6998 | 1.3729 | 3.6489 | | dcgan | 32 | 3.1322 | 3.4376 | 1.8964 | 4.4751 | 1.107 | 2.9838 | | soft_actor_critic | 256 | 1.3741 | 1.8739 | 1.0807 | 2.8127 | 0.8557 | 1.4163 | | lennard_jones | 1000 | 1.4503 | 2.1461 | 1.1727 | 3.1974 | 0.749 | 1.4673 | | tacotron2 | 64 | 3526.5577 | 4226.6164 | 3367.1203 | 5061.5669 | nan | 3532.7074 | | hf_GPT2_large | 4 | 209.2206 | 211.7662 | nan | nan | nan | 112.3685 | | dlrm | 2048 | 501.5169 | 490.557 | nan | 499.3797 | nan | nan | | hf_BigBird | 2 | 195.5097 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~
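The "Absolute latency (ms)" table above can be turned into a per-model speedup by dividing the eager latency by a backend's latency, which is one reading of "normalizing against the performance of native pytorch". The following is a minimal sketch under that assumption; the helper name `speedup_over_eager` is hypothetical and not part of the benchmark scripts. Ratios computed this way may differ from the reported speedup columns, since those may come from separate measurements.

```python
import math

# (eager_ms, inductor_ms) pairs copied from the "Absolute latency (ms)" table above.
latency_ms = {
    "hf_T5": (174.4926, 93.3232),
    "timm_nfnet": (131.8874, 87.2286),
    "demucs": (57.9343, 57.0935),
    "tacotron2": (3526.5577, float("nan")),  # inductor has no measurement (nan)
}


def speedup_over_eager(eager_ms, backend_ms):
    """Hypothetical helper: eager_ms / backend_ms, or nan if a value is missing."""
    if math.isnan(eager_ms) or math.isnan(backend_ms) or backend_ms <= 0:
        return float("nan")
    return eager_ms / backend_ms


for name, (eager_ms, inductor_ms) in latency_ms.items():
    print(f"{name}: {speedup_over_eager(eager_ms, inductor_ms):.2f}x")
```

Rows where a backend reports `nan` propagate as `nan` instead of being silently counted as a speedup.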

huggingface suite with amp precision

Performance speedup
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| YituTechConvBert | 1 | 1.0223 | 0.8377 | 2.3103 | 0.0 | 4.8405 | 1.6583 |
| MobileBertForMaskedLM | 32 | 1.0172 | 0.8422 | 2.0319 | 0.0 | 4.1581 | 1.8028 |
| CamemBert | 1 | 1.0447 | 0.8521 | 1.8013 | 0.0 | 3.7763 | 1.7973 |
| MobileBertForQuestionAnswering | 64 | 1.0168 | 0.8377 | 1.5134 | 0.0 | 3.6592 | 1.7789 |
| MT5ForConditionalGeneration | 8 | 1.0153 | 0.8552 | 1.5607 | 0.8664 | 3.4685 | 2.5255 |
| DistillGPT2 | 1 | 1.0365 | 0.8788 | 1.4926 | 0.0 | 2.704 | 2.0011 |
| GPT2ForSequenceClassification | 4 | 1.0029 | 0.9693 | 0.0 | 0.5045 | 2.3192 | 2.2924 |
| M2M100ForConditionalGeneration | 8 | 1.0065 | 0.9218 | 1.2466 | 0.7002 | 2.2067 | 1.7105 |
| ElectraForQuestionAnswering | 64 | 1.0004 | 0.9797 | 0.7678 | 0.0 | 2.0342 | 1.9779 |
| MegatronBertForQuestionAnswering | 16 | 1.0356 | 0.8521 | 1.0639 | 0.0 | 1.95 | 1.8031 |
| PLBartForConditionalGeneration | 16 | 1.0125 | 0.8352 | 1.0355 | 0.0 | 1.8827 | 1.6882 |
| MegatronBertForCausalLM | 16 | 1.0334 | 0.8527 | 0.9918 | 0.0 | 1.8022 | 1.7497 |
| LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9803 | 0.7756 | 0.0 | 1.7954 | 1.7491 |
| ElectraForCausalLM | 32 | 0.9998 | 0.9298 | 0.7149 | 0.0 | 1.7505 | 1.7562 |
| XGLMForCausalLM | 8 | 1.0122 | 0.8251 | 0.934 | 0.0 | 1.7391 | 1.7801 |
| T5Small | 1 | 1.0264 | 0.9043 | 1.1552 | 0.8555 | 1.7388 | 1.5015 |
| AlbertForQuestionAnswering | 4 | 0.9999 | 0.8859 | 0.0 | 0.0 | 1.6477 | 1.6393 |
| AlbertForMaskedLM | 4 | 1.0002 | 0.885 | 0.0 | 0.0 | 1.6361 | 1.6283 |
| MBartForConditionalGeneration | 16 | 1.0151 | 0.8351 | 0.9222 | 0.0 | 1.6334 | 1.5862 |
| PegasusForConditionalGeneration | 16 | 1.0127 | 0.8279 | 0.9093 | 0.6363 | 1.6253 | 1.529 |
| LayoutLMForMaskedLM | 16 | 1.0008 | 0.9707 | 0.7557 | 0.0 | 1.606 | 1.5814 |
| T5ForConditionalGeneration | 4 | 1.0079 | 0.9015 | 0.758 | 1.1634 | 1.6022 | 1.5676 |
| OPTForCausalLM | 32 | 1.0068 | 0.9306 | 0.7722 | 0.3392 | 1.5325 | 1.5097 |
| Speech2Text2ForCausalLM | 128 | 1.0069 | 0.9343 | 0.7224 | 0.8106 | 1.4927 | 1.4985 |
| RobertaForQuestionAnswering | 128 | 1.0003 | 0.9849 | 0.7793 | 0.0 | 1.4461 | 1.4066 |
| DistilBertForQuestionAnswering | 64 | 1.0007 | 0.9477 | 0.7432 | 0.3628 | 1.442 | 1.3996 |
| BertForQuestionAnswering | 128 | 1.0 | 0.9745 | 0.7777 | 0.0 | 1.4387 | 1.4119 |
| BartForConditionalGeneration | 2 | 1.0045 | 0.9697 | 0.0 | 0.0 | 1.4202 | 1.3891 |
| BartForCausalLM | 4 | 1.0011 | 0.9698 | 0.758 | 0.0 | 1.4151 | 1.4143 |
| RobertaForCausalLM | 64 | 1.0004 | 0.9603 | 0.7542 | 0.0 | 1.4004 | 1.3807 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0076 | 0.8829 | 0.7443 | 0.0 | 1.379 | 1.3854 |
| DebertaForMaskedLM | 4 | 0.9208 | 0.7366 | 0.8007 | 0.0 | 1.2999 | 1.1375 |
| BertForMaskedLM | 64 | 1.0005 | 0.9564 | 0.7403 | 0.0 | 1.2988 | 1.2848 |
| PLBartForCausalLM | 32 | 1.0067 | 0.9416 | 0.7926 | 0.8407 | 1.2218 | 1.2467 |
| BlenderbotSmallForCausalLM | 64 | 1.0018 | 0.9261 | 0.718 | 0.0 | 1.2135 | 1.2264 |
| DistilBertForMaskedLM | 64 | 1.0002 | 0.9392 | 0.7091 | 0.4614 | 1.2126 | 1.2118 |
| MBartForCausalLM | 32 | 1.0036 | 0.9427 | 0.7569 | 0.0 | 1.1666 | 1.1628 |
| TrOCRForCausalLM | 32 | 1.0017 | 0.9485 | 0.7578 | 0.0 | 1.1621 | 1.1628 |
| DebertaForQuestionAnswering | 8 | 0.9861 | 0.8674 | 0.7219 | 0.0 | 1.1368 | 1.211 |
| PegasusForCausalLM | 32 | 0.9991 | 0.9505 | 0.7532 | 0.8471 | 1.1354 | 1.1366 |
| BigBird | 1 | 0.978 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
~~~
Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForMaskedLM | 4 | 5.3801 | 11.2863 | 35.4169 | nan | 105.4539 | 39.865 |
| DebertaForQuestionAnswering | 8 | 5.2502 | 11.0802 | 36.1816 | nan | 103.2721 | 39.8822 |
| MobileBertForMaskedLM | 32 | 10.1116 | 35.1002 | 58.9281 | nan | 84.8629 | 81.364 |
| MobileBertForQuestionAnswering | 64 | 10.3878 | 35.1481 | 58.0209 | nan | 82.8535 | 79.2318 |
| XGLMForCausalLM | 8 | 3.179 | 13.6261 | 28.1182 | nan | 81.5125 | 79.9322 |
| M2M100ForConditionalGeneration | 8 | 4.2771 | 15.8666 | 30.3578 | 424.5895 | 74.5191 | 70.9121 |
| MBartForConditionalGeneration | 16 | 4.0745 | 17.4895 | 30.0541 | nan | 60.9653 | 59.3313 |
| PegasusForConditionalGeneration | 16 | 3.8294 | 17.1739 | 27.4618 | 456.3515 | 60.6207 | 55.9846 |
| BartForConditionalGeneration | 2 | 4.0219 | 17.3592 | nan | nan | 59.9891 | 57.8295 |
| YituTechConvBert | 1 | 2.8404 | 11.0009 | 16.5533 | nan | 52.7953 | 48.6038 |
| MegatronBertForCausalLM | 16 | 4.1548 | 14.6527 | 22.9506 | nan | 48.7612 | 46.4789 |
| MegatronBertForQuestionAnswering | 16 | 3.9894 | 14.5425 | 22.8854 | nan | 47.1127 | 45.9904 |
| MT5ForConditionalGeneration | 8 | 4.0593 | 13.2671 | 21.6436 | 182.287 | 44.9256 | 42.6721 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.4997 | 11.5858 | 18.7286 | nan | 40.9287 | 39.0688 |
| T5Small | 1 | 2.6591 | 9.1279 | 13.151 | 109.2223 | 33.6601 | 32.7428 |
| T5ForConditionalGeneration | 4 | 2.6646 | 9.0338 | 13.4935 | 112.8446 | 33.5673 | 32.4119 |
| PLBartForConditionalGeneration | 16 | 2.1074 | 8.7542 | 13.4868 | nan | 33.5127 | 33.565 |
| LayoutLMForSequenceClassification | 16 | 2.3135 | 7.7566 | 11.8572 | nan | 31.3464 | 29.3416 |
| ElectraForCausalLM | 32 | 2.0451 | 7.4345 | 11.4587 | nan | 30.7162 | 28.5186 |
| PegasusForCausalLM | 32 | 1.5579 | 6.588 | 10.2721 | 137.4212 | 26.5697 | 24.9646 |
| LayoutLMForMaskedLM | 16 | 2.4908 | 7.7934 | 12.034 | nan | 26.472 | 24.7561 |
| MBartForCausalLM | 32 | 1.4904 | 6.6255 | 10.1448 | nan | 25.2115 | 23.8657 |
| RobertaForCausalLM | 64 | 1.8812 | 7.3097 | 10.4825 | nan | 24.9204 | 24.3817 |
| BertForMaskedLM | 64 | 1.8909 | 7.1967 | 11.0071 | nan | 24.4951 | 23.6636 |
| ElectraForQuestionAnswering | 64 | 2.001 | 7.3002 | 10.797 | nan | 24.4523 | 23.0111 |
| OPTForCausalLM | 32 | 1.5718 | 7.2784 | 11.4358 | 131.0921 | 24.0511 | 22.5163 |
| TrOCRForCausalLM | 32 | 1.4793 | 6.6125 | 9.8507 | nan | 23.9797 | 23.0131 |
| BartForCausalLM | 4 | 1.5506 | 6.6132 | 9.8652 | nan | 23.7612 | 22.66 |
| BertForQuestionAnswering | 128 | 1.8734 | 7.2512 | 11.0258 | nan | 23.5406 | 22.859 |
| RobertaForQuestionAnswering | 128 | 1.9098 | 7.1241 | 10.5937 | nan | 22.7278 | 21.4792 |
| CamemBert | 1 | 1.9359 | 7.5018 | 10.3727 | nan | 21.8414 | 20.8427 |
| AlbertForMaskedLM | 4 | 1.7175 | 7.3031 | nan | nan | 21.1479 | 20.3657 |
| AlbertForQuestionAnswering | 4 | 1.8347 | 7.0632 | nan | nan | 20.5988 | 19.4607 |
| GPT2ForSequenceClassification | 4 | 1.8037 | 6.5065 | nan | 110.2534 | 19.9998 | 19.5073 |
| BlenderbotSmallForCausalLM | 64 | 1.0406 | 4.4795 | 6.8495 | nan | 17.6594 | 16.8046 |
| Speech2Text2ForCausalLM | 128 | 0.9075 | 3.4969 | 5.4033 | 64.0122 | 16.2453 | 14.7561 |
| PLBartForCausalLM | 32 | 0.8422 | 3.4917 | 4.9935 | 75.2534 | 15.1231 | 15.0981 |
| DistilBertForMaskedLM | 64 | 0.8394 | 3.5858 | 6.2482 | 62.5082 | 14.807 | 14.1055 |
| DistilBertForQuestionAnswering | 64 | 0.8397 | 3.7957 | 5.8508 | 68.829 | 14.2802 | 13.5993 |
| DistillGPT2 | 1 | 0.9719 | 3.3796 | 4.7856 | nan | 14.0105 | 13.6432 |
| BigBird | 1 | 4.0268 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
~~~
Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | nan | 1.1872 | 1.0783 | 1.1717 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.0323 | 1.5286 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0218 | 1.0756 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0074 | 1.5007 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 0.9844 | 1.025 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 0.9829 | 1.0613 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9691 | 1.1807 |
| T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9658 | 1.1446 |
| T5Small | 1 | 1.0 | 0.8935 | 0.3618 | 0.9973 | 0.9652 | 1.1096 |
| PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | 1.1 | 0.9327 | 0.9847 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.4377 | 1.1462 | 0.9159 | 1.0769 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9124 | 0.9464 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9037 | 1.0411 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | 0.3996 | nan | 0.9006 | 0.9641 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0053 |
| MegatronBertForCausalLM | 16 | 1.0001 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0207 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3468 | 1.0551 | 0.89 | 0.9848 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3787 | nan | 0.8834 | 0.9285 |
| RobertaForCausalLM | 64 | 0.9999 | 0.8994 | 0.3788 | nan | 0.8828 | 0.9282 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | 0.3997 | nan | 0.8816 | 0.9425 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | 0.4002 | nan | 0.8755 | 1.0595 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | 0.4067 | 0.919 | 0.875 | 0.919 |
| OPTForCausalLM | 32 | 1.0003 | 0.8678 | 0.3725 | 1.0333 | 0.8727 | 0.9449 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | 0.4146 | nan | 0.8523 | 0.9876 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.86 | 0.3635 | 1.0792 | 0.8215 | 0.8801 |
| CamemBert | 1 | 0.999 | 0.8143 | 0.4159 | nan | 0.8065 | 0.9306 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9234 | 0.4336 | nan | 0.8055 | 0.9516 |
| DistillGPT2 | 1 | 0.9975 | 0.8033 | 0.4021 | nan | 0.8048 | 0.9949 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.3532 | 1.0437 | 0.8039 | 0.898 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3978 | 0.9947 | 0.7975 | 0.8675 |
| ElectraForCausalLM | 32 | 0.9977 | 0.848 | 0.3928 | nan | 0.7949 | 0.8607 |
| YituTechConvBert | 1 | 0.9718 | 0.8664 | 0.4317 | nan | 0.7909 | 0.9314 |
| BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.778 | 0.859 |
| M2M100ForConditionalGeneration | 8 | 0.9892 | 0.9674 | 0.4275 | 1.0461 | 0.752 | 0.9892 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | 0.3466 | nan | 0.5931 | 0.7994 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | 0.3107 | nan | 0.4995 | 0.635 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9825 | 0.3622 | nan | 0.409 | 1.026 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3251 | nan | 0.3071 | 1.1616 |
| BigBird | 1 | 0.9748 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForMaskedLM | 4 | 266.4648 | 301.2449 | nan | nan | 163.2613 | 163.9857 |
| AlbertForQuestionAnswering | 4 | 264.314 | 298.5267 | nan | nan | 160.8391 | 161.5659 |
| BartForConditionalGeneration | 2 | 135.7444 | 140.5032 | nan | nan | 95.6537 | 97.8556 |
| BlenderbotSmallForConditionalGeneration | 64 | 109.2364 | 127.0885 | 151.5615 | nan | 79.9387 | 79.588 |
| BartForCausalLM | 4 | 111.9369 | 115.5414 | 147.9002 | nan | 79.15 | 79.0943 |
| BertForQuestionAnswering | 128 | 110.4708 | 113.2358 | 142.0924 | nan | 76.9385 | 78.3261 |
| RobertaForQuestionAnswering | 128 | 110.9423 | 112.6007 | 142.3053 | nan | 76.8231 | 78.8463 |
| LayoutLMForMaskedLM | 16 | 111.9368 | 115.4 | 148.1047 | nan | 70.2275 | 70.8414 |
| MBartForConditionalGeneration | 16 | 103.2824 | 126.8209 | 114.4297 | nan | 66.9643 | 70.8351 |
| PegasusForConditionalGeneration | 16 | 104.1201 | 126.843 | 112.8051 | 164.4206 | 66.8 | 72.9854 |
| DebertaForQuestionAnswering | 8 | 76.1169 | 86.5159 | 103.9189 | nan | 66.1531 | 61.7785 |
| T5ForConditionalGeneration | 4 | 100.9954 | 112.8121 | 134.1462 | 86.6883 | 63.5187 | 64.378 |
| PegasusForCausalLM | 32 | 68.7242 | 72.7706 | 91.5254 | 81.8106 | 60.5768 | 60.3738 |
| MBartForCausalLM | 32 | 69.6191 | 74.0819 | 92.28 | nan | 59.9933 | 59.9371 |
| TrOCRForCausalLM | 32 | 69.6037 | 75.1835 | 91.9451 | nan | 59.9421 | 59.9351 |
| BertForMaskedLM | 64 | 75.4725 | 78.9032 | 101.889 | nan | 58.1885 | 58.7378 |
| RobertaForCausalLM | 64 | 80.2354 | 83.6752 | 106.5029 | nan | 57.4648 | 58.2262 |
| ElectraForQuestionAnswering | 64 | 114.7386 | 116.8161 | 149.0575 | nan | 56.3347 | 57.8761 |
| LayoutLMForSequenceClassification | 16 | 97.1061 | 99.1705 | 125.3791 | nan | 54.1191 | 55.5783 |
| MobileBertForQuestionAnswering | 64 | 190.5361 | 246.6218 | 118.0948 | nan | 53.3437 | 105.1289 |
| XGLMForCausalLM | 8 | 87.3977 | 107.6528 | 93.7352 | nan | 52.8369 | 63.9088 |
| M2M100ForConditionalGeneration | 8 | 124.6816 | 120.6299 | 88.5735 | 154.6503 | 50.6169 | 76.6523 |
| DebertaForMaskedLM | 4 | 75.1184 | 97.4156 | 78.3828 | nan | 50.3674 | 56.7563 |
| ElectraForCausalLM | 32 | 87.5247 | 93.7665 | 122.1338 | nan | 49.8239 | 49.7113 |
| BlenderbotSmallForCausalLM | 64 | 58.6216 | 63.6498 | 81.5604 | nan | 48.3584 | 48.0312 |
| MegatronBertForCausalLM | 16 | 87.7817 | 96.1121 | 83.9011 | nan | 47.167 | 57.5141 |
| MobileBertForMaskedLM | 32 | 214.0348 | 241.6149 | 110.1628 | nan | 43.5724 | 101.4571 |
| MegatronBertForQuestionAnswering | 16 | 79.9413 | 97.1358 | 76.7106 | nan | 43.4894 | 47.403 |
| GPT2ForSequenceClassification | 4 | 91.9111 | 93.5004 | nan | 179.6119 | 39.0465 | 39.8145 |
| T5Small | 1 | 63.1919 | 73.9268 | 53.1865 | 71.6808 | 38.9087 | 48.7533 |
| DistilBertForMaskedLM | 64 | 45.0861 | 48.1106 | 63.7348 | 98.0482 | 37.2482 | 37.3007 |
| OPTForCausalLM | 32 | 53.6738 | 58.4399 | 69.8753 | 159.2738 | 35.5267 | 35.821 |
| PLBartForCausalLM | 32 | 39.0895 | 41.7897 | 49.4286 | 46.4865 | 31.6408 | 31.7126 |
| PLBartForConditionalGeneration | 16 | 55.6642 | 66.8187 | 53.2622 | nan | 30.5678 | 34.4809 |
| MT5ForConditionalGeneration | 8 | 104.1116 | 122.9308 | 57.7193 | 102.1241 | 26.588 | 37.1221 |
| DistilBertForQuestionAnswering | 64 | 30.5677 | 33.1067 | 41.1162 | 84.0993 | 21.0854 | 21.7901 |
| Speech2Text2ForCausalLM | 128 | 30.3003 | 32.4641 | 42.2807 | 37.5361 | 20.5193 | 20.4287 |
| YituTechConvBert | 1 | 62.0851 | 74.0989 | 27.1879 | nan | 13.8072 | 39.9998 |
| CamemBert | 1 | 37.0307 | 46.364 | 21.9437 | nan | 11.158 | 22.7364 |
| DistillGPT2 | 1 | 20.2655 | 23.6782 | 15.9943 | nan | 8.0009 | 10.8269 |
| BigBird | 1 | 192.3145 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
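The per-model speedups above feed into the "Geometric mean speedup" reported in the summary. As a minimal sketch of that aggregation, the snippet below takes a few inductor speedups from the Performance speedup table and computes their geometric mean. It assumes that a speedup of `0.0` marks a model that did not run and should be excluded; the actual filtering in the dashboard scripts (which also removes models failing accuracy checks) may differ, and `geomean_speedup` is a hypothetical helper name.

```python
import math

# A few inductor speedups copied from the huggingface "Performance speedup" table.
inductor_speedups = {
    "YituTechConvBert": 4.8405,
    "CamemBert": 3.7763,
    "DistillGPT2": 2.704,
    "PegasusForCausalLM": 1.1354,
    "BigBird": 0.0,  # assumed to mean "did not run"; excluded below
}


def geomean_speedup(speedups):
    """Hypothetical helper: geometric mean over strictly positive speedups."""
    valid = [s for s in speedups if s > 0.0]
    if not valid:
        return float("nan")
    # exp(mean(log(x))) avoids overflow from multiplying many values directly.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))


print(f"inductor geomean: {geomean_speedup(inductor_speedups.values()):.2f}x")
```

A geometric mean is the usual choice for ratios like these because a 2x speedup and a 0.5x slowdown average out to 1.0x, which an arithmetic mean would not give.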

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| regnety_002 | 128 | 0.9781 | 0.9404 | 1.1136 | 0.8617 | 2.1425 | 1.4351 |
| ghostnet_100 | 128 | 1.0033 | 0.9796 | 0.8937 | 0.9925 | 2.1277 | 1.7897 |
| xcit_large_24_p8_224 | 5 | 1.0008 | 0.0 | 0.0 | 0.0 | 2.1168 | 1.8655 |
| lcnet_050 | 128 | 0.9658 | 0.947 | 0.8468 | 1.0335 | 2.0285 | 1.6218 |
| tnt_s_patch16_224 | 128 | 0.9999 | 0.9969 | 0.0 | 0.0 | 1.9232 | 1.8934 |
| twins_pcpvt_base | 64 | 1.0062 | 0.93 | 0.9617 | 0.0 | 1.756 | 1.64 |
| hrnet_w18 | 128 | 1.0034 | 1.0277 | 0.8658 | 0.0 | 1.6901 | 1.4398 |
| res2net101_26w_4s | 64 | 1.0038 | 1.0123 | 0.9467 | 0.0 | 1.6128 | 1.3283 |
| coat_lite_mini | 128 | 1.0 | 0.9885 | 0.8421 | 1.1522 | 1.5891 | 1.5719 |
| dla102 | 128 | 1.0 | 0.9958 | 0.8306 | 1.3151 | 1.5816 | 1.5483 |
| nfnet_l0 | 128 | 0.999 | 0.8101 | 0.7108 | 0.8479 | 1.558 | 1.4681 |
| volo_d1_224 | 64 | 0.9999 | 0.9938 | 0.839 | 0.0 | 1.5526 | 1.5209 |
| resnest101e | 64 | 1.0036 | 0.991 | 0.8138 | 0.0 | 1.5479 | 1.5026 |
| gmlp_s16_224 | 128 | 0.9999 | 0.9956 | 0.7866 | 1.0145 | 1.5229 | 1.5014 |
| gluon_inception_v3 | 128 | 1.0 | 0.9962 | 0.8543 | 1.1415 | 1.5057 | 1.4717 |
| adv_inception_v3 | 128 | 0.9999 | 0.9964 | 0.8533 | 1.1424 | 1.5034 | 1.464 |
| inception_v3 | 128 | 0.9998 | 0.9965 | 0.8532 | 1.1417 | 1.5005 | 1.4662 |
| dm_nfnet_f0 | 128 | 0.9984 | 0.9993 | 0.8805 | 0.9227 | 1.5002 | 1.4296 |
| gmixer_24_224 | 128 | 0.9999 | 0.8807 | 0.7214 | 0.9232 | 1.4936 | 1.4814 |
| res2net50_14w_8s | 128 | 1.0001 | 0.9927 | 0.8097 | 0.9912 | 1.4852 | 1.4124 |
| swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9588 | 0.0 | 0.0 | 1.4813 | 1.4135 |
| mobilenetv3_large_100 | 128 | 0.9531 | 0.9449 | 0.7832 | 0.9312 | 1.4485 | 1.4297 |
| selecsls42b | 128 | 0.9999 | 0.9956 | 0.8424 | 1.2844 | 1.443 | 1.4108 |
| res2next50 | 128 | 0.9994 | 0.9953 | 0.8336 | 1.1382 | 1.4175 | 1.3462 |
| mnasnet_100 | 128 | 0.9535 | 0.9431 | 0.7895 | 1.1803 | 1.416 | 1.4608 |
| cait_m36_384 | 4 | 1.0005 | 1.0096 | 0.0 | 0.0 | 1.4152 | 1.3657 |
| fbnetv3_b | 128 | 0.9526 | 0.9397 | 0.7747 | 0.0 | 1.4041 | 1.3937 |
| mobilenetv2_100 | 128 | 0.951 | 0.9421 | 0.7223 | 1.1218 | 1.4007 | 1.4335 |
| crossvit_9_240 | 128 | 1.0001 | 0.9942 | 0.8382 | 0.9173 | 1.3954 | 1.3682 |
| convit_base | 64 | 1.0 | 0.9968 | 0.8322 | 1.2379 | 1.3906 | 1.3175 |
| ese_vovnet19b_dw | 128 | 0.9704 | 0.9642 | 0.7679 | 1.1266 | 1.3718 | 1.3793 |
| mobilevit_s | 64 | 0.9732 | 0.8144 | 0.6562 | 0.0 | 1.3608 | 1.3593 |
| jx_nest_base | 32 | 1.0 | 0.9925 | 0.7963 | 0.0 | 1.3602 | 1.3268 |
| fbnetc_100 | 128 | 0.9523 | 0.9398 | 0.7932 | 1.1204 | 1.3521 | 1.3732 |
| spnasnet_100 | 128 | 0.9461 | 0.936 | 0.778 | 1.0918 | 1.3507 | 1.3272 |
| resmlp_12_224 | 128 | 1.0 | 0.9986 | 0.7831 | 1.4885 | 1.3303 | 1.2978 |
| poolformer_m36 | 64 | 0.9998 | 0.9983 | 0.8072 | 0.0 | 1.326 | 1.2952 |
| tf_efficientnet_b0 | 128 | 0.9652 | 0.8074 | 0.6667 | 0.9502 | 1.3246 | 1.3554 |
| botnet26t_256 | 128 | 0.9783 | 0.9733 | 0.8124 | 1.2779 | 1.3236 | 1.3302 |
| pit_b_224 | 64 | 0.9998 | 0.9953 | 0.8207 | 0.9715 | 1.3156 | 1.3091 |
| pnasnet5large | 16 | 1.0051 | 1.0406 | 0.8454 | 0.0 | 1.3115 | 1.2719 |
| cspdarknet53 | 64 | 0.9431 | 0.9343 | 0.7569 | 1.0914 | 1.3027 | 1.3242 |
| rexnet_100 | 128 | 0.9656 | 0.8497 | 0.6913 | 0.0 | 1.2723 | 1.2774 |
| tinynet_a | 128 | 0.9723 | 0.8029 | 0.6588 | 0.7806 | 1.2714 | 1.3288 |
| eca_botnext26ts_256 | 128 | 0.9801 | 0.8115 | 0.6714 | 1.072 | 1.2712 | 1.2678 |
| mixer_b16_224 | 128 | 0.9999 | 0.9976 | 0.8028 | 0.9024 | 1.2593 | 1.2499 |
| beit_base_patch16_224 | 64 | 1.0 | 0.9785 | 0.0 | 0.0 | 1.2465 | 1.2307 |
| deit_base_distilled_patch16_224 | 64 | 0.9997 | 0.9913 | 0.7969 | 0.9754 | 1.2391 | 1.222 |
| visformer_small | 128 | 0.9996 | 0.999 | 0.8425 | 0.0 | 1.231 | 1.1753 |
| dpn107 | 32 | 0.9569 | 0.9281 | 0.7566 | 0.0 | 1.2072 | 1.183 |
| sebotnet33ts_256 | 64 | 0.9657 | 0.8369 | 0.6797 | 0.9712 | 1.2037 | 1.1982 |
| tf_mixnet_l | 128 | 0.9785 | 0.9092 | 0.7936 | 0.0 | 1.1794 | 1.1732 |
| mixnet_l | 128 | 0.9797 | 0.9055 | 0.7949 | 0.0 | 1.1618 | 1.1555 |
| gluon_xception65 | 32 | 0.9996 | 0.99 | 0.7474 | 0.0 | 1.159 | 1.1246 |
| vit_base_patch16_224 | 64 | 1.0 | 0.9936 | 0.8311 | 0.9109 | 1.1576 | 1.1465 |
| swsl_resnext101_32x16d | 32 | 0.9989 | 0.9815 | 0.8092 | 0.0 | 1.1355 | 1.0556 |
| repvgg_a2 | 128 | 0.9426 | 0.9346 | 0.7987 | 1.0684 | 1.1034 | 1.1196 |
| gernet_l | 128 | 0.947 | 0.9378 | 0.7679 | 1.063 | 1.0641 | 1.0776 |
| convmixer_768_32 | 32 | 0.9999 | 0.9982 | 0.9233 | 0.0 | 1.056 | 1.0506 |
| convnext_base | 64 | 0.9995 | 0.9953 | 0.8004 | 0.0 | 0.6631 | 0.6452 |
| eca_halonext26ts | 128 | 0.9813 | 0.8163 | 0.679 | 0.0 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~
Accuracy
~~~
+---------------------------------+----+-------------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_accuracy |
| gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------------+---------------+----------------+-----------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | hrnet_w18 | 128 | 6.931 | 30.4857 | 57.3067 | nan | 150.2292 | 136.4794 | | twins_pcpvt_base | 64 | 2.9951 | 15.3979 | 26.8133 | nan | 130.834 | 129.5663 | | pnasnet5large | 16 | 5.6391 | 23.8783 | 41.1797 | nan | 92.764 | 87.4451 | | xcit_large_24_p8_224 | 5 | 3.5596 | nan | nan | nan | 92.0008 | 88.4883 | | cait_m36_384 | 4 | 3.815 | 19.5988 | nan | nan | 86.6496 | 82.3462 | | swin_base_patch4_window7_224 | 64 | 3.2903 | 13.3388 | nan | nan | 82.786 | 80.3175 | | resnest101e | 64 | 3.7295 | 16.5156 | 27.453 | nan | 79.7768 | 72.6928 | | convnext_base | 64 | 1.5651 | 6.9611 | 11.4667 | nan | 77.0414 | 72.2102 | | mobilevit_s | 64 | 2.0202 | 7.6097 | 15.5645 | nan | 71.0608 | 67.871 | | jx_nest_base | 32 | 2.0348 | 9.2647 | 16.102 | nan | 65.5891 | 63.0269 | | res2net101_26w_4s | 64 | 3.5565 | 16.9651 | 28.2784 | nan | 64.5817 | 60.2342 | | coat_lite_mini | 128 | 1.3165 | 5.4821 | 8.4032 | 113.7861 | 61.5139 | 59.5614 | | res2net50_14w_8s | 128 | 3.1746 | 14.6573 | 24.9225 | 338.0982 | 57.6172 | 53.9023 | | poolformer_m36 | 64 | 1.9082 | 7.4302 | 12.2258 | nan | 56.0511 | 52.1659 | | sebotnet33ts_256 | 64 | 1.9509 | 6.2361 | 13.7414 | 150.4399 | 48.1373 | 46.034 | | gmlp_s16_224 | 128 | 1.4987 | 7.4523 | 12.3432 | 197.9731 | 47.2443 | 44.1018 | | dpn107 | 32 | 4.306 | 13.9007 | 39.7645 | nan | 47.0541 | 43.9241 | | crossvit_9_240 | 128 | 1.872 | 8.655 | 13.5658 | 211.5802 | 45.8403 | 43.3586 | | fbnetv3_b | 128 | 3.531 
| 11.7421 | 28.2677 | nan | 45.6063 | 42.7345 | | gluon_xception65 | 32 | 2.3146 | 11.0104 | 18.8315 | nan | 45.2269 | 42.718 | | volo_d1_224 | 64 | 1.4525 | 7.6563 | 12.9226 | nan | 45.068 | 42.3737 | | tnt_s_patch16_224 | 128 | 2.0252 | 11.3096 | nan | nan | 43.6365 | 40.114 | | gluon_inception_v3 | 128 | 1.8479 | 8.4126 | 13.8175 | 190.1402 | 39.904 | 36.691 | | eca_botnext26ts_256 | 128 | 1.5477 | 5.0427 | 10.481 | 124.6177 | 39.792 | 39.2614 | | inception_v3 | 128 | 1.8263 | 8.4564 | 13.5193 | 192.7768 | 39.4032 | 36.5655 | | dla102 | 128 | 2.1101 | 9.6008 | 15.9518 | 256.3065 | 39.3235 | 36.4951 | | ghostnet_100 | 128 | 3.3877 | 9.9212 | 14.8002 | 199.2491 | 39.1227 | 36.6041 | | adv_inception_v3 | 128 | 1.8288 | 8.4327 | 13.5214 | 189.3149 | 38.6505 | 37.1302 | | gmixer_24_224 | 128 | 1.6172 | 8.3054 | 13.7966 | 188.9793 | 38.152 | 35.4031 | | tf_mixnet_l | 128 | 6.2003 | 12.9642 | 27.2885 | nan | 37.9262 | 36.0038 | | swsl_resnext101_32x16d | 32 | 2.196 | 9.2607 | 14.7766 | nan | 37.2668 | 34.7827 | | mixnet_l | 128 | 5.7372 | 12.878 | 26.4945 | nan | 37.1291 | 35.036 | | botnet26t_256 | 128 | 1.5761 | 4.4983 | 9.279 | 94.729 | 35.0677 | 34.1297 | | dm_nfnet_f0 | 128 | 2.3046 | 7.4564 | 11.0241 | 165.0416 | 33.9742 | 32.2667 | | res2next50 | 128 | 1.7858 | 8.2631 | 13.0833 | 205.2612 | 32.7063 | 30.2447 | | convit_base | 64 | 1.3665 | 6.2292 | 9.8919 | 148.4027 | 31.9169 | 30.7947 | | tinynet_a | 128 | 2.3455 | 8.1442 | 19.9872 | 202.1464 | 31.7491 | 30.1005 | | rexnet_100 | 128 | 2.1214 | 7.4928 | 17.1434 | nan | 31.5534 | 29.7933 | | tf_efficientnet_b0 | 128 | 2.0551 | 7.0695 | 16.2992 | 184.1684 | 27.8237 | 25.4269 | | cspdarknet53 | 64 | 2.6122 | 7.5394 | 18.6644 | 152.8653 | 27.1407 | 25.0663 | | spnasnet_100 | 128 | 2.3143 | 6.7729 | 17.1801 | 136.5093 | 26.5412 | 24.7965 | | mixer_b16_224 | 128 | 0.8987 | 3.7968 | 5.9729 | 87.4798 | 26.3356 | 25.4143 | | fbnetc_100 | 128 | 2.3512 | 7.072 | 17.4926 | 139.6215 | 25.8561 | 24.325 | | convmixer_768_32 | 
32 | 1.3946 | 6.5936 | 9.9769 | nan | 25.7018 | 24.5606 | | pit_b_224 | 64 | 1.248 | 5.4003 | 8.8056 | 109.6848 | 25.1364 | 23.8347 | | deit_base_distilled_patch16_224 | 64 | 1.0536 | 5.3764 | 7.4986 | 88.2815 | 25.1177 | 25.2995 | | visformer_small | 128 | 1.0378 | 4.2325 | 6.5274 | nan | 25.1067 | 23.9944 | | vit_base_patch16_224 | 64 | 1.1538 | 4.7124 | 8.0766 | 90.9786 | 24.8839 | 23.7608 | | nfnet_l0 | 128 | 2.0544 | 7.5252 | 10.9787 | 150.1953 | 24.7692 | 22.9685 | | resmlp_12_224 | 128 | 0.7995 | 3.1912 | 4.872 | 50.0284 | 24.6682 | 22.7338 | | mobilenetv3_large_100 | 128 | 1.8934 | 5.835 | 13.4829 | 146.7477 | 23.9546 | 23.1319 | | beit_base_patch16_224 | 64 | 1.4003 | 5.8776 | nan | nan | 23.4187 | 21.9121 | | mobilenetv2_100 | 128 | 1.9196 | 5.658 | 12.9948 | 117.1492 | 22.5906 | 21.5039 | | repvgg_a2 | 128 | 2.1604 | 6.1376 | 15.4625 | 200.9345 | 22.2589 | 21.2079 | | mnasnet_100 | 128 | 1.8818 | 5.5318 | 13.3177 | 109.261 | 21.8075 | 19.8951 | | regnety_002 | 128 | 1.7886 | 5.8636 | 13.114 | 118.7427 | 21.8 | 20.2519 | | gernet_l | 128 | 2.1647 | 6.2133 | 15.5146 | 115.6391 | 20.999 | 19.9016 | | selecsls42b | 128 | 0.9436 | 3.8595 | 5.9259 | 91.239 | 18.5606 | 17.3978 | | lcnet_050 | 128 | 1.1515 | 3.4232 | 7.5048 | 83.467 | 15.2332 | 14.6341 | | ese_vovnet19b_dw | 128 | 1.1361 | 3.1755 | 6.8034 | 68.4644 | 14.4607 | 13.635 | | eca_halonext26ts | 128 | 1.6025 | 5.1343 | 11.0416 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | tinynet_a | 128 | 
0.9889 | 0.7884 | 0.2766 | 0.4726 | 1.3706 | 1.5063 | | gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 | | gmlp_s16_224 | 128 | 0.9937 | 0.9715 | 0.3561 | 1.3557 | 1.2842 | 1.2997 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2666 | 0.548 | 1.1886 | 1.3558 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.1741 | 1.3111 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3633 | nan | 1.1605 | 1.2933 | | rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.1474 | 1.3179 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2672 | 0.476 | 1.1068 | 1.2643 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1021 | 1.1167 | | resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 1.0592 | 1.1461 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 1.0587 | 1.152 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0576 | 1.1456 | | convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0441 | 1.1492 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 1.0332 | 1.1293 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2681 | 0.3766 | 1.0332 | 1.1822 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.0227 | 1.1355 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 0.9889 | 1.0322 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.9862 | 1.0421 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9746 | 0.9788 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9622 | 1.0521 | | dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9309 | 0.9555 | 1.031 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.9489 | 1.0707 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9397 | 1.076 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2877 | nan | 0.9363 | 1.0878 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9931 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.9307 | 1.0268 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.9288 | 
0.9735 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.9181 | 1.0684 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9165 | 1.1168 | | swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | nan | 0.9112 | 0.981 | | dpn107 | 32 | 0.997 | 0.9097 | 0.3529 | nan | 0.9069 | 0.9966 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8977 | 0.973 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8975 | 0.9763 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.8973 | 0.9876 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.8969 | 1.0032 | | mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.8927 | 0.963 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8765 | 0.8926 | 0.9897 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 1.222 | 0.8877 | 0.8929 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8872 | 0.8923 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8795 | 0.9819 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.877 | 0.9738 | | res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.8719 | 0.9671 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.871 | 0.9804 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.2717 | nan | 0.8701 | 1.0089 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8161 | 0.8619 | 0.9858 | | cspdarknet53 | 64 | 0.9915 | 0.8405 | 0.3241 | 0.8382 | 0.8607 | 1.0102 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | 0.8188 | 0.8449 | 0.9432 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.8371 | 1.0078 | | convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.806 | 0.9865 | | 
resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.7981 | 0.8121 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5513 | 0.745 | 0.8294 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.7194 | 1.0197 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.7141 | 0.9624 | | jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6644 | 0.8514 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.6295 | 0.7419 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5534 | 0.8298 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.2673 | nan | nan | nan | +---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ | convmixer_768_32 | 32 | 296.486 | 296.8557 | 321.1375 | nan | 280.7942 | 282.1234 | | tnt_s_patch16_224 | 128 | 363.6214 | 364.7147 | nan | nan | 189.0574 | 191.9649 | | hrnet_w18 | 128 | 297.7562 | 289.8731 | 344.5862 | nan | 188.7442 | 221.1564 | | convnext_base | 64 | 121.4143 | 121.6429 | 151.4207 | nan | 183.0963 | 187.5732 | | pnasnet5large | 16 | 229.4869 | 221.4024 | 257.83 | nan | 168.9324 | 173.7203 | | tf_mixnet_l | 128 | 195.1447 | 210.0817 | 240.6266 | nan | 162.2243 | 162.9431 | | mixnet_l | 128 | 186.6718 | 201.9991 | 230.0394 | nan | 157.43 | 158.2623 | | convit_base | 64 | 181.2822 | 181.7059 | 217.6003 | 146.3074 | 130.2669 | 137.5216 | | pit_b_224 | 64 | 154.8196 | 155.4562 | 188.2924 | 159.2155 | 117.5385 | 118.1064 | | cait_m36_384 | 4 | 165.9859 | 164.6621 | nan | nan | 117.3606 | 121.8195 | | dla102 | 128 
| 178.2148 | 179.1855 | 214.8749 | 135.5781 | 112.827 | 115.1884 | | poolformer_m36 | 64 | 148.8974 | 149.0187 | 183.9342 | nan | 112.0757 | 114.8132 | | beit_base_patch16_224 | 64 | 134.9152 | 137.7806 | nan | nan | 108.2304 | 109.6662 | | resnest101e | 64 | 167.9436 | 165.3773 | 199.6766 | nan | 108.0546 | 113.5857 | | adv_inception_v3 | 128 | 160.9935 | 161.5713 | 188.5554 | 140.888 | 107.1038 | 109.8873 | | inception_v3 | 128 | 160.6292 | 161.0654 | 188.1764 | 140.9007 | 107.0841 | 109.5081 | | gluon_inception_v3 | 128 | 160.974 | 161.5206 | 188.646 | 141.1272 | 107.0629 | 109.3358 | | vit_base_patch16_224 | 64 | 120.4637 | 121.1913 | 144.9533 | 132.1297 | 104.0355 | 104.985 | | swsl_resnext101_32x16d | 32 | 117.7744 | 120.0279 | 145.97 | nan | 103.9766 | 111.3971 | | res2net50_14w_8s | 128 | 145.4328 | 146.8044 | 179.7227 | 147.1104 | 99.6889 | 104.0853 | | swin_base_patch4_window7_224 | 64 | 147.0818 | 153.2668 | nan | nan | 99.4041 | 104.0568 | | res2next50 | 128 | 138.6325 | 138.6725 | 166.1529 | 121.6646 | 97.7916 | 102.4728 | | mixer_b16_224 | 128 | 118.3458 | 118.6056 | 147.521 | 131.0367 | 94.006 | 94.6202 | | dpn107 | 32 | 114.1541 | 115.883 | 142.7404 | nan | 93.7816 | 91.7772 | | gmlp_s16_224 | 128 | 136.292 | 136.5303 | 173.1511 | 134.0163 | 89.4771 | 90.6161 | | jx_nest_base | 32 | 118.8976 | 119.7334 | 149.3025 | nan | 87.3509 | 89.5918 | | dm_nfnet_f0 | 128 | 131.6929 | 131.5566 | 148.9204 | 142.1526 | 87.1907 | 91.5997 | | volo_d1_224 | 64 | 134.5478 | 134.9864 | 160.1128 | nan | 86.5848 | 88.2192 | | eca_botnext26ts_256 | 128 | 112.1036 | 135.4767 | 163.5368 | 102.4249 | 86.3948 | 86.6135 | | gluon_xception65 | 32 | 97.8576 | 98.5746 | 130.6599 | nan | 84.3137 | 86.6696 | | fbnetv3_b | 128 | 120.8277 | 122.5649 | 148.5787 | nan | 83.0267 | 84.5648 | | gmixer_24_224 | 128 | 119.7908 | 136.0844 | 166.2186 | 129.8953 | 80.2727 | 80.8411 | | visformer_small | 128 | 98.1431 | 97.9784 | 116.7458 | nan | 79.8902 | 83.4777 | | botnet26t_256 | 128 | 
106.0373 | 106.5229 | 127.725 | 81.1549 | 78.4519 | 77.8833 | | crossvit_9_240 | 128 | 109.2776 | 109.8997 | 130.2487 | 119.1501 | 78.2862 | 79.7036 | | res2net101_26w_4s | 64 | 121.7017 | 129.0133 | 126.692 | nan | 77.7325 | 95.1401 | | twins_pcpvt_base | 64 | 125.2206 | 143.6159 | 138.8939 | nan | 76.4556 | 81.9545 | | deit_base_distilled_patch16_224 | 64 | 94.1628 | 94.926 | 117.9446 | 96.3779 | 75.9289 | 76.9224 | | coat_lite_mini | 128 | 115.747 | 117.2487 | 137.6673 | 100.5884 | 72.9963 | 73.6657 | | gernet_l | 128 | 79.6333 | 80.5474 | 98.5914 | 71.0486 | 70.9857 | 70.0816 | | cspdarknet53 | 64 | 95.9161 | 96.6293 | 119.4499 | 82.8709 | 69.3552 | 68.1578 | | rexnet_100 | 128 | 90.8942 | 103.0498 | 127.0586 | nan | 68.8717 | 68.7081 | | repvgg_a2 | 128 | 79.6315 | 80.322 | 94.1804 | 70.3765 | 68.2501 | 67.1488 | | nfnet_l0 | 128 | 106.2833 | 131.0196 | 148.4146 | 124.5304 | 68.098 | 72.2292 | | sebotnet33ts_256 | 64 | 83.2625 | 96.032 | 118.2151 | 82.8439 | 66.8136 | 66.9768 | | tf_efficientnet_b0 | 128 | 90.5795 | 108.2595 | 131.1144 | 92.0013 | 65.9276 | 64.3744 | | mobilevit_s | 64 | 89.9782 | 107.4419 | 133.4305 | nan | 64.2514 | 64.3669 | | xcit_large_24_p8_224 | 5 | 128.6823 | nan | nan | nan | 62.0273 | 73.0838 | | fbnetc_100 | 128 | 87.9137 | 88.9833 | 105.5735 | 74.7723 | 61.9827 | 60.9264 | | tinynet_a | 128 | 75.7975 | 90.8109 | 110.6837 | 99.1569 | 58.0368 | 60.6362 | | spnasnet_100 | 128 | 76.555 | 77.3926 | 93.1575 | 66.2718 | 53.6136 | 54.5766 | | resmlp_12_224 | 128 | 68.1068 | 68.3201 | 87.2123 | 45.765 | 51.2553 | 52.5835 | | ese_vovnet19b_dw | 128 | 67.7937 | 68.2858 | 85.9112 | 58.4775 | 47.9989 | 47.7794 | | mnasnet_100 | 128 | 69.989 | 70.7781 | 84.6785 | 56.6127 | 47.2192 | 45.7042 | | ghostnet_100 | 128 | 95.9114 | 97.6094 | 107.551 | 101.2856 | 46.0617 | 54.3073 | | mobilenetv2_100 | 128 | 67.4917 | 68.2319 | 88.9909 | 57.3132 | 45.9347 | 44.8565 | | selecsls42b | 128 | 62.7781 | 62.9914 | 74.5785 | 48.8287 | 43.545 | 44.503 | | 
mobilenetv3_large_100 | 128 | 66.0257 | 66.6214 | 80.2942 | 68.1801 | 43.415 | 44.0017 | | regnety_002 | 128 | 57.9244 | 60.1452 | 46.9374 | 65.124 | 25.5235 | 37.5032 | | lcnet_050 | 128 | 34.0898 | 34.6161 | 38.6199 | 33.3388 | 16.3715 | 20.7263 | | eca_halonext26ts | 128 | 115.8373 | 139.2614 | 167.7038 | nan | nan | nan | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

Build Summary

### Run name ###
day_325_21_11_22_performance_amp_342

### Commit hashes ###
pytorch commit: 80352a8c91bbb4f9b94fadab982608d6a2050db1
functorch Absent
torchbench commit: 63d4037c8738908f3edfb3f7af69888378f57929

### TorchDynamo config flags ###
torch._dynamo.config.HAS_REFS_PRIMS = True
torch._dynamo.config.capture_scalar_outputs = False
torch._dynamo.config.dead_code_elimination = True
torch._dynamo.config.dynamic_propagation = True
torch._dynamo.config.dynamic_shapes = False
torch._dynamo.config.enforce_cond_guards_match = True
torch._dynamo.config.error_on_nested_fx_trace = True
torch._dynamo.config.fake_tensor_propagation = True
torch._dynamo.config.guard_nn_modules = False
torch._dynamo.config.normalize_ir = False
torch._dynamo.config.optimize_ddp = False
torch._dynamo.config.print_graph_breaks = False
torch._dynamo.config.raise_on_ctx_manager_usage = True
torch._dynamo.config.raise_on_unsafe_aot_autograd = False
torch._dynamo.config.replay_record_enabled = False
torch._dynamo.config.specialize_int_float = True
torch._dynamo.config.suppress_errors = False
torch._dynamo.config.verbose = False
torch._dynamo.config.verify_correctness = False

### Torch version ###
torch: 1.14.0.dev20221114+cu116

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8302
Number CUDA Devices: 1
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.314694656

williamwen42 commented 2 years ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
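As a rough illustration of the aggregation described above, the following sketch drops models that fail the accuracy check and then takes the geometric mean of per-model speedups (eager latency divided by compiled latency). The record layout and the function name `geomean_speedup` are assumptions made for this example, not the dashboard's actual code; the sample values are the eager and inductor entries for three timm models taken from the absolute latency and accuracy tables in this thread.

```python
import math


def geomean_speedup(records):
    """Geometric mean of eager_ms / compiled_ms over models that pass accuracy.

    `records` is a hypothetical list of dicts; the real report generator may
    store its results differently.
    """
    ratios = [
        r["eager_ms"] / r["compiled_ms"]
        for r in records
        if r["accuracy"] == "pass"
    ]
    if not ratios:
        return float("nan")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(x) for x in ratios) / len(ratios))


records = [
    {"name": "lcnet_050", "accuracy": "pass",
     "eager_ms": 34.0898, "compiled_ms": 16.3715},
    {"name": "regnety_002", "accuracy": "pass",
     "eager_ms": 57.9244, "compiled_ms": 25.5235},
    # Excluded from the mean because it fails the accuracy check.
    {"name": "gluon_xception65", "accuracy": "fail_accuracy",
     "eager_ms": 97.8576, "compiled_ms": 84.3137},
]

print(f"{geomean_speedup(records):.2f}x")
```

A geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 0.5x slowdown then cancel out to 1x.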

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 98%, 41/42  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 95%, 40/42  | 93%, 57/61  |
|     aot_cudagraphs     | 85%, 46/54 | 81%, 34/42  | 89%, 54/61  |
|    nvprims_nvfuser     | 59%, 32/54 |  10%, 4/42  | 52%, 32/61  |
|        inductor        | 81%, 44/54 | 90%, 38/42  | 90%, 55/61  |
| inductor_no_cudagraphs | 85%, 46/54 | 90%, 38/42  | 90%, 55/61  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|     aot_cudagraphs     |   1.22x    |    1.12x    |    1.00x    |
|    nvprims_nvfuser     |   1.02x    |    1.04x    |    1.08x    |
|        inductor        |   1.84x    |    1.74x    |    1.41x    |
| inductor_no_cudagraphs |   1.38x    |    1.53x    |    1.36x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.06    |    2.84     |    2.33     |
|       aot_eager        |    6.61    |    10.24    |    8.69     |
|     aot_cudagraphs     |    9.51    |    16.50    |    16.36    |
|    nvprims_nvfuser     |   66.11    |   133.86    |   151.35    |
|        inductor        |   33.97    |    38.49    |    44.16    |
| inductor_no_cudagraphs |   34.21    |    33.58    |    41.73    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.84x    |    0.89x    |    0.87x    |
|     aot_cudagraphs     |   0.41x    |    0.38x    |    0.33x    |
|    nvprims_nvfuser     |   0.83x    |    1.01x    |    0.86x    |
|        inductor        |   0.83x    |    0.85x    |    0.94x    |
| inductor_no_cudagraphs |   0.96x    |    1.01x    |    1.05x    |
+------------------------+------------+-------------+-------------+
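
To help read the table above, here is a minimal sketch of how such a ratio could be computed. It assumes the compression ratio is the eager peak memory divided by the compiled peak memory, which is consistent with "higher is better" but is not stated explicitly in this report. The function name and the sample memory figures are hypothetical.

```python
def compression_ratio(eager_peak_gb, compiled_peak_gb):
    """Assumed definition: eager peak memory / compiled peak memory.

    A value above 1.0 means the compiled model used less peak memory than
    eager mode; a value below 1.0 means it used more.
    """
    if compiled_peak_gb <= 0:
        raise ValueError("compiled_peak_gb must be positive")
    return eager_peak_gb / compiled_peak_gb


# Hypothetical numbers: eager peaks at 10.0 GB, the compiled run at 12.5 GB.
ratio = compression_ratio(10.0, 12.5)
print(f"{ratio:.2f}x")
```

Under this assumed definition, the 0.33x to 0.41x values for aot_cudagraphs would mean roughly 2.5x to 3x the eager peak memory.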

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the 2 most recent reports that actually run the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_142
Previous report name: /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324

Passrate diff
~~~
+------------------------+-------------+------------+------------+
|        compiler        |    suite    | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
|        inductor        | torchbench  | 81%, 44/54 | 81%, 44/54 |
|        inductor        | huggingface | 87%, 39/45 | 87%, 39/45 |
|        inductor        | timm_models | 89%, 54/61 | 90%, 55/61 |
| inductor_no_cudagraphs | torchbench  | 87%, 47/54 | 87%, 47/54 |
| inductor_no_cudagraphs | huggingface | 91%, 41/45 | 91%, 41/45 |
| inductor_no_cudagraphs | timm_models | 89%, 54/61 | 89%, 54/61 |
+------------------------+-------------+------------+------------+
~~~
Geometric mean speedup diff
~~~
+------------------------+-------------+------------+-----------+
|        compiler        |    suite    | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
|        inductor        | torchbench  |   1.67x    |   1.66x   |
|        inductor        | huggingface |   1.62x    |   1.63x   |
|        inductor        | timm_models |   1.19x    |   1.17x   |
| inductor_no_cudagraphs | torchbench  |   1.28x    |   1.28x   |
| inductor_no_cudagraphs | huggingface |   1.53x    |   1.54x   |
| inductor_no_cudagraphs | timm_models |   1.16x    |   1.16x   |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+--------------------------------+---------------+------------------------+
|    suite    |              name              |   inductor    | inductor_no_cudagraphs |
+-------------+--------------------------------+---------------+------------------------+
| torchbench  |         hf_Longformer          |  fail_to_run  |      fail_to_run       |
| torchbench  |        vision_maskrcnn         |  fail_to_run  |      fail_to_run       |
| torchbench  |              moco              |  fail_to_run  |      fail_to_run       |
| torchbench  |           tacotron2            |  fail_to_run  |          pass          |
| torchbench  |           hf_BigBird           |  fail_to_run  |      fail_to_run       |
| torchbench  |       timm_efficientdet        |  fail_to_run  |      fail_to_run       |
| torchbench  |              dlrm              |  fail_to_run  |      fail_to_run       |
| torchbench  |      functorch_dp_cifar10      | fail_accuracy |     fail_accuracy      |
| torchbench  |       mobilenet_v3_large       | fail_accuracy |     fail_accuracy      |
| torchbench  |          tts_angular           |    0.0000     |         0.0000         |
| huggingface | MBartForConditionalGeneration  |  fail_to_run  |      fail_to_run       |
| huggingface | PLBartForConditionalGeneration |  fail_to_run  |      fail_to_run       |
| huggingface |            BigBird             |  fail_to_run  |      fail_to_run       |
| huggingface |     AllenaiLongformerBase      |  fail_to_run  |      fail_to_run       |
| timm_models |          convit_base           |  fail_to_run  |      fail_to_run       |
| timm_models |        eca_halonext26ts        |  fail_to_run  |     fail_accuracy      |
| timm_models |        gluon_xception65        | fail_accuracy |     fail_accuracy      |
| timm_models |         poolformer_m36         | fail_accuracy |     fail_accuracy      |
| timm_models |           fbnetv3_b            | fail_accuracy |     fail_accuracy      |
| timm_models |          spnasnet_100          | fail_accuracy |     fail_accuracy      |
+-------------+--------------------------------+---------------+------------------------+
~~~
Performance speedup warnings
~~~
+-------------+-----------------------+----------+------------------------+
|    suite    |         name          | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  |     hf_GPT2_large     |   0.0    |         1.8633         |
| torchbench  |       tacotron2       |   0.0    |         0.8824         |
| torchbench  |         dlrm          |   0.0    |          0.0           |
| torchbench  |      hf_BigBird       |   0.0    |          0.0           |
| torchbench  |     hf_Longformer     |   0.0    |          0.0           |
| torchbench  |         moco          |   0.0    |          0.0           |
| huggingface |        BigBird        |   0.0    |          0.0           |
| huggingface | AllenaiLongformerBase |   0.0    |          0.0           |
| timm_models |     convnext_base     |  0.6631  |         0.6452         |
| timm_models |   eca_halonext26ts    |   0.0    |          0.0           |
+-------------+-----------------------+----------+------------------------+
~~~
Compilation latency (sec) warnings
~~~
+-------------+-------------------+----------+------------------------+
|    suite    |       name        | inductor | inductor_no_cudagraphs |
+-------------+-------------------+----------+------------------------+
| torchbench  |      yolov3       | 404.1995 |        416.489         |
| torchbench  | timm_efficientdet | 146.2678 |        144.8974        |
| torchbench  |    hf_T5_large    | 145.3088 |        139.5987        |
| timm_models |     hrnet_w18     | 150.2292 |        136.4794        |
| timm_models | twins_pcpvt_base  | 130.834  |        129.5663        |
+-------------+-------------------+----------+------------------------+
~~~
Peak Memory Compression Ratio warnings
~~~
+-------------+----------------------------------+----------+------------------------+
|    suite    |               name               | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench  |        speech_transformer        |  0.8824  |         0.8866         |
| torchbench  |  timm_vision_transformer_large   |  0.879   |         1.0245         |
| torchbench  |           BERT_pytorch           |  0.8778  |         1.0948         |
| torchbench  |           timm_resnest           |  0.8759  |         0.9953         |
| torchbench  |           densenet121            |  0.8753  |         1.0051         |
| torchbench  |          squeezenet1_1           |  0.8735  |         1.0608         |
| torchbench  |             hf_Bert              |  0.8728  |         0.942          |
| torchbench  |        shufflenet_v2_x1_0        |  0.8692  |         0.9802         |
| torchbench  |             resnet50             |  0.8659  |         0.885          |
| torchbench  |           hf_T5_large            |  0.8541  |         0.8541         |
| torchbench  |          hf_DistilBert           |  0.8348  |         0.9049         |
| torchbench  |           fastNLP_Bert           |  0.8013  |         1.0681         |
| torchbench  |             alexnet              |  0.7973  |         1.0079         |
| torchbench  |             hf_Bart              |  0.7933  |         0.9724         |
| torchbench  |        mobilenet_v3_large        |  0.791   |         0.8143         |
| torchbench  |           timm_vovnet            |  0.7799  |         0.8875         |
| torchbench  |         pytorch_stargan          |  0.7783  |         0.8847         |
| torchbench  |         resnext50_32x4d          |  0.7644  |         0.7753         |
| torchbench  |              vgg16               |  0.7633  |         1.0588         |
| torchbench  |            mnasnet1_0            |  0.7541  |         0.7741         |
| torchbench  |               drq                |  0.752   |         0.9256         |
| torchbench  |        soft_actor_critic         |  0.7295  |         1.0368         |
| torchbench  |         LearningToPaint          |  0.7295  |         0.925          |
| torchbench  |     timm_vision_transformer      |  0.7133  |         0.7227         |
| torchbench  |             resnet18             |  0.6102  |         0.6257         |
| torchbench  |           hf_Reformer            |  0.5851  |         1.0014         |
| torchbench  |          lennard_jones           |  0.564   |         0.9991         |
| torchbench  |      nvidia_deeprecommender      |  0.5596  |         0.5596         |
| torchbench  |       functorch_dp_cifar10       |  0.4481  |         0.4691         |
| torchbench  |          pytorch_struct          |  0.4235  |         0.4353         |
| torchbench  |              dcgan               |  0.2123  |         0.2137         |
| torchbench  |            tacotron2             |   nan    |         0.4112         |
| huggingface | MegatronBertForQuestionAnswering |  0.893   |         1.0053         |
| huggingface |     MegatronBertForCausalLM      |  0.8919  |         1.0207         |
| huggingface |  DistilBertForQuestionAnswering  |   0.89   |         0.9848         |
| huggingface |         BertForMaskedLM          |  0.8834  |         0.9285         |
| huggingface |        RobertaForCausalLM        |  0.8828  |         0.9282         |
| huggingface |         TrOCRForCausalLM         |  0.8816  |         0.9425         |
| huggingface |  MBartForConditionalGeneration   |  0.8755  |         1.0595         |
| huggingface |   MT5ForConditionalGeneration    |  0.875   |         0.919          |
| huggingface |          OPTForCausalLM          |  0.8727  |         0.9449         |
| huggingface |  PLBartForConditionalGeneration  |  0.8523  |         0.9876         |
| huggingface |      DistilBertForMaskedLM       |  0.8215  |         0.8801         |
| huggingface |            CamemBert             |  0.8065  |         0.9306         |
| huggingface |         XGLMForCausalLM          |  0.8055  |         0.9516         |
| huggingface |           DistillGPT2            |  0.8048  |         0.9949         |
| huggingface |     Speech2Text2ForCausalLM      |  0.8039  |         0.898          |
| huggingface |        PLBartForCausalLM         |  0.7975  |         0.8675         |
| huggingface |        ElectraForCausalLM        |  0.7949  |         0.8607         |
| huggingface |         YituTechConvBert         |  0.7909  |         0.9314         |
| huggingface |    BlenderbotSmallForCausalLM    |  0.778   |         0.859          |
| huggingface |  M2M100ForConditionalGeneration  |  0.752   |         0.9892         |
| huggingface |      MobileBertForMaskedLM       |  0.5931  |         0.7994         |
| huggingface |  MobileBertForQuestionAnswering  |  0.4995  |         0.635          |
| huggingface |        DebertaForMaskedLM        |  0.409   |         1.026          |
| huggingface |   DebertaForQuestionAnswering    |  0.3071  |         1.1616         |
| timm_models |        res2net101_26w_4s         |  0.8977  |         0.973          |
| timm_models |           inception_v3           |  0.8975  |         1.0248         |
| timm_models |        gluon_inception_v3        |  0.8975  |         1.0248         |
| timm_models |         adv_inception_v3         |  0.8975  |         1.0248         |
| timm_models |         gluon_xception65         |  0.8975  |         0.9763         |
| timm_models |            fbnetc_100            |  0.8973  |         0.9876         |
| timm_models |            hrnet_w18             |  0.8969  |         1.0032         |
| timm_models |          mixer_b16_224           |  0.8927  |         0.963          |
| timm_models |           selecsls42b            |  0.8926  |         0.9897         |
| timm_models |       vit_base_patch16_224       |  0.8877  |         0.8929         |
| timm_models | deit_base_distilled_patch16_224  |  0.8872  |         0.8923         |
| timm_models |           spnasnet_100           |  0.8795  |         0.9819         |
| timm_models |         res2net50_14w_8s         |  0.877   |         0.9738         |
| timm_models |            res2next50            |  0.8719  |         0.9671         |
| timm_models |           mnasnet_100            |  0.871   |         0.9804         |
| timm_models |             mixnet_l             |  0.8701  |         1.0089         |
| timm_models |             gernet_l             |  0.8619  |         0.9858         |
| timm_models |           cspdarknet53           |  0.8607  |         1.0102         |
| timm_models |          botnet26t_256           |  0.8503  |         0.9434         |
| timm_models |            lcnet_050             |  0.8449  |         0.9432         |
| timm_models |           regnety_002            |  0.8371  |         1.0078         |
| timm_models |          convnext_base           |  0.806   |         0.9865         |
| timm_models |          resmlp_12_224           |  0.7981  |         0.8121         |
| timm_models |         sebotnet33ts_256         |  0.745   |         0.8294         |
| timm_models |          coat_lite_mini          |  0.7194  |         1.0197         |
| timm_models |          crossvit_9_240          |  0.7141  |         0.9624         |
| timm_models |           jx_nest_base           |  0.6644  |         0.8514         |
| timm_models |   swin_base_patch4_window7_224   |  0.6295  |         0.7419         |
| timm_models |            repvgg_a2             |  0.5534  |         0.8298         |
+-------------+----------------------------------+----------+------------------------+
~~~

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###

Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_142
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_142
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324

Accuracy regressions
~~~
+----------+------+-------------+-------------+
| compiler | name | prev_status | cur_status  |
+----------+------+-------------+-------------+
| inductor | dlrm | pass        | fail_to_run |
+----------+------+-------------+-------------+
~~~

### Regressions for huggingface ###

Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_142
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_142
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324

No regressions found.

### Regressions for timm_models ###

Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_142
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_142
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_325_21_11_22_performance_amp_324

Compilation latency (sec) regressions
~~~
+----------+-------------+-------------+------------+
| compiler | name        | prev_status | cur_status |
+----------+-------------+-------------+------------+
| inductor | mobilevit_s | 118.9846    | 121.0034   |
+----------+-------------+-------------+------------+
~~~
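The comparison described at the start of this section (previously unflagged models that are now flagged) can be sketched roughly as below. This is an illustration only, not the dashboard's actual script: `find_regressions`, `latency_flagged`, `accuracy_flagged`, and the 120-second limit are all hypothetical names and values. The limit is merely consistent with the `mobilevit_s` row above; the report itself does not state its thresholds. Models that are missing from the previous report are skipped here, which is also an assumption.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

V = TypeVar("V")


def find_regressions(
    prev: Dict[str, V],
    cur: Dict[str, V],
    is_flagged: Callable[[V], bool],
) -> List[Tuple[str, V, V]]:
    """Return (name, prev_value, cur_value) for models that were not flagged
    in the previous report but are flagged in the current one."""
    regressions = []
    for name, cur_value in cur.items():
        if name not in prev:
            # Assumption: a model absent from the previous report cannot regress.
            continue
        prev_value = prev[name]
        if is_flagged(cur_value) and not is_flagged(prev_value):
            regressions.append((name, prev_value, cur_value))
    return regressions


# Hypothetical limit, chosen only to illustrate the idea; not taken from the report.
ASSUMED_LATENCY_LIMIT_SEC = 120.0


def latency_flagged(seconds: float) -> bool:
    """Flag a compilation latency above the assumed limit."""
    return seconds > ASSUMED_LATENCY_LIMIT_SEC


def accuracy_flagged(status: str) -> bool:
    """Flag any accuracy status that is not some form of 'pass' (assumed rule)."""
    return not status.startswith("pass")


if __name__ == "__main__":
    # Values copied from the regression tables in this section.
    print(find_regressions({"mobilevit_s": 118.9846}, {"mobilevit_s": 121.0034}, latency_flagged))
    print(find_regressions({"dlrm": "pass"}, {"dlrm": "fail_to_run"}, accuracy_flagged))
```

Passing the flagging rule as a predicate keeps the same comparison usable for accuracy statuses, compilation latency, and the other metrics, each of which would need its own (here assumed) rule.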

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | densenet121 | 4 | 1.0021 | 0.9269 | 2.4759 | 0.7336 | 6.1007 | 1.3179 | | functorch_dp_cifar10 | 64 | 1.0025 | 0.959 | 2.3644 | 0.0 | 5.0593 | 0.9792 | | timm_efficientdet | 1 | 0.9846 | 0.8224 | 2.1111 | 0.0 | 4.754 | 1.5319 | | resnext50_32x4d | 8 | 1.0029 | 0.9629 | 1.9044 | 0.7558 | 3.5498 | 1.2678 | | timm_vision_transformer | 8 | 1.0015 | 0.8456 | 1.8027 | 0.59 | 3.4415 | 1.532 | | BERT_pytorch | 16 | 1.0065 | 0.8313 | 1.5678 | 0.8309 | 3.366 | 2.332 | | mobilenet_v3_large | 32 | 1.0033 | 1.0061 | 1.6121 | 0.7691 | 3.0827 | 1.3913 | | drq | 1 | 1.0088 | 0.8228 | 1.9929 | 0.608 | 3.0015 | 1.1596 | | dcgan | 32 | 0.9819 | 0.9163 | 1.6644 | 0.7106 | 2.8668 | 1.0467 | | resnet18 | 16 | 1.0017 | 0.997 | 1.584 | 0.7957 | 2.8116 | 1.2074 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9956 | 0.976 | 1.7732 | 0.0 | 2.7857 | 1.5668 | | hf_T5_large | 2 | 1.0196 | 0.8562 | 0.0 | 0.0 | 2.6305 | 2.1346 | | mnasnet1_0 | 32 | 1.0 | 1.021 | 1.2678 | 0.7709 | 2.6232 | 1.3497 | | squeezenet1_1 | 32 | 0.9942 | 0.9626 | 1.4509 | 0.7253 | 2.4487 | 1.3039 | | hf_Albert | 8 | 1.0025 | 0.9621 | 0.7743 | 0.0 | 2.3629 | 2.2746 | | hf_GPT2 | 4 | 1.0238 | 0.9834 | 0.8156 | 0.2905 | 2.128 | 1.9203 | | pytorch_struct | 200 | 0.9858 | 0.7499 | 1.0158 | 0.5997 | 2.1278 | 1.28 | | timm_efficientnet | 32 | 0.9617 | 0.819 | 1.0779 | 0.6806 | 2.1064 | 1.2819 | | hf_Bert | 4 | 1.0358 | 0.8393 | 0.9547 | 0.0 | 2.0757 | 1.8356 | | lennard_jones | 1000 | 0.9695 | 0.7698 | 1.3011 | 0.4693 | 2.0722 | 1.0623 | | resnet152 | 32 | 1.0018 | 1.0101 | 1.2666 | 0.0 | 2.0638 | 1.3011 | | timm_resnest | 32 | 1.0068 | 
1.0167 | 0.8369 | 0.9652 | 1.9156 | 1.6651 | | hf_T5 | 8 | 0.9997 | 0.919 | 0.0 | 1.3547 | 1.8668 | 1.8751 | | resnet50 | 32 | 1.0015 | 1.0246 | 1.0439 | 0.811 | 1.8012 | 1.3458 | | LearningToPaint | 96 | 1.003 | 1.0147 | 1.1631 | 0.8377 | 1.7935 | 1.3141 | | hf_Bart | 4 | 1.0128 | 0.8329 | 0.9446 | 0.0 | 1.758 | 1.8321 | | soft_actor_critic | 256 | 1.0176 | 0.7414 | 1.3388 | 0.5477 | 1.746 | 1.0551 | | shufflenet_v2_x1_0 | 128 | 1.0003 | 1.0223 | 0.9819 | 0.8605 | 1.703 | 1.4324 | | mobilenet_v2 | 96 | 1.0001 | 1.0065 | 0.7606 | 1.0345 | 1.5589 | 1.5181 | | speech_transformer | 32 | 0.9559 | 0.8244 | 1.7561 | 0.0 | 1.5304 | 1.5474 | | attention_is_all_you_need_pytorch | 256 | 1.0068 | 0.9027 | 0.8406 | 0.0 | 1.5285 | 1.58 | | timm_nfnet | 128 | 0.9991 | 1.0 | 0.8727 | 0.92 | 1.5078 | 1.4307 | | fastNLP_Bert | 6 | 0.9992 | 0.8893 | 0.7649 | 0.0 | 1.5043 | 1.4513 | | hf_DistilBert | 8 | 1.0017 | 0.9746 | 0.742 | 0.3688 | 1.492 | 1.4593 | | pytorch_stargan | 16 | 0.9951 | 1.0961 | 1.0396 | 0.0 | 1.4619 | 1.5082 | | pytorch_unet | 1 | 0.9996 | 0.9921 | 0.8639 | 1.0838 | 1.3621 | 1.331 | | timm_regnet | 32 | 0.9786 | 0.9422 | 0.9011 | 0.7826 | 1.3385 | 1.2223 | | timm_vovnet | 32 | 0.9205 | 0.8797 | 0.8693 | 0.7984 | 1.2996 | 1.1491 | | vgg16 | 64 | 0.9996 | 0.9972 | 0.8566 | 0.9734 | 1.2708 | 1.2639 | | Background_Matting | 4 | 0.9999 | 1.0155 | 0.8959 | 1.0571 | 1.2373 | 1.2197 | | Super_SloMo | 6 | 0.9993 | 0.995 | 0.8851 | 0.0 | 1.2277 | 1.1941 | | alexnet | 128 | 0.999 | 0.9977 | 0.815 | 0.928 | 1.2089 | 1.2102 | | hf_Reformer | 4 | 0.9987 | 1.0002 | 0.9928 | 0.6513 | 1.1761 | 1.1801 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9903 | 0.0 | 0.0 | 1.0903 | 1.0719 | | yolov3 | 16 | 0.9997 | 0.9906 | 0.8035 | 0.0 | 1.0881 | 1.0689 | | tts_angular | 64 | 0.975 | 0.9437 | 0.9749 | 0.9511 | 1.0167 | 1.0065 | | demucs | 4 | 1.0014 | 1.0 | 1.0002 | 0.998 | 1.0017 | 1.0006 | | nvidia_deeprecommender | 256 | 0.9989 | 0.996 | 0.697 | 1.0074 | 0.9892 | 1.0305 | | 
hf_GPT2_large | 4 | 1.0002 | 0.9907 | 0.0 | 0.0 | 0.0 | 1.8633 | | tacotron2 | 64 | 0.988 | 0.7645 | 0.9786 | 0.5994 | 0.0 | 0.8824 | | dlrm | 2048 | 1.01 | 1.1541 | 0.0 | 1.1273 | 0.0 | 0.0 | | hf_BigBird | 2 | 0.9843 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | hf_T5_base | 2 | pass | pass | fail_to_run | pass | pass | pass | | squeezenet1_1 | 2 | pass | pass | pass | pass | pass | pass | | timm_efficientnet | 2 | pass | pass | pass | pass | pass | pass | | timm_regnet | 2 | pass | pass | pass | pass | pass | pass | | timm_resnest | 2 | pass | pass | pass | pass | pass | pass | | timm_vision_transformer | 2 | pass | pass | pass | pass | pass | pass | | timm_vovnet | 2 | pass | pass | pass | pass | pass | pass | | vgg16 | 2 | pass | pass | pass | pass | pass | pass | | yolov3 | 2 | pass | pass | pass | pass | pass | pass | | Super_SloMo | 2 | pass | pass | pass | fail_to_run | 
pass | pass | | shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass | pass | pass | | attention_is_all_you_need_pytorch | 2 | pass | pass | pass | fail_to_run | pass | pass | | fastNLP_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Albert | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bart | 2 | pass | pass | pass | fail_to_run | pass | pass | | hf_Bert | 2 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | fail_to_run | pass | pass | | pytorch_stargan | 16 | pass | pass | pass | fail_to_run | pass | pass | | resnet152 | 2 | pass | pass | pass | fail_to_run | pass | pass | | speech_transformer | 2 | pass | pass | pass | fail_accuracy | pass | pass | | soft_actor_critic | 256 | pass | pass | pass | pass | pass | pass | | timm_nfnet | 2 | pass | pass | pass | pass | pass | pass | | resnext50_32x4d | 2 | pass | pass | pass | pass | pass | pass | | hf_GPT2 | 2 | pass | pass | pass | pass | pass | pass | | BERT_pytorch | 2 | pass | pass | pass | pass | pass | pass | | Background_Matting | 4 | pass | pass | pass | pass | pass | pass | | LearningToPaint | 2 | pass | pass | pass | pass | pass | pass | | alexnet | 2 | pass | pass | pass | pass | pass | pass | | dcgan | 2 | pass | pass | pass | pass | pass | pass | | resnet50 | 2 | pass | pass | pass | pass | pass | pass | | densenet121 | 2 | pass | pass | pass | pass | pass | pass | | drq | 1 | pass | pass | pass | pass | pass | pass | | hf_DistilBert | 2 | pass | pass | pass | pass | pass | pass | | demucs | 4 | pass | pass | pass | pass | pass | pass | | hf_Reformer | 2 | pass | pass | pass | pass | pass | pass | | lennard_jones | 2 | pass | pass | pass | pass | pass | pass | | mnasnet1_0 | 2 | pass | pass | pass | pass | pass | pass | | mobilenet_v2 | 2 | pass | pass | pass | pass | pass | pass | | nvidia_deeprecommender | 2 | pass | pass | pass | pass | pass | pass | | pytorch_struct | 200 | pass | pass | pass | pass | pass | pass | 
| pytorch_unet | 2 | pass | pass | pass | pass | pass | pass | | resnet18 | 2 | pass | pass | pass | pass | pass | pass | | hf_T5 | 2 | pass | pass | pass | pass | pass | pass | | hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | vision_maskrcnn | 2 | pass | pass | fail_to_run | 0.0000 | fail_to_run | fail_to_run | | moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | tacotron2 | 2 | pass | pass | pass | fail_accuracy | fail_to_run | pass | | hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | timm_efficientdet | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run | | dlrm | 2 | pass | pass | fail_to_run | pass | fail_to_run | fail_to_run | | functorch_dp_cifar10 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy | | mobilenet_v3_large | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy | | tts_angular | 2 | pass | pass | pass | 0.0000 | 0.0000 | 0.0000 | +-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ | yolov3 | 16 | 3.1344 | 8.4555 | 11.8143 | nan | 404.1995 | 416.489 | | timm_efficientdet | 1 | 20.2634 | 39.3318 | 77.1986 | nan | 146.2678 | 144.8974 | | hf_T5_large | 2 | 14.8888 | 39.8919 | nan | nan | 145.3088 | 139.5987 | | timm_vision_transformer_large | 8 | 3.0569 | 15.4787 | nan | nan | 72.4614 | 69.0113 | | resnet152 | 32 | 2.7633 | 14.3372 
| 22.2274 | nan | 53.5223 | 52.8041 | | densenet121 | 4 | 2.4205 | 12.1053 | 19.2315 | 234.7833 | 52.0325 | 51.0502 | | attention_is_all_you_need_pytorch | 256 | 1.4406 | 7.2285 | 11.5793 | nan | 40.3385 | 39.5388 | | timm_resnest | 32 | 0.6749 | 2.5525 | 3.8456 | 66.1754 | 39.2066 | 38.0779 | | speech_transformer | 32 | 2.0109 | 8.9765 | 34.3297 | nan | 36.4502 | 34.9328 | | hf_Bart | 4 | 2.0912 | 9.0258 | 14.3815 | nan | 36.1823 | 35.5936 | | timm_vision_transformer | 8 | 1.031 | 4.6098 | 6.8122 | 84.3136 | 35.9455 | 35.3652 | | BERT_pytorch | 16 | 1.8364 | 7.6958 | 11.5181 | 134.15 | 35.846 | 35.7492 | | fastNLP_Bert | 6 | 1.9116 | 7.2674 | 11.4915 | nan | 33.1912 | 30.7348 | | timm_nfnet | 128 | 2.2018 | 7.4185 | 11.3308 | 159.2067 | 32.7195 | 32.5884 | | hf_T5 | 8 | 2.7481 | 9.1221 | nan | 107.4401 | 32.4549 | 31.0626 | | timm_regnet | 32 | 2.4918 | 8.6327 | 19.8749 | 145.7126 | 28.8942 | 28.5276 | | pytorch_stargan | 16 | 0.4649 | 2.1492 | 2.9664 | nan | 28.1889 | 26.0531 | | timm_efficientnet | 32 | 1.9295 | 7.3003 | 15.603 | 151.8019 | 27.4539 | 27.061 | | mobilenet_v3_large | 32 | 1.0471 | 4.797 | 7.372 | 119.58 | 26.0092 | 25.9043 | | hf_Bert | 4 | 1.8867 | 7.2902 | 10.3758 | nan | 24.4481 | 23.5085 | | hf_Albert | 8 | 1.6492 | 6.7058 | 10.2512 | nan | 23.2417 | 22.1951 | | functorch_dp_cifar10 | 64 | 0.3445 | 1.4309 | 2.1635 | nan | 22.6126 | 22.7953 | | pytorch_struct | 200 | 0.2883 | 0.8641 | 1.6177 | 7.6025 | 22.4876 | 22.2684 | | mnasnet1_0 | 32 | 0.9474 | 4.3783 | 6.6271 | 88.0587 | 21.6773 | 21.1308 | | hf_GPT2 | 4 | 1.8656 | 6.5011 | 9.3689 | 114.2712 | 21.0139 | 20.0732 | | resnet50 | 32 | 1.0144 | 4.9032 | 6.8316 | 99.4446 | 20.7345 | 20.4962 | | shufflenet_v2_x1_0 | 128 | 1.1795 | 5.4288 | 7.7034 | 101.6202 | 20.552 | 20.3635 | | resnext50_32x4d | 8 | 1.0804 | 4.6139 | 6.8963 | 84.0624 | 20.3942 | 19.7537 | | timm_vovnet | 32 | 1.6063 | 4.5008 | 10.0066 | 72.0347 | 20.2442 | 19.9906 | | mobilenet_v2 | 96 | 0.9566 | 4.9474 | 7.0675 | 116.743 | 
19.8935 | 19.3603 | | Background_Matting | 4 | 0.9599 | 4.4259 | 6.5969 | 96.3123 | 19.0115 | 17.7799 | | hf_Reformer | 4 | 1.6744 | 3.0553 | 5.483 | 17.8538 | 18.9639 | 16.267 | | Super_SloMo | 6 | 0.9908 | 4.0544 | 5.7403 | nan | 17.5075 | 16.591 | | hf_DistilBert | 8 | 0.8338 | 3.5538 | 5.7907 | 64.2335 | 15.5308 | 14.8387 | | resnet18 | 16 | 0.4733 | 1.8125 | 2.6291 | 38.0577 | 11.5643 | 11.5422 | | dcgan | 32 | 0.1827 | 0.4312 | 0.679 | 5.0555 | 10.388 | 9.9073 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.4759 | 2.0248 | 2.852 | nan | 9.1949 | 9.0932 | | pytorch_unet | 1 | 0.4486 | 1.9193 | 2.7622 | 38.7551 | 8.4807 | 8.2167 | | LearningToPaint | 96 | 0.4988 | 1.9185 | 2.8943 | 47.227 | 8.2431 | 7.8678 | | squeezenet1_1 | 32 | 0.2749 | 0.9414 | 1.4095 | 6.9055 | 4.7654 | 4.5126 | | vgg16 | 64 | 0.209 | 0.6473 | 1.102 | 5.6309 | 4.2742 | 3.9492 | | drq | 1 | 0.3217 | 0.6423 | 1.0229 | 6.1416 | 4.2633 | 3.6368 | | nvidia_deeprecommender | 256 | 0.2211 | 0.5266 | 0.8912 | 5.6896 | 3.745 | 3.4994 | | soft_actor_critic | 256 | 0.2103 | 0.3601 | 0.5803 | 3.2728 | 3.5436 | 3.0174 | | alexnet | 128 | 0.1783 | 0.4468 | 0.7337 | 5.1929 | 3.325 | 3.3008 | | lennard_jones | 1000 | 0.1589 | 0.367 | 0.5531 | 2.942 | 2.3328 | 1.9799 | | tts_angular | 64 | 0.1937 | 0.2399 | 0.3659 | 1.5238 | 1.9197 | 1.7273 | | demucs | 4 | 0.3371 | 0.3585 | 0.3553 | 0.3639 | 0.2731 | 0.2673 | | hf_GPT2_large | 4 | 5.7771 | 20.2502 | nan | nan | nan | 58.0332 | | tacotron2 | 64 | 6.9867 | 20.1316 | 34.6561 | 91.2874 | nan | 45.901 | | dlrm | 2048 | 0.4851 | 0.8588 | nan | 4.3981 | nan | nan | | hf_BigBird | 2 | 4.0095 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+---------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Peak Memory Compression Ratio ~~~ 
+-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ | timm_efficientnet | 32 | 0.988 | 0.7698 | 0.2718 | 0.4638 | 1.2042 | 1.2318 | | mobilenet_v2 | 96 | 0.9857 | 0.7639 | 0.3119 | 0.9124 | 1.0606 | 1.1512 | | Super_SloMo | 6 | 1.0024 | 0.9645 | 0.3843 | nan | 1.0541 | 1.3039 | | timm_nfnet | 128 | 0.9693 | 0.8982 | 0.3556 | 0.4815 | 1.0334 | 1.1302 | | hf_Albert | 8 | 1.0001 | 0.936 | 0.3267 | nan | 1.0313 | 1.4693 | | attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 0.3514 | nan | 1.005 | 1.1086 | | timm_efficientdet | 1 | 1.028 | 0.8414 | 0.3079 | nan | 0.9991 | 1.0312 | | Background_Matting | 4 | 1.0142 | 0.9624 | 0.3723 | 0.9771 | 0.9916 | 1.0426 | | tts_angular | 64 | 1.0002 | 1.0002 | 0.9853 | 1.0003 | 0.9895 | 1.0002 | | demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | 0.9872 | | hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.38 | 1.118 | 0.9649 | 1.1241 | | pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8754 | 0.4232 | nan | 0.9506 | 1.0224 | | timm_regnet | 32 | 0.9953 | 0.8446 | 0.3492 | 0.8027 | 0.9345 | 1.0307 | | hf_T5 | 8 | 1.0 | 0.9331 | nan | 1.014 | 0.9304 | 1.2458 | | resnet152 | 32 | 0.9937 | 0.8956 | 0.3631 | nan | 0.9125 | 0.9398 | | pytorch_unet | 1 | 0.9968 | 0.8653 | 0.3572 | 0.8496 | 0.9111 | 1.0853 | | yolov3 | 16 | 0.9908 | 0.8381 | 0.3537 | nan | 0.9063 | 1.0466 | | speech_transformer | 32 | 0.9991 | 0.9812 | 0.3341 | nan | 0.8824 | 0.8866 | | timm_vision_transformer_large | 8 | 0.9974 | 0.8358 | nan | nan | 0.879 | 1.0245 | | BERT_pytorch | 16 | 1.0003 | 0.8822 | 0.3998 | 1.1039 | 0.8778 | 1.0948 | | timm_resnest | 32 | 0.9868 | 0.8711 | 0.3482 | 0.8451 | 0.8759 | 0.9953 | | densenet121 | 4 | 0.9857 | 0.8678 | 
0.3673 | 0.8452 | 0.8753 | 1.0051 | | squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.3463 | 0.8714 | 0.8735 | 1.0608 | | hf_Bert | 4 | 1.0 | 0.8759 | 0.3903 | nan | 0.8728 | 0.942 | | shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.3575 | 0.8489 | 0.8692 | 0.9802 | | resnet50 | 32 | 0.9907 | 0.8629 | 0.3561 | 0.7806 | 0.8659 | 0.885 | | hf_T5_large | 2 | 0.8541 | 0.8541 | nan | nan | 0.8541 | 0.8541 | | hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.3414 | 1.0617 | 0.8348 | 0.9049 | | fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.3702 | nan | 0.8013 | 1.0681 | | alexnet | 128 | 0.951 | 0.7753 | 0.4792 | 0.775 | 0.7973 | 1.0079 | | hf_Bart | 4 | 1.0002 | 0.8307 | 0.3635 | nan | 0.7933 | 0.9724 | | mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.3448 | 0.7921 | 0.791 | 0.8143 | | timm_vovnet | 32 | 0.9903 | 0.7678 | 0.3407 | 0.7755 | 0.7799 | 0.8875 | | pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.4252 | nan | 0.7783 | 0.8847 | | resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.3888 | 0.81 | 0.7644 | 0.7753 | | vgg16 | 64 | 0.9924 | 0.7339 | 0.3775 | 0.7341 | 0.7633 | 1.0588 | | mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.3408 | 0.8226 | 0.7541 | 0.7741 | | drq | 1 | 0.9877 | 0.8312 | 0.4769 | 0.8309 | 0.752 | 0.9256 | | soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.4737 | 0.9303 | 0.7295 | 1.0368 | | LearningToPaint | 96 | 0.9252 | 0.7196 | 0.383 | 0.6701 | 0.7295 | 0.925 | | timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.3918 | 1.0881 | 0.7133 | 0.7227 | | resnet18 | 16 | 0.9779 | 0.7727 | 0.3943 | 0.7314 | 0.6102 | 0.6257 | | hf_Reformer | 4 | 0.9996 | 0.9996 | 0.6037 | 0.9999 | 0.5851 | 1.0014 | | lennard_jones | 1000 | 0.9995 | 0.9997 | 0.3734 | 0.9996 | 0.564 | 0.9991 | | nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5124 | 0.5596 | 0.5596 | 0.5596 | | functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4465 | nan | 0.4481 | 0.4691 | | pytorch_struct | 200 | 1.0 | 0.5081 | 0.4858 | 0.5099 | 0.4235 | 0.4353 | | dcgan | 32 | 0.9698 | 0.7838 | 0.5014 | 0.7838 | 0.2123 | 0.2137 | | hf_GPT2_large | 4 
| 0.9956 | 0.8732 | nan | nan | nan | 1.1499 | | tacotron2 | 64 | 0.9866 | 0.4045 | 0.3142 | 0.3906 | nan | 0.4112 | | dlrm | 2048 | 0.7301 | 0.7306 | nan | 0.7306 | nan | nan | | hf_BigBird | 2 | 0.9489 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+--------+-----------+----------------+-----------------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ | timm_vision_transformer_large | 8 | 183.9264 | 185.8603 | nan | nan | 168.7355 | 171.7368 | | Background_Matting | 4 | 141.7648 | 131.3625 | 148.8745 | 126.0196 | 107.7583 | 109.2646 | | hf_T5 | 8 | 174.4926 | 189.4402 | nan | 128.4126 | 93.3232 | 92.963 | | hf_T5_large | 2 | 218.1989 | 260.3696 | nan | nan | 89.1603 | 110.8583 | | timm_nfnet | 128 | 131.8874 | 131.5784 | 149.7607 | 142.2406 | 87.2286 | 91.5348 | | hf_Reformer | 4 | 82.3598 | 82.1781 | 82.8209 | 126.1949 | 69.8371 | 69.6854 | | Super_SloMo | 6 | 79.0805 | 79.3464 | 89.5435 | nan | 64.5105 | 66.2058 | | yolov3 | 16 | 68.667 | 69.0193 | 85.3037 | nan | 62.9919 | 64.2431 | | demucs | 4 | 57.9343 | 57.1161 | 57.2196 | 57.1238 | 57.0935 | 57.2062 | | timm_regnet | 32 | 73.5289 | 81.4089 | 81.1558 | 91.6902 | 55.1698 | 60.0619 | | vgg16 | 64 | 66.2422 | 66.2093 | 77.0694 | 67.8099 | 52.002 | 52.3533 | | resnet152 | 32 | 91.0037 | 97.7275 | 73.4911 | nan | 45.7896 | 73.8826 | | speech_transformer | 32 | 65.2427 | 75.1714 | 34.8839 | nan | 41.5773 | 40.3135 | | fastNLP_Bert | 6 | 55.9758 | 62.4977 | 72.653 | nan | 37.2314 | 
38.5491 | | timm_efficientdet | 1 | 163.1827 | 214.6085 | 76.5472 | nan | 36.1618 | 110.5349 | | attention_is_all_you_need_pytorch | 256 | 52.8984 | 59.2412 | 63.2279 | nan | 34.8035 | 37.186 | | hf_Bart | 4 | 55.5883 | 67.7852 | 65.7889 | nan | 33.957 | 36.148 | | mobilenet_v2 | 96 | 48.8565 | 49.4278 | 64.2011 | 47.2261 | 31.3401 | 32.1664 | | hf_Albert | 8 | 68.2827 | 72.0985 | 88.2802 | nan | 29.3207 | 29.982 | | pytorch_unet | 1 | 39.9271 | 40.1581 | 46.2402 | 36.8037 | 29.3201 | 29.9666 | | hf_GPT2 | 4 | 52.4292 | 49.6814 | 60.1295 | 168.5094 | 25.4594 | 25.8753 | | timm_vovnet | 32 | 34.752 | 38.1958 | 37.1268 | 40.731 | 24.8979 | 28.7185 | | shufflenet_v2_x1_0 | 128 | 42.876 | 42.1499 | 41.6597 | 49.9317 | 24.2456 | 29.1135 | | timm_efficientnet | 32 | 48.7395 | 61.4532 | 43.3363 | 69.7235 | 22.4523 | 37.767 | | hf_Bert | 4 | 40.6596 | 58.173 | 44.0914 | nan | 21.2743 | 23.4495 | | hf_DistilBert | 8 | 30.9806 | 31.8895 | 41.8606 | 84.3181 | 20.8157 | 21.2662 | | resnet50 | 32 | 33.7115 | 35.1441 | 32.3154 | 41.5451 | 19.3801 | 27.46 | | BERT_pytorch | 16 | 55.6925 | 66.4554 | 35.0948 | 66.3584 | 16.8192 | 24.9592 | | timm_resnest | 32 | 25.0597 | 24.8603 | 29.4839 | 25.4079 | 12.8525 | 15.7415 | | densenet121 | 4 | 72.9717 | 81.5106 | 29.9783 | 100.9377 | 12.6771 | 59.61 | | mobilenet_v3_large | 32 | 34.9903 | 34.941 | 24.01 | 47.3614 | 11.9799 | 26.5817 | | mnasnet1_0 | 32 | 28.9991 | 28.4173 | 23.1117 | 38.0174 | 11.4931 | 22.367 | | pytorch_stargan | 16 | 16.102 | 15.896 | 15.4703 | nan | 10.9192 | 11.5913 | | nvidia_deeprecommender | 256 | 10.3666 | 10.4037 | 14.8759 | 10.2899 | 10.4632 | 10.05 | | timm_vision_transformer | 8 | 33.921 | 34.6595 | 16.5079 | 50.2954 | 9.9535 | 20.4789 | | resnext50_32x4d | 8 | 33.0804 | 30.4899 | 15.5983 | 43.0998 | 8.4924 | 23.3554 | | LearningToPaint | 96 | 15.4426 | 14.8511 | 12.7876 | 18.0053 | 8.4605 | 11.4183 | | alexnet | 128 | 9.7884 | 9.8124 | 12.0045 | 10.5796 | 8.0901 | 8.1139 | | tts_angular | 64 | 6.9398 | 
6.5844 | 6.4183 | 6.8377 | 6.7018 | 7.2245 | | pytorch_CycleGAN_and_pix2pix | 1 | 18.098 | 18.5743 | 10.1651 | nan | 6.6768 | 11.9156 | | squeezenet1_1 | 32 | 15.1004 | 15.4611 | 10.1004 | 20.9457 | 6.215 | 11.7538 | | resnet18 | 16 | 12.9642 | 13.1091 | 8.0182 | 16.4295 | 4.7231 | 11.7877 | | functorch_dp_cifar10 | 64 | 14.21 | 15.0108 | 6.0075 | nan | 2.9591 | 15.0933 | | pytorch_struct | 200 | 4.6757 | 6.1055 | 4.513 | 7.7912 | 2.277 | 3.7498 | | drq | 1 | 3.8879 | 4.7963 | 1.9564 | 6.6998 | 1.3729 | 3.6489 | | dcgan | 32 | 3.1322 | 3.4376 | 1.8964 | 4.4751 | 1.107 | 2.9838 | | soft_actor_critic | 256 | 1.3741 | 1.8739 | 1.0807 | 2.8127 | 0.8557 | 1.4163 | | lennard_jones | 1000 | 1.4503 | 2.1461 | 1.1727 | 3.1974 | 0.749 | 1.4673 | | tacotron2 | 64 | 3526.5577 | 4226.6164 | 3367.1203 | 5061.5669 | nan | 3532.7074 | | hf_GPT2_large | 4 | 209.2206 | 211.7662 | nan | nan | nan | 112.3685 | | dlrm | 2048 | 501.5169 | 490.557 | nan | 499.3797 | nan | nan | | hf_BigBird | 2 | 195.5097 | nan | nan | nan | nan | nan | | hf_Longformer | 0 | nan | nan | nan | nan | nan | nan | | moco | 0 | nan | nan | nan | nan | nan | nan | +-----------------------------------+------+-----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

huggingface suite with amp precision

Performance speedup ~~~ +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs | +-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+ | YituTechConvBert | 1 | 1.0223 | 0.8377 | 2.3103 | 0.0 | 4.8405 | 1.6583 | | MobileBertForMaskedLM | 32 | 1.0172 | 0.8422 | 2.0319 | 0.0 | 4.1581 | 1.8028 | | CamemBert | 1 | 1.0447 | 0.8521 | 1.8013 | 0.0 | 3.7763 | 1.7973 | | MobileBertForQuestionAnswering | 64 | 1.0168 | 0.8377 | 1.5134 | 0.0 | 3.6592 | 1.7789 | | MT5ForConditionalGeneration | 8 | 1.0153 | 0.8552 | 1.5607 | 0.8664 | 3.4685 | 2.5255 | | DistillGPT2 | 1 | 1.0365 | 0.8788 | 1.4926 | 0.0 | 2.704 | 2.0011 | | GPT2ForSequenceClassification | 4 | 1.0029 | 0.9693 | 0.0 | 0.5045 | 2.3192 | 2.2924 | | M2M100ForConditionalGeneration | 8 | 1.0065 | 0.9218 | 1.2466 | 0.7002 | 2.2067 | 1.7105 | | ElectraForQuestionAnswering | 64 | 1.0004 | 0.9797 | 0.7678 | 0.0 | 2.0342 | 1.9779 | | MegatronBertForQuestionAnswering | 16 | 1.0356 | 0.8521 | 1.0639 | 0.0 | 1.95 | 1.8031 | | PLBartForConditionalGeneration | 16 | 1.0125 | 0.8352 | 1.0355 | 0.0 | 1.8827 | 1.6882 | | MegatronBertForCausalLM | 16 | 1.0334 | 0.8527 | 0.9918 | 0.0 | 1.8022 | 1.7497 | | LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9803 | 0.7756 | 0.0 | 1.7954 | 1.7491 | | ElectraForCausalLM | 32 | 0.9998 | 0.9298 | 0.7149 | 0.0 | 1.7505 | 1.7562 | | XGLMForCausalLM | 8 | 1.0122 | 0.8251 | 0.934 | 0.0 | 1.7391 | 1.7801 | | T5Small | 1 | 1.0264 | 0.9043 | 1.1552 | 0.8555 | 1.7388 | 1.5015 | | AlbertForQuestionAnswering | 4 | 0.9999 | 0.8859 | 0.0 | 0.0 | 1.6477 | 1.6393 | | AlbertForMaskedLM | 4 | 1.0002 | 0.885 | 0.0 | 0.0 | 1.6361 | 1.6283 | | MBartForConditionalGeneration | 16 | 1.0151 | 0.8351 | 0.9222 | 0.0 | 1.6334 | 1.5862 | | 
PegasusForConditionalGeneration | 16 | 1.0127 | 0.8279 | 0.9093 | 0.6363 | 1.6253 | 1.529 | | LayoutLMForMaskedLM | 16 | 1.0008 | 0.9707 | 0.7557 | 0.0 | 1.606 | 1.5814 | | T5ForConditionalGeneration | 4 | 1.0079 | 0.9015 | 0.758 | 1.1634 | 1.6022 | 1.5676 | | OPTForCausalLM | 32 | 1.0068 | 0.9306 | 0.7722 | 0.3392 | 1.5325 | 1.5097 | | Speech2Text2ForCausalLM | 128 | 1.0069 | 0.9343 | 0.7224 | 0.8106 | 1.4927 | 1.4985 | | RobertaForQuestionAnswering | 128 | 1.0003 | 0.9849 | 0.7793 | 0.0 | 1.4461 | 1.4066 | | DistilBertForQuestionAnswering | 64 | 1.0007 | 0.9477 | 0.7432 | 0.3628 | 1.442 | 1.3996 | | BertForQuestionAnswering | 128 | 1.0 | 0.9745 | 0.7777 | 0.0 | 1.4387 | 1.4119 | | BartForConditionalGeneration | 2 | 1.0045 | 0.9697 | 0.0 | 0.0 | 1.4202 | 1.3891 | | BartForCausalLM | 4 | 1.0011 | 0.9698 | 0.758 | 0.0 | 1.4151 | 1.4143 | | RobertaForCausalLM | 64 | 1.0004 | 0.9603 | 0.7542 | 0.0 | 1.4004 | 1.3807 | | BlenderbotSmallForConditionalGeneration | 64 | 1.0076 | 0.8829 | 0.7443 | 0.0 | 1.379 | 1.3854 | | DebertaForMaskedLM | 4 | 0.9208 | 0.7366 | 0.8007 | 0.0 | 1.2999 | 1.1375 | | BertForMaskedLM | 64 | 1.0005 | 0.9564 | 0.7403 | 0.0 | 1.2988 | 1.2848 | | PLBartForCausalLM | 32 | 1.0067 | 0.9416 | 0.7926 | 0.8407 | 1.2218 | 1.2467 | | BlenderbotSmallForCausalLM | 64 | 1.0018 | 0.9261 | 0.718 | 0.0 | 1.2135 | 1.2264 | | DistilBertForMaskedLM | 64 | 1.0002 | 0.9392 | 0.7091 | 0.4614 | 1.2126 | 1.2118 | | MBartForCausalLM | 32 | 1.0036 | 0.9427 | 0.7569 | 0.0 | 1.1666 | 1.1628 | | TrOCRForCausalLM | 32 | 1.0017 | 0.9485 | 0.7578 | 0.0 | 1.1621 | 1.1628 | | DebertaForQuestionAnswering | 8 | 0.9861 | 0.8674 | 0.7219 | 0.0 | 1.1368 | 1.211 | | PegasusForCausalLM | 32 | 0.9991 | 0.9505 | 0.7532 | 0.8471 | 1.1354 | 1.1366 | | BigBird | 1 | 0.978 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | fail_to_run | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | fail_accuracy | fail_to_run | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass | pass | pass |
| AlbertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | pass | pass |
| CamemBert | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | fail_to_run | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | fail_to_run | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | pass | fail_to_run | fail_to_run | fail_to_run |
| BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+----------------+-----------------+-------------+------------------------+
~~~

Compilation latency (sec)
~~~
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
| DebertaForMaskedLM | 4 | 5.3801 | 11.2863 | 35.4169 | nan | 105.4539 | 39.865 |
| DebertaForQuestionAnswering | 8 | 5.2502 | 11.0802 | 36.1816 | nan | 103.2721 | 39.8822 |
| MobileBertForMaskedLM | 32 | 10.1116 | 35.1002 | 58.9281 | nan | 84.8629 | 81.364 |
| MobileBertForQuestionAnswering | 64 | 10.3878 | 35.1481 | 58.0209 | nan | 82.8535 | 79.2318 |
| XGLMForCausalLM | 8 | 3.179 | 13.6261 | 28.1182 | nan | 81.5125 | 79.9322 |
| M2M100ForConditionalGeneration | 8 | 4.2771 | 15.8666 | 30.3578 | 424.5895 | 74.5191 | 70.9121 |
| MBartForConditionalGeneration | 16 | 4.0745 | 17.4895 | 30.0541 | nan | 60.9653 | 59.3313 |
| PegasusForConditionalGeneration | 16 | 3.8294 | 17.1739 | 27.4618 | 456.3515 | 60.6207 | 55.9846 |
| BartForConditionalGeneration | 2 | 4.0219 | 17.3592 | nan | nan | 59.9891 | 57.8295 |
| YituTechConvBert | 1 | 2.8404 | 11.0009 | 16.5533 | nan | 52.7953 | 48.6038 |
| MegatronBertForCausalLM | 16 | 4.1548 | 14.6527 | 22.9506 | nan | 48.7612 | 46.4789 |
| MegatronBertForQuestionAnswering | 16 | 3.9894 | 14.5425 | 22.8854 | nan | 47.1127 | 45.9904 |
| MT5ForConditionalGeneration | 8 | 4.0593 | 13.2671 | 21.6436 | 182.287 | 44.9256 | 42.6721 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.4997 | 11.5858 | 18.7286 | nan | 40.9287 | 39.0688 |
| T5Small | 1 | 2.6591 | 9.1279 | 13.151 | 109.2223 | 33.6601 | 32.7428 |
| T5ForConditionalGeneration | 4 | 2.6646 | 9.0338 | 13.4935 | 112.8446 | 33.5673 | 32.4119 |
| PLBartForConditionalGeneration | 16 | 2.1074 | 8.7542 | 13.4868 | nan | 33.5127 | 33.565 |
| LayoutLMForSequenceClassification | 16 | 2.3135 | 7.7566 | 11.8572 | nan | 31.3464 | 29.3416 |
| ElectraForCausalLM | 32 | 2.0451 | 7.4345 | 11.4587 | nan | 30.7162 | 28.5186 |
| PegasusForCausalLM | 32 | 1.5579 | 6.588 | 10.2721 | 137.4212 | 26.5697 | 24.9646 |
| LayoutLMForMaskedLM | 16 | 2.4908 | 7.7934 | 12.034 | nan | 26.472 | 24.7561 |
| MBartForCausalLM | 32 | 1.4904 | 6.6255 | 10.1448 | nan | 25.2115 | 23.8657 |
| RobertaForCausalLM | 64 | 1.8812 | 7.3097 | 10.4825 | nan | 24.9204 | 24.3817 |
| BertForMaskedLM | 64 | 1.8909 | 7.1967 | 11.0071 | nan | 24.4951 | 23.6636 |
| ElectraForQuestionAnswering | 64 | 2.001 | 7.3002 | 10.797 | nan | 24.4523 | 23.0111 |
| OPTForCausalLM | 32 | 1.5718 | 7.2784 | 11.4358 | 131.0921 | 24.0511 | 22.5163 |
| TrOCRForCausalLM | 32 | 1.4793 | 6.6125 | 9.8507 | nan | 23.9797 | 23.0131 |
| BartForCausalLM | 4 | 1.5506 | 6.6132 | 9.8652 | nan | 23.7612 | 22.66 |
| BertForQuestionAnswering | 128 | 1.8734 | 7.2512 | 11.0258 | nan | 23.5406 | 22.859 |
| RobertaForQuestionAnswering | 128 | 1.9098 | 7.1241 | 10.5937 | nan | 22.7278 | 21.4792 |
| CamemBert | 1 | 1.9359 | 7.5018 | 10.3727 | nan | 21.8414 | 20.8427 |
| AlbertForMaskedLM | 4 | 1.7175 | 7.3031 | nan | nan | 21.1479 | 20.3657 |
| AlbertForQuestionAnswering | 4 | 1.8347 | 7.0632 | nan | nan | 20.5988 | 19.4607 |
| GPT2ForSequenceClassification | 4 | 1.8037 | 6.5065 | nan | 110.2534 | 19.9998 | 19.5073 |
| BlenderbotSmallForCausalLM | 64 | 1.0406 | 4.4795 | 6.8495 | nan | 17.6594 | 16.8046 |
| Speech2Text2ForCausalLM | 128 | 0.9075 | 3.4969 | 5.4033 | 64.0122 | 16.2453 | 14.7561 |
| PLBartForCausalLM | 32 | 0.8422 | 3.4917 | 4.9935 | 75.2534 | 15.1231 | 15.0981 |
| DistilBertForMaskedLM | 64 | 0.8394 | 3.5858 | 6.2482 | 62.5082 | 14.807 | 14.1055 |
| DistilBertForQuestionAnswering | 64 | 0.8397 | 3.7957 | 5.8508 | 68.829 | 14.2802 | 13.5993 |
| DistillGPT2 | 1 | 0.9719 | 3.3796 | 4.7856 | nan | 14.0105 | 13.6432 |
| BigBird | 1 | 4.0268 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+---------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | nan | 1.1872 | 1.0783 | 1.1717 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | nan | nan | 1.0323 | 1.5286 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 0.3748 | nan | 1.0218 | 1.0756 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7431 | nan | nan | 1.0074 | 1.5007 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.3632 | nan | 0.9844 | 1.025 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.3554 | nan | 0.9837 | 1.0483 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.3384 | nan | 0.9829 | 1.0613 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | nan | nan | 0.9691 | 1.1807 |
| T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.3625 | 1.0966 | 0.9658 | 1.1446 |
| T5Small | 1 | 1.0 | 0.8935 | 0.3618 | 0.9973 | 0.9652 | 1.1096 |
| PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.4175 | 1.1 | 0.9327 | 0.9847 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.4377 | 1.1462 | 0.9159 | 1.0769 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.3662 | nan | 0.9124 | 0.9464 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.396 | nan | 0.9037 | 1.0411 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | 0.3996 | nan | 0.9006 | 0.9641 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.411 | nan | 0.893 | 1.0053 |
| MegatronBertForCausalLM | 16 | 1.0001 | 0.8597 | 0.4044 | nan | 0.8919 | 1.0207 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.3468 | 1.0551 | 0.89 | 0.9848 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.3787 | nan | 0.8834 | 0.9285 |
| RobertaForCausalLM | 64 | 0.9999 | 0.8994 | 0.3788 | nan | 0.8828 | 0.9282 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | 0.3997 | nan | 0.8816 | 0.9425 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | 0.4002 | nan | 0.8755 | 1.0595 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | 0.4067 | 0.919 | 0.875 | 0.919 |
| OPTForCausalLM | 32 | 1.0003 | 0.8678 | 0.3725 | 1.0333 | 0.8727 | 0.9449 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | 0.4146 | nan | 0.8523 | 0.9876 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.86 | 0.3635 | 1.0792 | 0.8215 | 0.8801 |
| CamemBert | 1 | 0.999 | 0.8143 | 0.4159 | nan | 0.8065 | 0.9306 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9234 | 0.4336 | nan | 0.8055 | 0.9516 |
| DistillGPT2 | 1 | 0.9975 | 0.8033 | 0.4021 | nan | 0.8048 | 0.9949 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.3532 | 1.0437 | 0.8039 | 0.898 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.3978 | 0.9947 | 0.7975 | 0.8675 |
| ElectraForCausalLM | 32 | 0.9977 | 0.848 | 0.3928 | nan | 0.7949 | 0.8607 |
| YituTechConvBert | 1 | 0.9718 | 0.8664 | 0.4317 | nan | 0.7909 | 0.9314 |
| BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.3687 | nan | 0.778 | 0.859 |
| M2M100ForConditionalGeneration | 8 | 0.9892 | 0.9674 | 0.4275 | 1.0461 | 0.752 | 0.9892 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | 0.3466 | nan | 0.5931 | 0.7994 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | 0.3107 | nan | 0.4995 | 0.635 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9825 | 0.3622 | nan | 0.409 | 1.026 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3251 | nan | 0.3071 | 1.1616 |
| BigBird | 1 | 0.9748 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| AlbertForMaskedLM | 4 | 266.4648 | 301.2449 | nan | nan | 163.2613 | 163.9857 |
| AlbertForQuestionAnswering | 4 | 264.314 | 298.5267 | nan | nan | 160.8391 | 161.5659 |
| BartForConditionalGeneration | 2 | 135.7444 | 140.5032 | nan | nan | 95.6537 | 97.8556 |
| BlenderbotSmallForConditionalGeneration | 64 | 109.2364 | 127.0885 | 151.5615 | nan | 79.9387 | 79.588 |
| BartForCausalLM | 4 | 111.9369 | 115.5414 | 147.9002 | nan | 79.15 | 79.0943 |
| BertForQuestionAnswering | 128 | 110.4708 | 113.2358 | 142.0924 | nan | 76.9385 | 78.3261 |
| RobertaForQuestionAnswering | 128 | 110.9423 | 112.6007 | 142.3053 | nan | 76.8231 | 78.8463 |
| LayoutLMForMaskedLM | 16 | 111.9368 | 115.4 | 148.1047 | nan | 70.2275 | 70.8414 |
| MBartForConditionalGeneration | 16 | 103.2824 | 126.8209 | 114.4297 | nan | 66.9643 | 70.8351 |
| PegasusForConditionalGeneration | 16 | 104.1201 | 126.843 | 112.8051 | 164.4206 | 66.8 | 72.9854 |
| DebertaForQuestionAnswering | 8 | 76.1169 | 86.5159 | 103.9189 | nan | 66.1531 | 61.7785 |
| T5ForConditionalGeneration | 4 | 100.9954 | 112.8121 | 134.1462 | 86.6883 | 63.5187 | 64.378 |
| PegasusForCausalLM | 32 | 68.7242 | 72.7706 | 91.5254 | 81.8106 | 60.5768 | 60.3738 |
| MBartForCausalLM | 32 | 69.6191 | 74.0819 | 92.28 | nan | 59.9933 | 59.9371 |
| TrOCRForCausalLM | 32 | 69.6037 | 75.1835 | 91.9451 | nan | 59.9421 | 59.9351 |
| BertForMaskedLM | 64 | 75.4725 | 78.9032 | 101.889 | nan | 58.1885 | 58.7378 |
| RobertaForCausalLM | 64 | 80.2354 | 83.6752 | 106.5029 | nan | 57.4648 | 58.2262 |
| ElectraForQuestionAnswering | 64 | 114.7386 | 116.8161 | 149.0575 | nan | 56.3347 | 57.8761 |
| LayoutLMForSequenceClassification | 16 | 97.1061 | 99.1705 | 125.3791 | nan | 54.1191 | 55.5783 |
| MobileBertForQuestionAnswering | 64 | 190.5361 | 246.6218 | 118.0948 | nan | 53.3437 | 105.1289 |
| XGLMForCausalLM | 8 | 87.3977 | 107.6528 | 93.7352 | nan | 52.8369 | 63.9088 |
| M2M100ForConditionalGeneration | 8 | 124.6816 | 120.6299 | 88.5735 | 154.6503 | 50.6169 | 76.6523 |
| DebertaForMaskedLM | 4 | 75.1184 | 97.4156 | 78.3828 | nan | 50.3674 | 56.7563 |
| ElectraForCausalLM | 32 | 87.5247 | 93.7665 | 122.1338 | nan | 49.8239 | 49.7113 |
| BlenderbotSmallForCausalLM | 64 | 58.6216 | 63.6498 | 81.5604 | nan | 48.3584 | 48.0312 |
| MegatronBertForCausalLM | 16 | 87.7817 | 96.1121 | 83.9011 | nan | 47.167 | 57.5141 |
| MobileBertForMaskedLM | 32 | 214.0348 | 241.6149 | 110.1628 | nan | 43.5724 | 101.4571 |
| MegatronBertForQuestionAnswering | 16 | 79.9413 | 97.1358 | 76.7106 | nan | 43.4894 | 47.403 |
| GPT2ForSequenceClassification | 4 | 91.9111 | 93.5004 | nan | 179.6119 | 39.0465 | 39.8145 |
| T5Small | 1 | 63.1919 | 73.9268 | 53.1865 | 71.6808 | 38.9087 | 48.7533 |
| DistilBertForMaskedLM | 64 | 45.0861 | 48.1106 | 63.7348 | 98.0482 | 37.2482 | 37.3007 |
| OPTForCausalLM | 32 | 53.6738 | 58.4399 | 69.8753 | 159.2738 | 35.5267 | 35.821 |
| PLBartForCausalLM | 32 | 39.0895 | 41.7897 | 49.4286 | 46.4865 | 31.6408 | 31.7126 |
| PLBartForConditionalGeneration | 16 | 55.6642 | 66.8187 | 53.2622 | nan | 30.5678 | 34.4809 |
| MT5ForConditionalGeneration | 8 | 104.1116 | 122.9308 | 57.7193 | 102.1241 | 26.588 | 37.1221 |
| DistilBertForQuestionAnswering | 64 | 30.5677 | 33.1067 | 41.1162 | 84.0993 | 21.0854 | 21.7901 |
| Speech2Text2ForCausalLM | 128 | 30.3003 | 32.4641 | 42.2807 | 37.5361 | 20.5193 | 20.4287 |
| YituTechConvBert | 1 | 62.0851 | 74.0989 | 27.1879 | nan | 13.8072 | 39.9998 |
| CamemBert | 1 | 37.0307 | 46.364 | 21.9437 | nan | 11.158 | 22.7364 |
| DistillGPT2 | 1 | 20.2655 | 23.6782 | 15.9943 | nan | 8.0009 | 10.8269 |
| BigBird | 1 | 192.3145 | nan | nan | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
~~~
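
The summary at the top of this dashboard reports a geometric mean speedup per backend. As a rough illustration of how such a number could be derived from the per-model speedup rows, here is a minimal sketch. The helper names `parse_row` and `geomean_speedup` are hypothetical and are not part of the benchmark scripts. It also assumes that a speedup of `0.0` marks a model that did not run or failed the accuracy check on that backend, so those entries are skipped; that convention is inferred from the tables, not confirmed by the benchmark code. The three sample rows are copied from the huggingface speedup table.

```python
import math

# Backend columns in the order they appear in the tables above.
COLUMNS = [
    "eager", "aot_eager", "aot_cudagraphs",
    "nvprims_nvfuser", "inductor", "inductor_no_cudagraphs",
]


def parse_row(line):
    """Split one '| name | bs | v1 | ... |' table row into (name, bs, values)."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    name, bs = cells[0], int(cells[1])
    values = dict(zip(COLUMNS, (float(c) for c in cells[2:])))
    return name, bs, values


def geomean_speedup(rows, backend):
    """Geometric mean of one backend's speedups.

    Assumption: 0.0 (and nan) entries mark failed models and are excluded.
    The comparison `v > 0` is False for nan, so nan is dropped as well.
    """
    vals = [values[backend] for _, _, values in rows if values[backend] > 0]
    if not vals:
        return float("nan")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


# Three rows copied from the huggingface speedup table.
SAMPLE_ROWS = [
    "| PegasusForConditionalGeneration | 16 | 1.0127 | 0.8279 | 0.9093 | 0.6363 | 1.6253 | 1.529 |",
    "| LayoutLMForMaskedLM | 16 | 1.0008 | 0.9707 | 0.7557 | 0.0 | 1.606 | 1.5814 |",
    "| BigBird | 1 | 0.978 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |",
]

rows = [parse_row(r) for r in SAMPLE_ROWS]
for backend in COLUMNS:
    print(f"{backend}: {geomean_speedup(rows, backend):.4f}x")
```

Because only three rows are used here, the printed values will not match the suite-level numbers in the summary table; feeding in every row of a speedup table would be needed for that comparison.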

timm_models suite with amp precision

Performance speedup
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| regnety_002 | 128 | 0.9781 | 0.9404 | 1.1136 | 0.8617 | 2.1425 | 1.4351 |
| ghostnet_100 | 128 | 1.0033 | 0.9796 | 0.8937 | 0.9925 | 2.1277 | 1.7897 |
| xcit_large_24_p8_224 | 5 | 1.0008 | 0.0 | 0.0 | 0.0 | 2.1168 | 1.8655 |
| lcnet_050 | 128 | 0.9658 | 0.947 | 0.8468 | 1.0335 | 2.0285 | 1.6218 |
| tnt_s_patch16_224 | 128 | 0.9999 | 0.9969 | 0.0 | 0.0 | 1.9232 | 1.8934 |
| twins_pcpvt_base | 64 | 1.0062 | 0.93 | 0.9617 | 0.0 | 1.756 | 1.64 |
| hrnet_w18 | 128 | 1.0034 | 1.0277 | 0.8658 | 0.0 | 1.6901 | 1.4398 |
| res2net101_26w_4s | 64 | 1.0038 | 1.0123 | 0.9467 | 0.0 | 1.6128 | 1.3283 |
| coat_lite_mini | 128 | 1.0 | 0.9885 | 0.8421 | 1.1522 | 1.5891 | 1.5719 |
| dla102 | 128 | 1.0 | 0.9958 | 0.8306 | 1.3151 | 1.5816 | 1.5483 |
| nfnet_l0 | 128 | 0.999 | 0.8101 | 0.7108 | 0.8479 | 1.558 | 1.4681 |
| volo_d1_224 | 64 | 0.9999 | 0.9938 | 0.839 | 0.0 | 1.5526 | 1.5209 |
| resnest101e | 64 | 1.0036 | 0.991 | 0.8138 | 0.0 | 1.5479 | 1.5026 |
| gmlp_s16_224 | 128 | 0.9999 | 0.9956 | 0.7866 | 1.0145 | 1.5229 | 1.5014 |
| gluon_inception_v3 | 128 | 1.0 | 0.9962 | 0.8543 | 1.1415 | 1.5057 | 1.4717 |
| adv_inception_v3 | 128 | 0.9999 | 0.9964 | 0.8533 | 1.1424 | 1.5034 | 1.464 |
| inception_v3 | 128 | 0.9998 | 0.9965 | 0.8532 | 1.1417 | 1.5005 | 1.4662 |
| dm_nfnet_f0 | 128 | 0.9984 | 0.9993 | 0.8805 | 0.9227 | 1.5002 | 1.4296 |
| gmixer_24_224 | 128 | 0.9999 | 0.8807 | 0.7214 | 0.9232 | 1.4936 | 1.4814 |
| res2net50_14w_8s | 128 | 1.0001 | 0.9927 | 0.8097 | 0.9912 | 1.4852 | 1.4124 |
| swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9588 | 0.0 | 0.0 | 1.4813 | 1.4135 |
| mobilenetv3_large_100 | 128 | 0.9531 | 0.9449 | 0.7832 | 0.9312 | 1.4485 | 1.4297 |
| selecsls42b | 128 | 0.9999 | 0.9956 | 0.8424 | 1.2844 | 1.443 | 1.4108 |
| res2next50 | 128 | 0.9994 | 0.9953 | 0.8336 | 1.1382 | 1.4175 | 1.3462 |
| mnasnet_100 | 128 | 0.9535 | 0.9431 | 0.7895 | 1.1803 | 1.416 | 1.4608 |
| cait_m36_384 | 4 | 1.0005 | 1.0096 | 0.0 | 0.0 | 1.4152 | 1.3657 |
| fbnetv3_b | 128 | 0.9526 | 0.9397 | 0.7747 | 0.0 | 1.4041 | 1.3937 |
| mobilenetv2_100 | 128 | 0.951 | 0.9421 | 0.7223 | 1.1218 | 1.4007 | 1.4335 |
| crossvit_9_240 | 128 | 1.0001 | 0.9942 | 0.8382 | 0.9173 | 1.3954 | 1.3682 |
| convit_base | 64 | 1.0 | 0.9968 | 0.8322 | 1.2379 | 1.3906 | 1.3175 |
| ese_vovnet19b_dw | 128 | 0.9704 | 0.9642 | 0.7679 | 1.1266 | 1.3718 | 1.3793 |
| mobilevit_s | 64 | 0.9732 | 0.8144 | 0.6562 | 0.0 | 1.3608 | 1.3593 |
| jx_nest_base | 32 | 1.0 | 0.9925 | 0.7963 | 0.0 | 1.3602 | 1.3268 |
| fbnetc_100 | 128 | 0.9523 | 0.9398 | 0.7932 | 1.1204 | 1.3521 | 1.3732 |
| spnasnet_100 | 128 | 0.9461 | 0.936 | 0.778 | 1.0918 | 1.3507 | 1.3272 |
| resmlp_12_224 | 128 | 1.0 | 0.9986 | 0.7831 | 1.4885 | 1.3303 | 1.2978 |
| poolformer_m36 | 64 | 0.9998 | 0.9983 | 0.8072 | 0.0 | 1.326 | 1.2952 |
| tf_efficientnet_b0 | 128 | 0.9652 | 0.8074 | 0.6667 | 0.9502 | 1.3246 | 1.3554 |
| botnet26t_256 | 128 | 0.9783 | 0.9733 | 0.8124 | 1.2779 | 1.3236 | 1.3302 |
| pit_b_224 | 64 | 0.9998 | 0.9953 | 0.8207 | 0.9715 | 1.3156 | 1.3091 |
| pnasnet5large | 16 | 1.0051 | 1.0406 | 0.8454 | 0.0 | 1.3115 | 1.2719 |
| cspdarknet53 | 64 | 0.9431 | 0.9343 | 0.7569 | 1.0914 | 1.3027 | 1.3242 |
| rexnet_100 | 128 | 0.9656 | 0.8497 | 0.6913 | 0.0 | 1.2723 | 1.2774 |
| tinynet_a | 128 | 0.9723 | 0.8029 | 0.6588 | 0.7806 | 1.2714 | 1.3288 |
| eca_botnext26ts_256 | 128 | 0.9801 | 0.8115 | 0.6714 | 1.072 | 1.2712 | 1.2678 |
| mixer_b16_224 | 128 | 0.9999 | 0.9976 | 0.8028 | 0.9024 | 1.2593 | 1.2499 |
| beit_base_patch16_224 | 64 | 1.0 | 0.9785 | 0.0 | 0.0 | 1.2465 | 1.2307 |
| deit_base_distilled_patch16_224 | 64 | 0.9997 | 0.9913 | 0.7969 | 0.9754 | 1.2391 | 1.222 |
| visformer_small | 128 | 0.9996 | 0.999 | 0.8425 | 0.0 | 1.231 | 1.1753 |
| dpn107 | 32 | 0.9569 | 0.9281 | 0.7566 | 0.0 | 1.2072 | 1.183 |
| sebotnet33ts_256 | 64 | 0.9657 | 0.8369 | 0.6797 | 0.9712 | 1.2037 | 1.1982 |
| tf_mixnet_l | 128 | 0.9785 | 0.9092 | 0.7936 | 0.0 | 1.1794 | 1.1732 |
| mixnet_l | 128 | 0.9797 | 0.9055 | 0.7949 | 0.0 | 1.1618 | 1.1555 |
| gluon_xception65 | 32 | 0.9996 | 0.99 | 0.7474 | 0.0 | 1.159 | 1.1246 |
| vit_base_patch16_224 | 64 | 1.0 | 0.9936 | 0.8311 | 0.9109 | 1.1576 | 1.1465 |
| swsl_resnext101_32x16d | 32 | 0.9989 | 0.9815 | 0.8092 | 0.0 | 1.1355 | 1.0556 |
| repvgg_a2 | 128 | 0.9426 | 0.9346 | 0.7987 | 1.0684 | 1.1034 | 1.1196 |
| gernet_l | 128 | 0.947 | 0.9378 | 0.7679 | 1.063 | 1.0641 | 1.0776 |
| convmixer_768_32 | 32 | 0.9999 | 0.9982 | 0.9233 | 0.0 | 1.056 | 1.0506 |
| convnext_base | 64 | 0.9995 | 0.9953 | 0.8004 | 0.0 | 0.6631 | 0.6452 |
| eca_halonext26ts | 128 | 0.9813 | 0.8163 | 0.679 | 0.0 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Accuracy
~~~
+---------------------------------+----+-------------+---------------+----------------+-----------------+---------------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------------+---------------+----------------+-----------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| botnet26t_256 | 2 | pass | pass | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | fail_accuracy | pass | pass | pass |
| convnext_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| dpn107 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| jx_nest_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | fail_to_run | pass | pass |
| resnest101e | 2 | pass | pass | pass | fail_to_run | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | fail_to_run | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | fail_to_run | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | fail_to_run | fail_to_run | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | fail_to_run | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | fail_accuracy | fail_to_run | pass | pass |
| convmixer_768_32 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| dm_nfnet_f0 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| hrnet_w18 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| mobilenetv2_100 | 2 | pass | pass | pass | fail_accuracy | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass | pass | pass |
| gmlp_s16_224 | 2 | pass | pass | pass | pass | pass | pass |
| crossvit_9_240 | 2 | pass | pass | pass | pass | pass | pass |
| cspdarknet53 | 2 | pass | pass | pass | pass | pass | pass |
| deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | pass | pass |
| dla102 | 2 | pass | pass | pass | pass | pass | pass |
| eca_botnext26ts_256 | 2 | pass | pass | pass | pass | pass | pass |
| ese_vovnet19b_dw | 2 | pass | pass | pass | pass | pass | pass |
| fbnetc_100 | 2 | pass | pass | pass | pass | pass | pass |
| gernet_l | 2 | pass | pass | pass | pass | pass | pass |
| ghostnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass | pass | pass |
| gmixer_24_224 | 2 | pass | pass | pass | pass | pass | pass |
| gluon_inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| inception_v3 | 2 | pass | pass | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass | pass | pass |
| lcnet_050 | 2 | pass | pass | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass | pass | pass |
| mnasnet_100 | 2 | pass | pass | pass | pass | pass | pass |
| mixnet_l | 2 | pass | pass | pass | pass | pass | pass |
| mixer_b16_224 | 2 | pass | pass | pass | pass | pass | pass |
| convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| eca_halonext26ts | 2 | pass | pass | pass | fail_to_run | fail_to_run | fail_accuracy |
| gluon_xception65 | 2 | pass | pass | pass | pass | fail_accuracy | fail_accuracy |
| poolformer_m36 | 2 | pass | pass | pass | fail_to_run | fail_accuracy | fail_accuracy |
| fbnetv3_b | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
| spnasnet_100 | 2 | pass | pass | pass | fail_accuracy | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------------+---------------+----------------+-----------------+---------------+------------------------+
~~~

Compilation latency (sec)
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| hrnet_w18 | 128 | 6.931 | 30.4857 | 57.3067 | nan | 150.2292 | 136.4794 |
| twins_pcpvt_base | 64 | 2.9951 | 15.3979 | 26.8133 | nan | 130.834 | 129.5663 |
| pnasnet5large | 16 | 5.6391 | 23.8783 | 41.1797 | nan | 92.764 | 87.4451 |
| xcit_large_24_p8_224 | 5 | 3.5596 | nan | nan | nan | 92.0008 | 88.4883 |
| cait_m36_384 | 4 | 3.815 | 19.5988 | nan | nan | 86.6496 | 82.3462 |
| swin_base_patch4_window7_224 | 64 | 3.2903 | 13.3388 | nan | nan | 82.786 | 80.3175 |
| resnest101e | 64 | 3.7295 | 16.5156 | 27.453 | nan | 79.7768 | 72.6928 |
| convnext_base | 64 | 1.5651 | 6.9611 | 11.4667 | nan | 77.0414 | 72.2102 |
| mobilevit_s | 64 | 2.0202 | 7.6097 | 15.5645 | nan | 71.0608 | 67.871 |
| jx_nest_base | 32 | 2.0348 | 9.2647 | 16.102 | nan | 65.5891 | 63.0269 |
| res2net101_26w_4s | 64 | 3.5565 | 16.9651 | 28.2784 | nan | 64.5817 | 60.2342 |
| coat_lite_mini | 128 | 1.3165 | 5.4821 | 8.4032 | 113.7861 | 61.5139 | 59.5614 |
| res2net50_14w_8s | 128 | 3.1746 | 14.6573 | 24.9225 | 338.0982 | 57.6172 | 53.9023 |
| poolformer_m36 | 64 | 1.9082 | 7.4302 | 12.2258 | nan | 56.0511 | 52.1659 |
| sebotnet33ts_256 | 64 | 1.9509 | 6.2361 | 13.7414 | 150.4399 | 48.1373 | 46.034 |
| gmlp_s16_224 | 128 | 1.4987 | 7.4523 | 12.3432 | 197.9731 | 47.2443 | 44.1018 |
| dpn107 | 32 | 4.306 | 13.9007 | 39.7645 | nan | 47.0541 | 43.9241 |
| crossvit_9_240 | 128 | 1.872 | 8.655 | 13.5658 | 211.5802 | 45.8403 | 43.3586 |
| fbnetv3_b | 128 | 3.531 | 11.7421 | 28.2677 | nan | 45.6063 | 42.7345 |
| gluon_xception65 | 32 | 2.3146 | 11.0104 | 18.8315 | nan | 45.2269 | 42.718 |
| volo_d1_224 | 64 | 1.4525 | 7.6563 | 12.9226 | nan | 45.068 | 42.3737 |
| tnt_s_patch16_224 | 128 | 2.0252 | 11.3096 | nan | nan | 43.6365 | 40.114 |
| gluon_inception_v3 | 128 | 1.8479 | 8.4126 | 13.8175 | 190.1402 | 39.904 | 36.691 |
| eca_botnext26ts_256 | 128 | 1.5477 | 5.0427 | 10.481 | 124.6177 | 39.792 | 39.2614 |
| inception_v3 | 128 | 1.8263 | 8.4564 | 13.5193 | 192.7768 | 39.4032 | 36.5655 |
| dla102 | 128 | 2.1101 | 9.6008 | 15.9518 | 256.3065 | 39.3235 | 36.4951 |
| ghostnet_100 | 128 | 3.3877 | 9.9212 | 14.8002 | 199.2491 | 39.1227 | 36.6041 |
| adv_inception_v3 | 128 | 1.8288 | 8.4327 | 13.5214 | 189.3149 | 38.6505 | 37.1302 |
| gmixer_24_224 | 128 | 1.6172 | 8.3054 | 13.7966 | 188.9793 | 38.152 | 35.4031 |
| tf_mixnet_l | 128 | 6.2003 | 12.9642 | 27.2885 | nan | 37.9262 | 36.0038 |
| swsl_resnext101_32x16d | 32 | 2.196 | 9.2607 | 14.7766 | nan | 37.2668 | 34.7827 |
| mixnet_l | 128 | 5.7372 | 12.878 | 26.4945 | nan | 37.1291 | 35.036 |
| botnet26t_256 | 128 | 1.5761 | 4.4983 | 9.279 | 94.729 | 35.0677 | 34.1297 |
| dm_nfnet_f0 | 128 | 2.3046 | 7.4564 | 11.0241 | 165.0416 | 33.9742 | 32.2667 |
| res2next50 | 128 | 1.7858 | 8.2631 | 13.0833 | 205.2612 | 32.7063 | 30.2447 |
| convit_base | 64 | 1.3665 | 6.2292 | 9.8919 | 148.4027 | 31.9169 | 30.7947 |
| tinynet_a | 128 | 2.3455 | 8.1442 | 19.9872 | 202.1464 | 31.7491 | 30.1005 |
| rexnet_100 | 128 | 2.1214 | 7.4928 | 17.1434 | nan | 31.5534 | 29.7933 |
| tf_efficientnet_b0 | 128 | 2.0551 | 7.0695 | 16.2992 | 184.1684 | 27.8237 | 25.4269 |
| cspdarknet53 | 64 | 2.6122 | 7.5394 | 18.6644 | 152.8653 | 27.1407 | 25.0663 |
| spnasnet_100 | 128 | 2.3143 | 6.7729 | 17.1801 | 136.5093 | 26.5412 | 24.7965 |
| mixer_b16_224 | 128 | 0.8987 | 3.7968 | 5.9729 | 87.4798 | 26.3356 | 25.4143 |
| fbnetc_100 | 128 | 2.3512 | 7.072 | 17.4926 | 139.6215 | 25.8561 | 24.325 |
| convmixer_768_32 | 32 | 1.3946 | 6.5936 | 9.9769 | nan | 25.7018 | 24.5606 |
| pit_b_224 | 64 | 1.248 | 5.4003 | 8.8056 | 109.6848 | 25.1364 | 23.8347 |
| deit_base_distilled_patch16_224 | 64 | 1.0536 | 5.3764 | 7.4986 | 88.2815 | 25.1177 | 25.2995 |
| visformer_small | 128 | 1.0378 | 4.2325 | 6.5274 | nan | 25.1067 | 23.9944 |
| vit_base_patch16_224 | 64 | 1.1538 | 4.7124 | 8.0766 | 90.9786 | 24.8839 | 23.7608 |
| nfnet_l0 | 128 | 2.0544 | 7.5252 | 10.9787 | 150.1953 | 24.7692 | 22.9685 |
| resmlp_12_224 | 128 | 0.7995 | 3.1912 | 4.872 | 50.0284 | 24.6682 | 22.7338 |
| mobilenetv3_large_100 | 128 | 1.8934 | 5.835 | 13.4829 | 146.7477 | 23.9546 | 23.1319 |
| beit_base_patch16_224 | 64 | 1.4003 | 5.8776 | nan | nan | 23.4187 | 21.9121 |
| mobilenetv2_100 | 128 | 1.9196 | 5.658 | 12.9948 | 117.1492 | 22.5906 | 21.5039 |
| repvgg_a2 | 128 | 2.1604 | 6.1376 | 15.4625 | 200.9345 | 22.2589 | 21.2079 |
| mnasnet_100 | 128 | 1.8818 | 5.5318 | 13.3177 | 109.261 | 21.8075 | 19.8951 |
| regnety_002 | 128 | 1.7886 | 5.8636 | 13.114 | 118.7427 | 21.8 | 20.2519 |
| gernet_l | 128 | 2.1647 | 6.2133 | 15.5146 | 115.6391 | 20.999 | 19.9016 |
| selecsls42b | 128 | 0.9436 | 3.8595 | 5.9259 | 91.239 | 18.5606 | 17.3978 |
| lcnet_050 | 128 | 1.1515 | 3.4232 | 7.5048 | 83.467 | 15.2332 | 14.6341 |
| ese_vovnet19b_dw | 128 | 1.1361 | 3.1755 | 6.8034 | 68.4644 | 14.4607 | 13.635 |
| eca_halonext26ts | 128 | 1.6025 | 5.1343 | 11.0416 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Peak Memory Compression Ratio
~~~
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
| tinynet_a | 128 | 0.9889 | 0.7884 | 0.2766 | 0.4726 | 1.3706 | 1.5063 |
| gmixer_24_224 | 128 | 0.9926 | 0.9699 | 0.3052 | 0.5979 | 1.3138 | 1.3772 |
| gmlp_s16_224 | 128 | 0.9937 | 0.9715 | 0.3561 | 1.3557 | 1.2842 | 1.2997 |
| tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 0.2666 | 0.548 | 1.1886 | 1.3558 |
| mobilevit_s | 64 | 0.9931 | 0.7669 | 0.2734 | nan | 1.1741 | 1.3111 |
| pnasnet5large | 16 | 1.0575 | 0.9913 | 0.3633 | nan | 1.1605 | 1.2933 |
| rexnet_100 | 128 | 0.9885 | 0.785 | 0.2849 | nan | 1.1474 | 1.3179 |
| eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 0.2672 | 0.476 | 1.1068 | 1.2643 |
| poolformer_m36 | 64 | 0.9979 | 0.9432 | 0.3413 | nan | 1.1021 | 1.1167 |
| resnest101e | 64 | 0.995 | 0.9889 | 0.3473 | nan | 1.0592 | 1.1461 |
| mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 0.3109 | 0.9118 | 1.0587 | 1.152 |
| tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | nan | nan | 1.0576 | 1.1456 |
| convit_base | 64 | 0.9966 | 0.8516 | 0.3333 | 1.3108 | 1.0441 | 1.1492 |
| dm_nfnet_f0 | 128 | 0.969 | 0.898 | 0.3556 | 0.4814 | 1.0332 | 1.1293 |
| nfnet_l0 | 128 | 0.9884 | 0.8173 | 0.2681 | 0.3766 | 1.0332 | 1.1822 |
| volo_d1_224 | 64 | 0.9965 | 0.9475 | 0.3421 | nan | 1.0227 | 1.1355 |
| beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | nan | nan | 0.9889 | 1.0322 |
| fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.3151 | nan | 0.9862 | 1.0421 |
| convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.3455 | nan | 0.9746 | 0.9788 |
| visformer_small | 128 | 0.9899 | 0.9259 | 0.3468 | nan | 0.9622 | 1.0521 |
| dla102 | 128 | 0.9694 | 0.912 | 0.3362 | 0.9309 | 0.9555 | 1.031 |
| ghostnet_100 | 128 | 0.9756 | 0.87 | 0.337 | 0.8972 | 0.9489 | 1.0707 |
| twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.3403 | nan | 0.9397 | 1.076 |
| tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.2877 | nan | 0.9363 | 1.0878 |
| xcit_large_24_p8_224 | 5 | 0.9975 | nan | nan | nan | 0.932 | 0.9931 |
| mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.3302 | 0.7796 | 0.9307 | 1.0268 |
| cait_m36_384 | 4 | 0.9998 | 0.9141 | nan | nan | 0.9288 | 0.9735 |
| ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.3273 | 0.8368 | 0.9181 | 1.0684 |
| pit_b_224 | 64 | 0.999 | 0.8053 | 0.326 | 1.1764 | 0.9165 | 1.1168 |
| swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.3676 | nan | 0.9112 | 0.981 |
| dpn107 | 32 | 0.997 | 0.9097 | 0.3529 | nan | 0.9069 | 0.9966 |
| res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.3336 | nan | 0.8977 | 0.973 |
| inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 |
| gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 |
| adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.3342 | 0.8578 | 0.8975 | 1.0248 |
| gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.3349 | nan | 0.8975 | 0.9763 |
| fbnetc_100 | 128 | 0.98 | 0.8491 | 0.3307 | 0.7468 | 0.8973 | 0.9876 |
| hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.3347 | nan | 0.8969 | 1.0032 |
| mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.3472 | 1.2311 | 0.8927 | 0.963 |
| selecsls42b | 128 | 0.9789 | 0.876 | 0.3528 | 0.8765 | 0.8926 | 0.9897 |
| vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.3593 | 1.222 | 0.8877 | 0.8929 |
| deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.359 | 1.2167 | 0.8872 | 0.8923 |
| spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.3343 | 0.8371 | 0.8795 | 0.9819 |
| res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.3232 | 0.813 | 0.877 | 0.9738 |
| res2next50 | 128 | 0.9913 | 0.91 | 0.3202 | 0.8116 | 0.8719 | 0.9671 |
| mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.3349 | 0.824 | 0.871 | 0.9804 |
| mixnet_l | 128 | 0.9902 | 0.8441 | 0.2717 | nan | 0.8701 | 1.0089 |
| gernet_l | 128 | 0.9794 | 0.8503 | 0.3444 | 0.8161 | 0.8619 | 0.9858 |
| cspdarknet53 | 64 | 0.9915 | 0.8405 | 0.3241 | 0.8382 | 0.8607 | 1.0102 |
| botnet26t_256 | 128 | 0.9849 | 0.864 | 0.3308 | 0.7572 | 0.8503 | 0.9434 |
| lcnet_050 | 128 | 0.9433 | 0.7566 | 0.3359 | 0.8188 | 0.8449 | 0.9432 |
| regnety_002 | 128 | 0.9504 | 0.7948 | 0.3403 | 0.7188 | 0.8371 | 1.0078 |
| convnext_base | 64 | 1.003 | 0.9263 | 0.3509 | nan | 0.806 | 0.9865 |
| resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.2624 | 1.0262 | 0.7981 | 0.8121 |
| sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.3212 | 0.5513 | 0.745 | 0.8294 |
| coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.3514 | 1.1591 | 0.7194 | 1.0197 |
| crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.3347 | 1.2836 | 0.7141 | 0.9624 |
| jx_nest_base | 32 | 0.9983 | 0.8927 | 0.3399 | nan | 0.6644 | 0.8514 |
| swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | nan | nan | 0.6295 | 0.7419 |
| repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.3407 | 0.679 | 0.5534 | 0.8298 |
| eca_halonext26ts | 128 | 0.9886 | 0.7747 | 0.2673 | nan | nan | nan |
+---------------------------------+-----+--------+-----------+----------------+-----------------+----------+------------------------+
~~~

Absolute latency (ms)
~~~
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| name | bs | eager | aot_eager | aot_cudagraphs | nvprims_nvfuser | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+
| convmixer_768_32 | 32 | 296.486 | 296.8557 | 321.1375 | nan | 280.7942 | 282.1234 |
| tnt_s_patch16_224 | 128 | 363.6214 | 364.7147 | nan | nan | 189.0574 | 191.9649 |
| hrnet_w18 | 128 | 297.7562 | 289.8731 | 344.5862 | nan | 188.7442 | 221.1564 |
| convnext_base | 64 | 121.4143 | 121.6429 | 151.4207 | nan | 183.0963 | 187.5732 |
| pnasnet5large | 16 | 229.4869 | 221.4024 | 257.83 | nan | 168.9324 | 173.7203 |
| tf_mixnet_l | 128 | 195.1447 | 210.0817 | 240.6266 | nan | 162.2243 | 162.9431 |
| mixnet_l | 128 | 186.6718 | 201.9991 | 230.0394 | nan | 157.43 | 158.2623 |
| convit_base | 64 | 181.2822 | 181.7059 | 217.6003 | 146.3074 | 130.2669 | 137.5216 |
| pit_b_224 | 64 | 154.8196 | 155.4562 | 188.2924 | 159.2155 | 117.5385 | 118.1064 |
| cait_m36_384 | 4 | 165.9859 | 164.6621 | nan | nan | 117.3606 | 121.8195 |
| dla102 | 128
| 178.2148 | 179.1855 | 214.8749 | 135.5781 | 112.827 | 115.1884 | | poolformer_m36 | 64 | 148.8974 | 149.0187 | 183.9342 | nan | 112.0757 | 114.8132 | | beit_base_patch16_224 | 64 | 134.9152 | 137.7806 | nan | nan | 108.2304 | 109.6662 | | resnest101e | 64 | 167.9436 | 165.3773 | 199.6766 | nan | 108.0546 | 113.5857 | | adv_inception_v3 | 128 | 160.9935 | 161.5713 | 188.5554 | 140.888 | 107.1038 | 109.8873 | | inception_v3 | 128 | 160.6292 | 161.0654 | 188.1764 | 140.9007 | 107.0841 | 109.5081 | | gluon_inception_v3 | 128 | 160.974 | 161.5206 | 188.646 | 141.1272 | 107.0629 | 109.3358 | | vit_base_patch16_224 | 64 | 120.4637 | 121.1913 | 144.9533 | 132.1297 | 104.0355 | 104.985 | | swsl_resnext101_32x16d | 32 | 117.7744 | 120.0279 | 145.97 | nan | 103.9766 | 111.3971 | | res2net50_14w_8s | 128 | 145.4328 | 146.8044 | 179.7227 | 147.1104 | 99.6889 | 104.0853 | | swin_base_patch4_window7_224 | 64 | 147.0818 | 153.2668 | nan | nan | 99.4041 | 104.0568 | | res2next50 | 128 | 138.6325 | 138.6725 | 166.1529 | 121.6646 | 97.7916 | 102.4728 | | mixer_b16_224 | 128 | 118.3458 | 118.6056 | 147.521 | 131.0367 | 94.006 | 94.6202 | | dpn107 | 32 | 114.1541 | 115.883 | 142.7404 | nan | 93.7816 | 91.7772 | | gmlp_s16_224 | 128 | 136.292 | 136.5303 | 173.1511 | 134.0163 | 89.4771 | 90.6161 | | jx_nest_base | 32 | 118.8976 | 119.7334 | 149.3025 | nan | 87.3509 | 89.5918 | | dm_nfnet_f0 | 128 | 131.6929 | 131.5566 | 148.9204 | 142.1526 | 87.1907 | 91.5997 | | volo_d1_224 | 64 | 134.5478 | 134.9864 | 160.1128 | nan | 86.5848 | 88.2192 | | eca_botnext26ts_256 | 128 | 112.1036 | 135.4767 | 163.5368 | 102.4249 | 86.3948 | 86.6135 | | gluon_xception65 | 32 | 97.8576 | 98.5746 | 130.6599 | nan | 84.3137 | 86.6696 | | fbnetv3_b | 128 | 120.8277 | 122.5649 | 148.5787 | nan | 83.0267 | 84.5648 | | gmixer_24_224 | 128 | 119.7908 | 136.0844 | 166.2186 | 129.8953 | 80.2727 | 80.8411 | | visformer_small | 128 | 98.1431 | 97.9784 | 116.7458 | nan | 79.8902 | 83.4777 | | botnet26t_256 | 128 | 
106.0373 | 106.5229 | 127.725 | 81.1549 | 78.4519 | 77.8833 | | crossvit_9_240 | 128 | 109.2776 | 109.8997 | 130.2487 | 119.1501 | 78.2862 | 79.7036 | | res2net101_26w_4s | 64 | 121.7017 | 129.0133 | 126.692 | nan | 77.7325 | 95.1401 | | twins_pcpvt_base | 64 | 125.2206 | 143.6159 | 138.8939 | nan | 76.4556 | 81.9545 | | deit_base_distilled_patch16_224 | 64 | 94.1628 | 94.926 | 117.9446 | 96.3779 | 75.9289 | 76.9224 | | coat_lite_mini | 128 | 115.747 | 117.2487 | 137.6673 | 100.5884 | 72.9963 | 73.6657 | | gernet_l | 128 | 79.6333 | 80.5474 | 98.5914 | 71.0486 | 70.9857 | 70.0816 | | cspdarknet53 | 64 | 95.9161 | 96.6293 | 119.4499 | 82.8709 | 69.3552 | 68.1578 | | rexnet_100 | 128 | 90.8942 | 103.0498 | 127.0586 | nan | 68.8717 | 68.7081 | | repvgg_a2 | 128 | 79.6315 | 80.322 | 94.1804 | 70.3765 | 68.2501 | 67.1488 | | nfnet_l0 | 128 | 106.2833 | 131.0196 | 148.4146 | 124.5304 | 68.098 | 72.2292 | | sebotnet33ts_256 | 64 | 83.2625 | 96.032 | 118.2151 | 82.8439 | 66.8136 | 66.9768 | | tf_efficientnet_b0 | 128 | 90.5795 | 108.2595 | 131.1144 | 92.0013 | 65.9276 | 64.3744 | | mobilevit_s | 64 | 89.9782 | 107.4419 | 133.4305 | nan | 64.2514 | 64.3669 | | xcit_large_24_p8_224 | 5 | 128.6823 | nan | nan | nan | 62.0273 | 73.0838 | | fbnetc_100 | 128 | 87.9137 | 88.9833 | 105.5735 | 74.7723 | 61.9827 | 60.9264 | | tinynet_a | 128 | 75.7975 | 90.8109 | 110.6837 | 99.1569 | 58.0368 | 60.6362 | | spnasnet_100 | 128 | 76.555 | 77.3926 | 93.1575 | 66.2718 | 53.6136 | 54.5766 | | resmlp_12_224 | 128 | 68.1068 | 68.3201 | 87.2123 | 45.765 | 51.2553 | 52.5835 | | ese_vovnet19b_dw | 128 | 67.7937 | 68.2858 | 85.9112 | 58.4775 | 47.9989 | 47.7794 | | mnasnet_100 | 128 | 69.989 | 70.7781 | 84.6785 | 56.6127 | 47.2192 | 45.7042 | | ghostnet_100 | 128 | 95.9114 | 97.6094 | 107.551 | 101.2856 | 46.0617 | 54.3073 | | mobilenetv2_100 | 128 | 67.4917 | 68.2319 | 88.9909 | 57.3132 | 45.9347 | 44.8565 | | selecsls42b | 128 | 62.7781 | 62.9914 | 74.5785 | 48.8287 | 43.545 | 44.503 | | 
mobilenetv3_large_100 | 128 | 66.0257 | 66.6214 | 80.2942 | 68.1801 | 43.415 | 44.0017 | | regnety_002 | 128 | 57.9244 | 60.1452 | 46.9374 | 65.124 | 25.5235 | 37.5032 | | lcnet_050 | 128 | 34.0898 | 34.6161 | 38.6199 | 33.3388 | 16.3715 | 20.7263 | | eca_halonext26ts | 128 | 115.8373 | 139.2614 | 167.7038 | nan | nan | nan | +---------------------------------+-----+----------+-----------+----------------+-----------------+----------+------------------------+ ~~~

Performance graphs

Build Summary

### Run name ###
day_326_22_11_22_performance_amp_866

### Commit hashes ###
pytorch commit: 8f1ba95a426603ba43497ff1b7a2b94de311125e
pytorch commit date: 2022-11-22 19:58:01+00:00
functorch Absent
torchbench commit: 63d4037c8738908f3edfb3f7af69888378f57929
torchbench commit date: 2022-11-03 11:18:02-07:00

### TorchDynamo config flags ###
torch._dynamo.config.HAS_REFS_PRIMS = True
torch._dynamo.config.capture_scalar_outputs = False
torch._dynamo.config.dead_code_elimination = True
torch._dynamo.config.dynamic_propagation = True
torch._dynamo.config.dynamic_shapes = False
torch._dynamo.config.enforce_cond_guards_match = True
torch._dynamo.config.error_on_nested_fx_trace = True
torch._dynamo.config.fake_tensor_propagation = True
torch._dynamo.config.guard_nn_modules = False
torch._dynamo.config.normalize_ir = False
torch._dynamo.config.optimize_ddp = False
torch._dynamo.config.print_graph_breaks = False
torch._dynamo.config.raise_on_ctx_manager_usage = True
torch._dynamo.config.raise_on_unsafe_aot_autograd = False
torch._dynamo.config.replay_record_enabled = False
torch._dynamo.config.specialize_int_float = True
torch._dynamo.config.suppress_errors = False
torch._dynamo.config.verbose = False
torch._dynamo.config.verify_correctness = False

### Torch version ###
torch: 1.14.0.dev20221114+cu116

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8302
Number CUDA Devices: 1
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.314694656

williamwen42 commented 2 years ago

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats:
1) Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
2) Experiments do not cover dynamic shapes.
3) The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
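As a rough illustration of how the summary speedup could be derived, the sketch below computes a geometric mean over per-model speedups after dropping models that fail accuracy. The function name `geomean_speedup`, the input dictionaries, the set of statuses treated as passing, and the choice to skip zero or NaN speedups are all assumptions made for this example, not the dashboard's actual code.

```python
import math

# Assumption: these two statuses count as passing the accuracy check.
PASSING = {"pass", "pass_due_to_skip"}


def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups, restricted to models that pass accuracy.

    speedups: dict of model name -> eager latency divided by compiled latency
    accuracy: dict of model name -> accuracy status string
    """
    kept = [
        s
        for name, s in speedups.items()
        # Assumption: 0.0 and NaN entries mark failed runs and are skipped.
        if accuracy.get(name) in PASSING and s > 0 and not math.isnan(s)
    ]
    if not kept:
        return float("nan")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(s) for s in kept) / len(kept))
```

Averaging in log space keeps a single very large speedup from dominating the summary the way it would with an arithmetic mean.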

Passrate

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 96%, 52/54 | 98%, 41/42  | 98%, 60/61  |
|       aot_eager        | 94%, 51/54 | 95%, 40/42  | 93%, 57/61  |
|        inductor        | 81%, 44/54 | 90%, 38/42  | 90%, 55/61  |
| inductor_no_cudagraphs | 85%, 46/54 | 90%, 38/42  | 90%, 55/61  |
+------------------------+------------+-------------+-------------+

Geometric mean speedup

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   1.00x    |    1.01x    |    1.00x    |
|       aot_eager        |   1.01x    |    1.00x    |    1.00x    |
|        inductor        |   1.84x    |    1.74x    |    1.41x    |
| inductor_no_cudagraphs |   1.38x    |    1.53x    |    1.36x    |
+------------------------+------------+-------------+-------------+

Mean compilation time (seconds)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |    2.06    |    2.84     |    2.33     |
|       aot_eager        |    6.61    |    10.24    |    8.69     |
|        inductor        |   33.97    |    38.49    |    44.16    |
| inductor_no_cudagraphs |   34.21    |    33.58    |    41.73    |
+------------------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          |   0.97x    |    0.99x    |    0.99x    |
|       aot_eager        |   0.84x    |    0.89x    |    0.87x    |
|        inductor        |   0.83x    |    0.85x    |    0.94x    |
| inductor_no_cudagraphs |   0.96x    |    1.01x    |    1.05x    |
+------------------------+------------+-------------+-------------+

Summary Statistics Diff

For each relevant compiler, we compare the summary statistics for the 2 most recent reports that actually run the compiler.

Current report name: /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_717
Previous report name: /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_568

Passrate diff
~~~
+------------------------+-------------+------------+------------+
|        compiler        |    suite    | prev_value | cur_value  |
+------------------------+-------------+------------+------------+
|        inductor        | torchbench  | 85%, 46/54 | 85%, 46/54 |
|        inductor        | timm_models | 90%, 55/61 | 93%, 57/61 |
| inductor_no_cudagraphs | torchbench  | 87%, 47/54 | 87%, 47/54 |
| inductor_no_cudagraphs | timm_models | 90%, 55/61 | 93%, 57/61 |
+------------------------+-------------+------------+------------+
~~~

Geometric mean speedup diff
~~~
+------------------------+-------------+------------+-----------+
|        compiler        |    suite    | prev_value | cur_value |
+------------------------+-------------+------------+-----------+
|        inductor        | torchbench  |   1.86x    |   1.88x   |
|        inductor        | timm_models |   1.43x    |   1.43x   |
| inductor_no_cudagraphs | torchbench  |   1.39x    |   1.38x   |
| inductor_no_cudagraphs | timm_models |   1.38x    |   1.37x   |
+------------------------+-------------+------------+-----------+
~~~

Warnings

We flag models where:
- accuracy fails
- speedup < 0.95x (NOTE: 0.0 speedup typically signifies a failure in the performance test)
- compilation latency > 120 sec.
- compression ratio < 0.9

Accuracy warnings
~~~
+-------------+--------------------------------+---------------+------------------------+
|    suite    |              name              |   inductor    | inductor_no_cudagraphs |
+-------------+--------------------------------+---------------+------------------------+
| torchbench  |           hf_BigBird           |  fail_to_run  |      fail_to_run       |
| torchbench  |              moco              |  fail_to_run  |      fail_to_run       |
| torchbench  |         hf_Longformer          |  fail_to_run  |      fail_to_run       |
| torchbench  |           tacotron2            |  fail_to_run  |          pass          |
| torchbench  |        vision_maskrcnn         |  fail_to_run  |      fail_to_run       |
| torchbench  |       timm_efficientdet        |  fail_to_run  |      fail_to_run       |
| torchbench  |              dlrm              |  fail_to_run  |      fail_to_run       |
| torchbench  |      functorch_dp_cifar10      | fail_accuracy |     fail_accuracy      |
| torchbench  |       mobilenet_v3_large       | fail_accuracy |     fail_accuracy      |
| torchbench  |          tts_angular           |    0.0000     |         0.0000         |
| huggingface | MBartForConditionalGeneration  |  fail_to_run  |      fail_to_run       |
| huggingface | PLBartForConditionalGeneration |  fail_to_run  |      fail_to_run       |
| huggingface |            BigBird             |  fail_to_run  |      fail_to_run       |
| huggingface |     AllenaiLongformerBase      |  fail_to_run  |      fail_to_run       |
| timm_models |          convit_base           |  fail_to_run  |      fail_to_run       |
| timm_models |        eca_halonext26ts        |  fail_to_run  |     fail_accuracy      |
| timm_models |           fbnetv3_b            | fail_accuracy |     fail_accuracy      |
| timm_models |        gluon_xception65        | fail_accuracy |     fail_accuracy      |
| timm_models |         poolformer_m36         | fail_accuracy |     fail_accuracy      |
| timm_models |          spnasnet_100          | fail_accuracy |     fail_accuracy      |
+-------------+--------------------------------+---------------+------------------------+
~~~

Performance speedup warnings
~~~
+-------------+-----------------------+----------+------------------------+
|    suite    |         name          | inductor | inductor_no_cudagraphs |
+-------------+-----------------------+----------+------------------------+
| torchbench  |     hf_GPT2_large     |   0.0    |         1.8633         |
| torchbench  |       tacotron2       |   0.0    |         0.8824         |
| torchbench  |         dlrm          |   0.0    |          0.0           |
| torchbench  |      hf_BigBird       |   0.0    |          0.0           |
| torchbench  |     hf_Longformer     |   0.0    |          0.0           |
| torchbench  |         moco          |   0.0    |          0.0           |
| huggingface |        BigBird        |   0.0    |          0.0           |
| huggingface | AllenaiLongformerBase |   0.0    |          0.0           |
| timm_models |     convnext_base     |  0.6631  |         0.6452         |
| timm_models |   eca_halonext26ts    |   0.0    |          0.0           |
+-------------+-----------------------+----------+------------------------+
~~~

Compilation latency (sec) warnings
~~~
+-------------+-------------------+----------+------------------------+
|    suite    |       name        | inductor | inductor_no_cudagraphs |
+-------------+-------------------+----------+------------------------+
| torchbench  |      yolov3       | 404.1995 |        416.489         |
| torchbench  | timm_efficientdet | 146.2678 |        144.8974        |
| torchbench  |    hf_T5_large    | 145.3088 |        139.5987        |
| timm_models |     hrnet_w18     | 150.2292 |        136.4794        |
| timm_models | twins_pcpvt_base  | 130.834  |        129.5663        |
+-------------+-------------------+----------+------------------------+
~~~

Peak Memory Compression Ratio warnings
~~~
+-------------+----------------------------------+----------+------------------------+
|    suite    |               name               | inductor | inductor_no_cudagraphs |
+-------------+----------------------------------+----------+------------------------+
| torchbench  |        speech_transformer        |  0.8824  |         0.8866         |
| torchbench  |  timm_vision_transformer_large   |  0.879   |         1.0245         |
| torchbench  |           BERT_pytorch           |  0.8778  |         1.0948         |
| torchbench  |           timm_resnest           |  0.8759  |         0.9953         |
| torchbench  |           densenet121            |  0.8753  |         1.0051         |
| torchbench  |          squeezenet1_1           |  0.8735  |         1.0608         |
| torchbench  |             hf_Bert              |  0.8728  |         0.942          |
| torchbench  |        shufflenet_v2_x1_0        |  0.8692  |         0.9802         |
| torchbench  |             resnet50             |  0.8659  |         0.885          |
| torchbench  |           hf_T5_large            |  0.8541  |         0.8541         |
| torchbench  |          hf_DistilBert           |  0.8348  |         0.9049         |
| torchbench  |           fastNLP_Bert           |  0.8013  |         1.0681         |
| torchbench  |             alexnet              |  0.7973  |         1.0079         |
| torchbench  |             hf_Bart              |  0.7933  |         0.9724         |
| torchbench  |        mobilenet_v3_large        |  0.791   |         0.8143         |
| torchbench  |           timm_vovnet            |  0.7799  |         0.8875         |
| torchbench  |         pytorch_stargan          |  0.7783  |         0.8847         |
| torchbench  |         resnext50_32x4d          |  0.7644  |         0.7753         |
| torchbench  |              vgg16               |  0.7633  |         1.0588         |
| torchbench  |            mnasnet1_0            |  0.7541  |         0.7741         |
| torchbench  |               drq                |  0.752   |         0.9256         |
| torchbench  |        soft_actor_critic         |  0.7295  |         1.0368         |
| torchbench  |         LearningToPaint          |  0.7295  |         0.925          |
| torchbench  |     timm_vision_transformer      |  0.7133  |         0.7227         |
| torchbench  |             resnet18             |  0.6102  |         0.6257         |
| torchbench  |           hf_Reformer            |  0.5851  |         1.0014         |
| torchbench  |          lennard_jones           |  0.564   |         0.9991         |
| torchbench  |      nvidia_deeprecommender      |  0.5596  |         0.5596         |
| torchbench  |       functorch_dp_cifar10       |  0.4481  |         0.4691         |
| torchbench  |          pytorch_struct          |  0.4235  |         0.4353         |
| torchbench  |              dcgan               |  0.2123  |         0.2137         |
| torchbench  |            tacotron2             |   nan    |         0.4112         |
| huggingface | MegatronBertForQuestionAnswering |  0.893   |         1.0053         |
| huggingface |     MegatronBertForCausalLM      |  0.8919  |         1.0207         |
| huggingface |  DistilBertForQuestionAnswering  |   0.89   |         0.9848         |
| huggingface |         BertForMaskedLM          |  0.8834  |         0.9285         |
| huggingface |        RobertaForCausalLM        |  0.8828  |         0.9282         |
| huggingface |         TrOCRForCausalLM         |  0.8816  |         0.9425         |
| huggingface |  MBartForConditionalGeneration   |  0.8755  |         1.0595         |
| huggingface |   MT5ForConditionalGeneration    |  0.875   |         0.919          |
| huggingface |          OPTForCausalLM          |  0.8727  |         0.9449         |
| huggingface |  PLBartForConditionalGeneration  |  0.8523  |         0.9876         |
| huggingface |      DistilBertForMaskedLM       |  0.8215  |         0.8801         |
| huggingface |            CamemBert             |  0.8065  |         0.9306         |
| huggingface |         XGLMForCausalLM          |  0.8055  |         0.9516         |
| huggingface |           DistillGPT2            |  0.8048  |         0.9949         |
| huggingface |     Speech2Text2ForCausalLM      |  0.8039  |         0.898          |
| huggingface |        PLBartForCausalLM         |  0.7975  |         0.8675         |
| huggingface |        ElectraForCausalLM        |  0.7949  |         0.8607         |
| huggingface |         YituTechConvBert         |  0.7909  |         0.9314         |
| huggingface |    BlenderbotSmallForCausalLM    |  0.778   |         0.859          |
| huggingface |  M2M100ForConditionalGeneration  |  0.752   |         0.9892         |
| huggingface |      MobileBertForMaskedLM       |  0.5931  |         0.7994         |
| huggingface |  MobileBertForQuestionAnswering  |  0.4995  |         0.635          |
| huggingface |        DebertaForMaskedLM        |  0.409   |         1.026          |
| huggingface |   DebertaForQuestionAnswering    |  0.3071  |         1.1616         |
| timm_models |        res2net101_26w_4s         |  0.8977  |         0.973          |
| timm_models |           inception_v3           |  0.8975  |         1.0248         |
| timm_models |        gluon_inception_v3        |  0.8975  |         1.0248         |
| timm_models |         adv_inception_v3         |  0.8975  |         1.0248         |
| timm_models |         gluon_xception65         |  0.8975  |         0.9763         |
| timm_models |            fbnetc_100            |  0.8973  |         0.9876         |
| timm_models |            hrnet_w18             |  0.8969  |         1.0032         |
| timm_models |          mixer_b16_224           |  0.8927  |         0.963          |
| timm_models |           selecsls42b            |  0.8926  |         0.9897         |
| timm_models |       vit_base_patch16_224       |  0.8877  |         0.8929         |
| timm_models | deit_base_distilled_patch16_224  |  0.8872  |         0.8923         |
| timm_models |           spnasnet_100           |  0.8795  |         0.9819         |
| timm_models |         res2net50_14w_8s         |  0.877   |         0.9738         |
| timm_models |            res2next50            |  0.8719  |         0.9671         |
| timm_models |           mnasnet_100            |  0.871   |         0.9804         |
| timm_models |             mixnet_l             |  0.8701  |         1.0089         |
| timm_models |             gernet_l             |  0.8619  |         0.9858         |
| timm_models |           cspdarknet53           |  0.8607  |         1.0102         |
| timm_models |          botnet26t_256           |  0.8503  |         0.9434         |
| timm_models |            lcnet_050             |  0.8449  |         0.9432         |
| timm_models |           regnety_002            |  0.8371  |         1.0078         |
| timm_models |          convnext_base           |  0.806   |         0.9865         |
| timm_models |          resmlp_12_224           |  0.7981  |         0.8121         |
| timm_models |         sebotnet33ts_256         |  0.745   |         0.8294         |
| timm_models |          coat_lite_mini          |  0.7194  |         1.0197         |
| timm_models |          crossvit_9_240          |  0.7141  |         0.9624         |
| timm_models |           jx_nest_base           |  0.6644  |         0.8514         |
| timm_models |   swin_base_patch4_window7_224   |  0.6295  |         0.7419         |
| timm_models |            repvgg_a2             |  0.5534  |         0.8298         |
+-------------+----------------------------------+----------+------------------------+
~~~
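The four flagging rules listed at the top of the Warnings section can be sketched as a small predicate. This is only an illustration: the function name `flag_model`, its argument layout, and the statuses treated as passing are assumptions, while the three numeric thresholds are the ones stated in the text.

```python
# Thresholds as stated in the Warnings section.
MIN_SPEEDUP = 0.95
MAX_COMPILE_SEC = 120.0
MIN_COMPRESSION = 0.9


def flag_model(accuracy, speedup, compile_sec, compression):
    """Return the warning categories triggered by one model/compiler result.

    Hypothetical helper; NaN inputs compare as False in Python, so missing
    measurements are not flagged by the numeric rules in this sketch.
    """
    flags = []
    # Assumption: only these two statuses count as an accuracy pass.
    if accuracy not in ("pass", "pass_due_to_skip"):
        flags.append("accuracy")
    if speedup < MIN_SPEEDUP:
        flags.append("speedup")
    if compile_sec > MAX_COMPILE_SEC:
        flags.append("compilation_latency")
    if compression < MIN_COMPRESSION:
        flags.append("compression_ratio")
    return flags
```

A model can land in several warning tables at once, which is why the sketch returns a list and does not stop at the first rule that fires.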

Metrics over time

../test-dynamo-runner-logs-12/memory_over_time.png : ![](https://i.imgur.com/epZJ9SJ.png)

../test-dynamo-runner-logs-12/geomean_over_time.png : ![](https://i.imgur.com/MiVi497.png)

../test-dynamo-runner-logs-12/passrate_over_time.png : ![](https://i.imgur.com/5HpoRnI.png)

../test-dynamo-runner-logs-12/comp_time_over_time.png : ![](https://i.imgur.com/DTBiFkQ.png)

Recent Regressions

For each relevant compiler, we compare the most recent 2 reports (that actually run the compiler) to find previously unflagged models that are now flagged as problematic (according to the 'Warnings' section).

### Regressions for torchbench ###
Current report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_717
Previous report name (compiler: inductor, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_568
Current report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_717
Previous report name (compiler: inductor_no_cudagraphs, suite: torchbench): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_568

Compilation latency (sec) regressions
~~~
+------------------------+-------------+-------------+------------+
|        compiler        |    name     | prev_status | cur_status |
+------------------------+-------------+-------------+------------+
| inductor_no_cudagraphs | hf_T5_large |  117.7331   |  127.2357  |
+------------------------+-------------+-------------+------------+
~~~

Peak Memory Compression Ratio regressions
~~~
+------------------------+--------------------+-------------+------------+
|        compiler        |        name        | prev_status | cur_status |
+------------------------+--------------------+-------------+------------+
|        inductor        | shufflenet_v2_x1_0 |   0.9098    |   0.8692   |
| inductor_no_cudagraphs |       dcgan        |   0.9695    |   0.2137   |
+------------------------+--------------------+-------------+------------+
~~~

### Regressions for huggingface ###
Current report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_717
Previous report name (compiler: inductor, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_327_23_11_22_performance_amp_433
Current report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_717
Previous report name (compiler: inductor_no_cudagraphs, suite: huggingface): /data/home/anijain/cluster/cron_logs/day_327_23_11_22_performance_amp_433

No regressions found.

### Regressions for timm_models ###
Current report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_717
Previous report name (compiler: inductor, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_568
Current report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_717
Previous report name (compiler: inductor_no_cudagraphs, suite: timm_models): /data/home/anijain/cluster/cron_logs/day_326_22_11_22_performance_amp_568

Performance speedup regressions
~~~
+------------------------+---------------+-------------+------------+
|        compiler        |     name      | prev_status | cur_status |
+------------------------+---------------+-------------+------------+
|        inductor        | convnext_base |   1.2419    |   0.691    |
| inductor_no_cudagraphs | convnext_base |   1.2725    |   0.6793   |
+------------------------+---------------+-------------+------------+
~~~

Peak Memory Compression Ratio regressions
~~~
+------------------------+---------------+-------------+------------+
|        compiler        |     name      | prev_status | cur_status |
+------------------------+---------------+-------------+------------+
|        inductor        | convnext_base |   0.9013    |   0.8973   |
|        inductor        | cspdarknet53  |   0.9053    |   0.8607   |
| inductor_no_cudagraphs |   repvgg_a2   |   0.9914    |   0.8298   |
+------------------------+---------------+-------------+------------+
~~~
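The regression rule described in this section (a model that was not flagged in the previous report but is flagged in the current one) might be sketched as follows. The function name `find_regressions`, the dictionary inputs, and the choice to skip models missing from the previous report are assumptions for illustration; the real report generator may differ.

```python
def find_regressions(prev, cur, is_flagged):
    """Return {key: (prev_value, cur_value)} for entries newly flagged in `cur`.

    prev, cur: dicts mapping a model (or (compiler, model)) key to a metric value
    is_flagged: predicate deciding whether a single value triggers a warning
    """
    regressions = {}
    for key, cur_value in cur.items():
        if key not in prev:
            # Assumption: models absent from the previous report are skipped.
            continue
        prev_value = prev[key]
        if is_flagged(cur_value) and not is_flagged(prev_value):
            regressions[key] = (prev_value, cur_value)
    return regressions
```

For compilation latency the predicate would be `lambda sec: sec > 120`, which reports `hf_T5_large` moving from 117.7331 to 127.2357 but ignores a model that was already above the threshold in both reports.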

torchbench suite with amp precision

Performance speedup ~~~ +-----------------------------------+------+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------+------+--------+-----------+----------+------------------------+ | densenet121 | 4 | 1.0021 | 0.9269 | 6.1007 | 1.3179 | | functorch_dp_cifar10 | 64 | 1.0025 | 0.959 | 5.0593 | 0.9792 | | timm_efficientdet | 1 | 0.9846 | 0.8224 | 4.754 | 1.5319 | | resnext50_32x4d | 8 | 1.0029 | 0.9629 | 3.5498 | 1.2678 | | timm_vision_transformer | 8 | 1.0015 | 0.8456 | 3.4415 | 1.532 | | BERT_pytorch | 16 | 1.0065 | 0.8313 | 3.366 | 2.332 | | mobilenet_v3_large | 32 | 1.0033 | 1.0061 | 3.0827 | 1.3913 | | drq | 1 | 1.0088 | 0.8228 | 3.0015 | 1.1596 | | dcgan | 32 | 0.9819 | 0.9163 | 2.8668 | 1.0467 | | resnet18 | 16 | 1.0017 | 0.997 | 2.8116 | 1.2074 | | pytorch_CycleGAN_and_pix2pix | 1 | 0.9956 | 0.976 | 2.7857 | 1.5668 | | hf_T5_large | 2 | 1.0196 | 0.8562 | 2.6305 | 2.1346 | | mnasnet1_0 | 32 | 1.0 | 1.021 | 2.6232 | 1.3497 | | squeezenet1_1 | 32 | 0.9942 | 0.9626 | 2.4487 | 1.3039 | | hf_Albert | 8 | 1.0025 | 0.9621 | 2.3629 | 2.2746 | | hf_GPT2 | 4 | 1.0238 | 0.9834 | 2.128 | 1.9203 | | pytorch_struct | 200 | 0.9858 | 0.7499 | 2.1278 | 1.28 | | timm_efficientnet | 32 | 0.9617 | 0.819 | 2.1064 | 1.2819 | | hf_Bert | 4 | 1.0358 | 0.8393 | 2.0757 | 1.8356 | | lennard_jones | 1000 | 0.9695 | 0.7698 | 2.0722 | 1.0623 | | resnet152 | 32 | 1.0018 | 1.0101 | 2.0638 | 1.3011 | | timm_resnest | 32 | 1.0068 | 1.0167 | 1.9156 | 1.6651 | | hf_T5 | 8 | 0.9997 | 0.919 | 1.8668 | 1.8751 | | resnet50 | 32 | 1.0015 | 1.0246 | 1.8012 | 1.3458 | | LearningToPaint | 96 | 1.003 | 1.0147 | 1.7935 | 1.3141 | | hf_Bart | 4 | 1.0128 | 0.8329 | 1.758 | 1.8321 | | soft_actor_critic | 256 | 1.0176 | 0.7414 | 1.746 | 1.0551 | | shufflenet_v2_x1_0 | 128 | 1.0003 | 1.0223 | 1.703 | 1.4324 | | mobilenet_v2 | 96 | 1.0001 | 1.0065 | 1.5589 | 1.5181 | | speech_transformer | 32 | 
0.9559 | 0.8244 | 1.5304 | 1.5474 | | attention_is_all_you_need_pytorch | 256 | 1.0068 | 0.9027 | 1.5285 | 1.58 | | timm_nfnet | 128 | 0.9991 | 1.0 | 1.5078 | 1.4307 | | fastNLP_Bert | 6 | 0.9992 | 0.8893 | 1.5043 | 1.4513 | | hf_DistilBert | 8 | 1.0017 | 0.9746 | 1.492 | 1.4593 | | pytorch_stargan | 16 | 0.9951 | 1.0961 | 1.4619 | 1.5082 | | pytorch_unet | 1 | 0.9996 | 0.9921 | 1.3621 | 1.331 | | timm_regnet | 32 | 0.9786 | 0.9422 | 1.3385 | 1.2223 | | timm_vovnet | 32 | 0.9205 | 0.8797 | 1.2996 | 1.1491 | | vgg16 | 64 | 0.9996 | 0.9972 | 1.2708 | 1.2639 | | Background_Matting | 4 | 0.9999 | 1.0155 | 1.2373 | 1.2197 | | Super_SloMo | 6 | 0.9993 | 0.995 | 1.2277 | 1.1941 | | alexnet | 128 | 0.999 | 0.9977 | 1.2089 | 1.2102 | | hf_Reformer | 4 | 0.9987 | 1.0002 | 1.1761 | 1.1801 | | timm_vision_transformer_large | 8 | 0.9999 | 0.9903 | 1.0903 | 1.0719 | | yolov3 | 16 | 0.9997 | 0.9906 | 1.0881 | 1.0689 | | tts_angular | 64 | 0.975 | 0.9437 | 1.0167 | 1.0065 | | demucs | 4 | 1.0014 | 1.0 | 1.0017 | 1.0006 | | nvidia_deeprecommender | 256 | 0.9989 | 0.996 | 0.9892 | 1.0305 | | hf_GPT2_large | 4 | 1.0002 | 0.9907 | 0.0 | 1.8633 | | tacotron2 | 64 | 0.988 | 0.7645 | 0.0 | 0.8824 | | dlrm | 2048 | 1.01 | 1.1541 | 0.0 | 0.0 | | hf_BigBird | 2 | 0.9843 | 0.0 | 0.0 | 0.0 | | hf_Longformer | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | moco | 0 | 0.0 | 0.0 | 0.0 | 0.0 | +-----------------------------------+------+--------+-----------+----------+------------------------+ ~~~ Accuracy ~~~ +-----------------------------------+-----+------------------+------------------+------------------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +-----------------------------------+-----+------------------+------------------+------------------+------------------------+ | hf_GPT2_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | | timm_vision_transformer_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip 
| pass_due_to_skip |
| hf_T5_large | 2 | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
| soft_actor_critic | 256 | pass | pass | pass | pass |
| pytorch_CycleGAN_and_pix2pix | 1 | pass | pass | pass | pass |
| pytorch_stargan | 16 | pass | pass | pass | pass |
| pytorch_unet | 2 | pass | pass | pass | pass |
| resnet152 | 2 | pass | pass | pass | pass |
| resnet18 | 2 | pass | pass | pass | pass |
| resnet50 | 2 | pass | pass | pass | pass |
| resnext50_32x4d | 2 | pass | pass | pass | pass |
| shufflenet_v2_x1_0 | 2 | pass | pass | pass | pass |
| speech_transformer | 2 | pass | pass | pass | pass |
| mobilenet_v2 | 2 | pass | pass | pass | pass |
| squeezenet1_1 | 2 | pass | pass | pass | pass |
| timm_efficientnet | 2 | pass | pass | pass | pass |
| timm_nfnet | 2 | pass | pass | pass | pass |
| timm_regnet | 2 | pass | pass | pass | pass |
| timm_resnest | 2 | pass | pass | pass | pass |
| timm_vision_transformer | 2 | pass | pass | pass | pass |
| timm_vovnet | 2 | pass | pass | pass | pass |
| vgg16 | 2 | pass | pass | pass | pass |
| yolov3 | 2 | pass | pass | pass | pass |
| nvidia_deeprecommender | 2 | pass | pass | pass | pass |
| pytorch_struct | 200 | pass | pass | pass | pass |
| mnasnet1_0 | 2 | pass | pass | pass | pass |
| drq | 1 | pass | pass | pass | pass |
| BERT_pytorch | 2 | pass | pass | pass | pass |
| Background_Matting | 4 | pass | pass | pass | pass |
| LearningToPaint | 2 | pass | pass | pass | pass |
| Super_SloMo | 2 | pass | pass | pass | pass |
| alexnet | 2 | pass | pass | pass | pass |
| lennard_jones | 2 | pass | pass | pass | pass |
| dcgan | 2 | pass | pass | pass | pass |
| demucs | 4 | pass | pass | pass | pass |
| densenet121 | 2 | pass | pass | pass | pass |
| attention_is_all_you_need_pytorch | 2 | pass | pass | pass | pass |
| fastNLP_Bert | 2 | pass | pass | pass | pass |
| hf_Bart | 2 | pass | pass | pass | pass |
| hf_Bert | 2 | pass | pass | pass | pass |
| hf_DistilBert | 2 | pass | pass | pass | pass |
| hf_GPT2 | 2 | pass | pass | pass | pass |
| hf_Reformer | 2 | pass | pass | pass | pass |
| hf_T5 | 2 | pass | pass | pass | pass |
| hf_T5_base | 2 | pass | pass | pass | pass |
| hf_Albert | 2 | pass | pass | pass | pass |
| hf_BigBird | 2 | pass | fail_to_run | fail_to_run | fail_to_run |
| moco | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| hf_Longformer | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
| tacotron2 | 2 | pass | pass | fail_to_run | pass |
| vision_maskrcnn | 2 | pass | pass | fail_to_run | fail_to_run |
| timm_efficientdet | 2 | pass | pass | fail_to_run | fail_to_run |
| dlrm | 2 | pass | pass | fail_to_run | fail_to_run |
| functorch_dp_cifar10 | 2 | pass | pass | fail_accuracy | fail_accuracy |
| mobilenet_v3_large | 2 | pass | pass | fail_accuracy | fail_accuracy |
| tts_angular | 2 | pass | pass | 0.0000 | 0.0000 |
+-----------------------------------+-----+------------------+------------------+------------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------+------+---------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+---------+-----------+----------+------------------------+
| yolov3 | 16 | 3.1344 | 8.4555 | 404.1995 | 416.489 |
| timm_efficientdet | 1 | 20.2634 | 39.3318 | 146.2678 | 144.8974 |
| hf_T5_large | 2 | 14.8888 | 39.8919 | 145.3088 | 139.5987 |
| timm_vision_transformer_large | 8 | 3.0569 | 15.4787 | 72.4614 | 69.0113 |
| resnet152 | 32 | 2.7633 | 14.3372 | 53.5223 | 52.8041 |
| densenet121 | 4 | 2.4205 | 12.1053 | 52.0325 | 51.0502 |
| attention_is_all_you_need_pytorch | 256 | 1.4406 | 7.2285 | 40.3385 | 39.5388 |
| timm_resnest | 32 | 0.6749 | 2.5525 | 39.2066 | 38.0779 |
| speech_transformer | 32 | 2.0109 | 8.9765 | 36.4502 | 34.9328 |
| hf_Bart | 4 | 2.0912 | 9.0258 | 36.1823 | 35.5936 |
| timm_vision_transformer | 8 | 1.031 | 4.6098 | 35.9455 | 35.3652 |
| BERT_pytorch | 16 | 1.8364 | 7.6958 | 35.846 | 35.7492 |
| fastNLP_Bert | 6 | 1.9116 | 7.2674 | 33.1912 | 30.7348 |
| timm_nfnet | 128 | 2.2018 | 7.4185 | 32.7195 | 32.5884 |
| hf_T5 | 8 | 2.7481 | 9.1221 | 32.4549 | 31.0626 |
| timm_regnet | 32 | 2.4918 | 8.6327 | 28.8942 | 28.5276 |
| pytorch_stargan | 16 | 0.4649 | 2.1492 | 28.1889 | 26.0531 |
| timm_efficientnet | 32 | 1.9295 | 7.3003 | 27.4539 | 27.061 |
| mobilenet_v3_large | 32 | 1.0471 | 4.797 | 26.0092 | 25.9043 |
| hf_Bert | 4 | 1.8867 | 7.2902 | 24.4481 | 23.5085 |
| hf_Albert | 8 | 1.6492 | 6.7058 | 23.2417 | 22.1951 |
| functorch_dp_cifar10 | 64 | 0.3445 | 1.4309 | 22.6126 | 22.7953 |
| pytorch_struct | 200 | 0.2883 | 0.8641 | 22.4876 | 22.2684 |
| mnasnet1_0 | 32 | 0.9474 | 4.3783 | 21.6773 | 21.1308 |
| hf_GPT2 | 4 | 1.8656 | 6.5011 | 21.0139 | 20.0732 |
| resnet50 | 32 | 1.0144 | 4.9032 | 20.7345 | 20.4962 |
| shufflenet_v2_x1_0 | 128 | 1.1795 | 5.4288 | 20.552 | 20.3635 |
| resnext50_32x4d | 8 | 1.0804 | 4.6139 | 20.3942 | 19.7537 |
| timm_vovnet | 32 | 1.6063 | 4.5008 | 20.2442 | 19.9906 |
| mobilenet_v2 | 96 | 0.9566 | 4.9474 | 19.8935 | 19.3603 |
| Background_Matting | 4 | 0.9599 | 4.4259 | 19.0115 | 17.7799 |
| hf_Reformer | 4 | 1.6744 | 3.0553 | 18.9639 | 16.267 |
| Super_SloMo | 6 | 0.9908 | 4.0544 | 17.5075 | 16.591 |
| hf_DistilBert | 8 | 0.8338 | 3.5538 | 15.5308 | 14.8387 |
| resnet18 | 16 | 0.4733 | 1.8125 | 11.5643 | 11.5422 |
| dcgan | 32 | 0.1827 | 0.4312 | 10.388 | 9.9073 |
| pytorch_CycleGAN_and_pix2pix | 1 | 0.4759 | 2.0248 | 9.1949 | 9.0932 |
| pytorch_unet | 1 | 0.4486 | 1.9193 | 8.4807 | 8.2167 |
| LearningToPaint | 96 | 0.4988 | 1.9185 | 8.2431 | 7.8678 |
| squeezenet1_1 | 32 | 0.2749 | 0.9414 | 4.7654 | 4.5126 |
| vgg16 | 64 | 0.209 | 0.6473 | 4.2742 | 3.9492 |
| drq | 1 | 0.3217 | 0.6423 | 4.2633 | 3.6368 |
| nvidia_deeprecommender | 256 | 0.2211 | 0.5266 | 3.745 | 3.4994 |
| soft_actor_critic | 256 | 0.2103 | 0.3601 | 3.5436 | 3.0174 |
| alexnet | 128 | 0.1783 | 0.4468 | 3.325 | 3.3008 |
| lennard_jones | 1000 | 0.1589 | 0.367 | 2.3328 | 1.9799 |
| tts_angular | 64 | 0.1937 | 0.2399 | 1.9197 | 1.7273 |
| demucs | 4 | 0.3371 | 0.3585 | 0.2731 | 0.2673 |
| hf_GPT2_large | 4 | 5.7771 | 20.2502 | nan | 58.0332 |
| tacotron2 | 64 | 6.9867 | 20.1316 | nan | 45.901 |
| dlrm | 2048 | 0.4851 | 0.8588 | nan | nan |
| hf_BigBird | 2 | 4.0095 | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan |
+-----------------------------------+------+---------+-----------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------+------+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+--------+-----------+----------+------------------------+
| timm_efficientnet | 32 | 0.988 | 0.7698 | 1.2042 | 1.2318 |
| mobilenet_v2 | 96 | 0.9857 | 0.7639 | 1.0606 | 1.1512 |
| Super_SloMo | 6 | 1.0024 | 0.9645 | 1.0541 | 1.3039 |
| timm_nfnet | 128 | 0.9693 | 0.8982 | 1.0334 | 1.1302 |
| hf_Albert | 8 | 1.0001 | 0.936 | 1.0313 | 1.4693 |
| attention_is_all_you_need_pytorch | 256 | 0.9979 | 0.94 | 1.005 | 1.1086 |
| timm_efficientdet | 1 | 1.028 | 0.8414 | 0.9991 | 1.0312 |
| Background_Matting | 4 | 1.0142 | 0.9624 | 0.9916 | 1.0426 |
| tts_angular | 64 | 1.0002 | 1.0002 | 0.9895 | 1.0002 |
| demucs | 4 | 0.9872 | 0.9872 | 0.9872 | 0.9872 |
| hf_GPT2 | 4 | 0.9987 | 0.8846 | 0.9649 | 1.1241 |
| pytorch_CycleGAN_and_pix2pix | 1 | 1.0 | 0.8754 | 0.9506 | 1.0224 |
| timm_regnet | 32 | 0.9953 | 0.8446 | 0.9345 | 1.0307 |
| hf_T5 | 8 | 1.0 | 0.9331 | 0.9304 | 1.2458 |
| resnet152 | 32 | 0.9937 | 0.8956 | 0.9125 | 0.9398 |
| pytorch_unet | 1 | 0.9968 | 0.8653 | 0.9111 | 1.0853 |
| yolov3 | 16 | 0.9908 | 0.8381 | 0.9063 | 1.0466 |
| speech_transformer | 32 | 0.9991 | 0.9812 | 0.8824 | 0.8866 |
| timm_vision_transformer_large | 8 | 0.9974 | 0.8358 | 0.879 | 1.0245 |
| BERT_pytorch | 16 | 1.0003 | 0.8822 | 0.8778 | 1.0948 |
| timm_resnest | 32 | 0.9868 | 0.8711 | 0.8759 | 0.9953 |
| densenet121 | 4 | 0.9857 | 0.8678 | 0.8753 | 1.0051 |
| squeezenet1_1 | 32 | 0.9604 | 0.7958 | 0.8735 | 1.0608 |
| hf_Bert | 4 | 1.0 | 0.8759 | 0.8728 | 0.942 |
| shufflenet_v2_x1_0 | 128 | 0.956 | 0.8401 | 0.8692 | 0.9802 |
| resnet50 | 32 | 0.9907 | 0.8629 | 0.8659 | 0.885 |
| hf_T5_large | 2 | 0.8541 | 0.8541 | 0.8541 | 0.8541 |
| hf_DistilBert | 8 | 0.9993 | 0.8802 | 0.8348 | 0.9049 |
| fastNLP_Bert | 6 | 1.0012 | 0.8966 | 0.8013 | 1.0681 |
| alexnet | 128 | 0.951 | 0.7753 | 0.7973 | 1.0079 |
| hf_Bart | 4 | 1.0002 | 0.8307 | 0.7933 | 0.9724 |
| mobilenet_v3_large | 32 | 0.9776 | 0.8499 | 0.791 | 0.8143 |
| timm_vovnet | 32 | 0.9903 | 0.7678 | 0.7799 | 0.8875 |
| pytorch_stargan | 16 | 0.9929 | 0.9742 | 0.7783 | 0.8847 |
| resnext50_32x4d | 8 | 0.9932 | 0.8549 | 0.7644 | 0.7753 |
| vgg16 | 64 | 0.9924 | 0.7339 | 0.7633 | 1.0588 |
| mnasnet1_0 | 32 | 0.9785 | 0.8621 | 0.7541 | 0.7741 |
| drq | 1 | 0.9877 | 0.8312 | 0.752 | 0.9256 |
| soft_actor_critic | 256 | 0.9998 | 0.9149 | 0.7295 | 1.0368 |
| LearningToPaint | 96 | 0.9252 | 0.7196 | 0.7295 | 0.925 |
| timm_vision_transformer | 8 | 0.9952 | 0.8826 | 0.7133 | 0.7227 |
| resnet18 | 16 | 0.9779 | 0.7727 | 0.6102 | 0.6257 |
| hf_Reformer | 4 | 0.9996 | 0.9996 | 0.5851 | 1.0014 |
| lennard_jones | 1000 | 0.9995 | 0.9997 | 0.564 | 0.9991 |
| nvidia_deeprecommender | 256 | 0.5596 | 0.5596 | 0.5596 | 0.5596 |
| functorch_dp_cifar10 | 64 | 0.9964 | 0.8107 | 0.4481 | 0.4691 |
| pytorch_struct | 200 | 1.0 | 0.5081 | 0.4235 | 0.4353 |
| dcgan | 32 | 0.9698 | 0.7838 | 0.2123 | 0.2137 |
| hf_GPT2_large | 4 | 0.9956 | 0.8732 | nan | 1.1499 |
| tacotron2 | 64 | 0.9866 | 0.4045 | nan | 0.4112 |
| dlrm | 2048 | 0.7301 | 0.7306 | nan | nan |
| hf_BigBird | 2 | 0.9489 | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan |
+-----------------------------------+------+--------+-----------+----------+------------------------+
~~~

Absolute latency (ms)

~~~
+-----------------------------------+------+-----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------+------+-----------+-----------+----------+------------------------+
| timm_vision_transformer_large | 8 | 183.9264 | 185.8603 | 168.7355 | 171.7368 |
| Background_Matting | 4 | 141.7648 | 131.3625 | 107.7583 | 109.2646 |
| hf_T5 | 8 | 174.4926 | 189.4402 | 93.3232 | 92.963 |
| hf_T5_large | 2 | 218.1989 | 260.3696 | 89.1603 | 110.8583 |
| timm_nfnet | 128 | 131.8874 | 131.5784 | 87.2286 | 91.5348 |
| hf_Reformer | 4 | 82.3598 | 82.1781 | 69.8371 | 69.6854 |
| Super_SloMo | 6 | 79.0805 | 79.3464 | 64.5105 | 66.2058 |
| yolov3 | 16 | 68.667 | 69.0193 | 62.9919 | 64.2431 |
| demucs | 4 | 57.9343 | 57.1161 | 57.0935 | 57.2062 |
| timm_regnet | 32 | 73.5289 | 81.4089 | 55.1698 | 60.0619 |
| vgg16 | 64 | 66.2422 | 66.2093 | 52.002 | 52.3533 |
| resnet152 | 32 | 91.0037 | 97.7275 | 45.7896 | 73.8826 |
| speech_transformer | 32 | 65.2427 | 75.1714 | 41.5773 | 40.3135 |
| fastNLP_Bert | 6 | 55.9758 | 62.4977 | 37.2314 | 38.5491 |
| timm_efficientdet | 1 | 163.1827 | 214.6085 | 36.1618 | 110.5349 |
| attention_is_all_you_need_pytorch | 256 | 52.8984 | 59.2412 | 34.8035 | 37.186 |
| hf_Bart | 4 | 55.5883 | 67.7852 | 33.957 | 36.148 |
| mobilenet_v2 | 96 | 48.8565 | 49.4278 | 31.3401 | 32.1664 |
| hf_Albert | 8 | 68.2827 | 72.0985 | 29.3207 | 29.982 |
| pytorch_unet | 1 | 39.9271 | 40.1581 | 29.3201 | 29.9666 |
| hf_GPT2 | 4 | 52.4292 | 49.6814 | 25.4594 | 25.8753 |
| timm_vovnet | 32 | 34.752 | 38.1958 | 24.8979 | 28.7185 |
| shufflenet_v2_x1_0 | 128 | 42.876 | 42.1499 | 24.2456 | 29.1135 |
| timm_efficientnet | 32 | 48.7395 | 61.4532 | 22.4523 | 37.767 |
| hf_Bert | 4 | 40.6596 | 58.173 | 21.2743 | 23.4495 |
| hf_DistilBert | 8 | 30.9806 | 31.8895 | 20.8157 | 21.2662 |
| resnet50 | 32 | 33.7115 | 35.1441 | 19.3801 | 27.46 |
| BERT_pytorch | 16 | 55.6925 | 66.4554 | 16.8192 | 24.9592 |
| timm_resnest | 32 | 25.0597 | 24.8603 | 12.8525 | 15.7415 |
| densenet121 | 4 | 72.9717 | 81.5106 | 12.6771 | 59.61 |
| mobilenet_v3_large | 32 | 34.9903 | 34.941 | 11.9799 | 26.5817 |
| mnasnet1_0 | 32 | 28.9991 | 28.4173 | 11.4931 | 22.367 |
| pytorch_stargan | 16 | 16.102 | 15.896 | 10.9192 | 11.5913 |
| nvidia_deeprecommender | 256 | 10.3666 | 10.4037 | 10.4632 | 10.05 |
| timm_vision_transformer | 8 | 33.921 | 34.6595 | 9.9535 | 20.4789 |
| resnext50_32x4d | 8 | 33.0804 | 30.4899 | 8.4924 | 23.3554 |
| LearningToPaint | 96 | 15.4426 | 14.8511 | 8.4605 | 11.4183 |
| alexnet | 128 | 9.7884 | 9.8124 | 8.0901 | 8.1139 |
| tts_angular | 64 | 6.9398 | 6.5844 | 6.7018 | 7.2245 |
| pytorch_CycleGAN_and_pix2pix | 1 | 18.098 | 18.5743 | 6.6768 | 11.9156 |
| squeezenet1_1 | 32 | 15.1004 | 15.4611 | 6.215 | 11.7538 |
| resnet18 | 16 | 12.9642 | 13.1091 | 4.7231 | 11.7877 |
| functorch_dp_cifar10 | 64 | 14.21 | 15.0108 | 2.9591 | 15.0933 |
| pytorch_struct | 200 | 4.6757 | 6.1055 | 2.277 | 3.7498 |
| drq | 1 | 3.8879 | 4.7963 | 1.3729 | 3.6489 |
| dcgan | 32 | 3.1322 | 3.4376 | 1.107 | 2.9838 |
| soft_actor_critic | 256 | 1.3741 | 1.8739 | 0.8557 | 1.4163 |
| lennard_jones | 1000 | 1.4503 | 2.1461 | 0.749 | 1.4673 |
| tacotron2 | 64 | 3526.5577 | 4226.6164 | nan | 3532.7074 |
| hf_GPT2_large | 4 | 209.2206 | 211.7662 | nan | 112.3685 |
| dlrm | 2048 | 501.5169 | 490.557 | nan | nan |
| hf_BigBird | 2 | 195.5097 | nan | nan | nan |
| hf_Longformer | 0 | nan | nan | nan | nan |
| moco | 0 | nan | nan | nan | nan |
+-----------------------------------+------+-----------+-----------+----------+------------------------+
~~~
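
The summary states that speedup is measured by normalizing against native PyTorch and that a geometric mean is reported per suite. A minimal sketch of that arithmetic follows, using a few rows from the absolute latency table. It assumes the `eager` column can stand in for the native baseline and that models with `nan` latency are dropped before averaging; the helper names `speedups` and `geometric_mean` are hypothetical, and the dashboard's own speedup numbers may come from separate measurements.

```python
import math

# Absolute latency (ms) rows copied from the torchbench table.
# nan marks a backend that failed to run for that model.
latency_ms = {
    "hf_T5": {"eager": 174.4926, "inductor": 93.3232},
    "resnet50": {"eager": 33.7115, "inductor": 19.3801},
    "vgg16": {"eager": 66.2422, "inductor": 52.002},
    "hf_GPT2_large": {"eager": 209.2206, "inductor": float("nan")},
}


def speedups(rows, backend, baseline="eager"):
    """Return baseline/backend latency ratios, skipping models without data."""
    out = {}
    for name, row in rows.items():
        base, measured = row[baseline], row[backend]
        # Assumption: failed models (nan) are excluded, as the summary describes.
        if math.isnan(base) or math.isnan(measured) or measured <= 0:
            continue
        out[name] = base / measured
    return out


def geometric_mean(values):
    """Geometric mean via the mean of logs; nan for an empty input."""
    values = list(values)
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


inductor_speedups = speedups(latency_ms, "inductor")
gmean = geometric_mean(inductor_speedups.values())

for name, ratio in inductor_speedups.items():
    print(f"{name}: {ratio:.2f}x")
print(f"geometric mean: {gmean:.2f}x")
```

The geometric mean is used rather than the arithmetic mean because speedups are ratios: a 2x gain and a 0.5x regression average out to 1x.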

huggingface suite with amp precision

Performance speedup

~~~
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| YituTechConvBert | 1 | 1.0223 | 0.8377 | 4.8405 | 1.6583 |
| MobileBertForMaskedLM | 32 | 1.0172 | 0.8422 | 4.1581 | 1.8028 |
| CamemBert | 1 | 1.0447 | 0.8521 | 3.7763 | 1.7973 |
| MobileBertForQuestionAnswering | 64 | 1.0168 | 0.8377 | 3.6592 | 1.7789 |
| MT5ForConditionalGeneration | 8 | 1.0153 | 0.8552 | 3.4685 | 2.5255 |
| DistillGPT2 | 1 | 1.0365 | 0.8788 | 2.704 | 2.0011 |
| GPT2ForSequenceClassification | 4 | 1.0029 | 0.9693 | 2.3192 | 2.2924 |
| M2M100ForConditionalGeneration | 8 | 1.0065 | 0.9218 | 2.2067 | 1.7105 |
| ElectraForQuestionAnswering | 64 | 1.0004 | 0.9797 | 2.0342 | 1.9779 |
| MegatronBertForQuestionAnswering | 16 | 1.0356 | 0.8521 | 1.95 | 1.8031 |
| PLBartForConditionalGeneration | 16 | 1.0125 | 0.8352 | 1.8827 | 1.6882 |
| MegatronBertForCausalLM | 16 | 1.0334 | 0.8527 | 1.8022 | 1.7497 |
| LayoutLMForSequenceClassification | 16 | 1.0001 | 0.9803 | 1.7954 | 1.7491 |
| ElectraForCausalLM | 32 | 0.9998 | 0.9298 | 1.7505 | 1.7562 |
| XGLMForCausalLM | 8 | 1.0122 | 0.8251 | 1.7391 | 1.7801 |
| T5Small | 1 | 1.0264 | 0.9043 | 1.7388 | 1.5015 |
| AlbertForQuestionAnswering | 4 | 0.9999 | 0.8859 | 1.6477 | 1.6393 |
| AlbertForMaskedLM | 4 | 1.0002 | 0.885 | 1.6361 | 1.6283 |
| MBartForConditionalGeneration | 16 | 1.0151 | 0.8351 | 1.6334 | 1.5862 |
| PegasusForConditionalGeneration | 16 | 1.0127 | 0.8279 | 1.6253 | 1.529 |
| LayoutLMForMaskedLM | 16 | 1.0008 | 0.9707 | 1.606 | 1.5814 |
| T5ForConditionalGeneration | 4 | 1.0079 | 0.9015 | 1.6022 | 1.5676 |
| OPTForCausalLM | 32 | 1.0068 | 0.9306 | 1.5325 | 1.5097 |
| Speech2Text2ForCausalLM | 128 | 1.0069 | 0.9343 | 1.4927 | 1.4985 |
| RobertaForQuestionAnswering | 128 | 1.0003 | 0.9849 | 1.4461 | 1.4066 |
| DistilBertForQuestionAnswering | 64 | 1.0007 | 0.9477 | 1.442 | 1.3996 |
| BertForQuestionAnswering | 128 | 1.0 | 0.9745 | 1.4387 | 1.4119 |
| BartForConditionalGeneration | 2 | 1.0045 | 0.9697 | 1.4202 | 1.3891 |
| BartForCausalLM | 4 | 1.0011 | 0.9698 | 1.4151 | 1.4143 |
| RobertaForCausalLM | 64 | 1.0004 | 0.9603 | 1.4004 | 1.3807 |
| BlenderbotSmallForConditionalGeneration | 64 | 1.0076 | 0.8829 | 1.379 | 1.3854 |
| DebertaForMaskedLM | 4 | 0.9208 | 0.7366 | 1.2999 | 1.1375 |
| BertForMaskedLM | 64 | 1.0005 | 0.9564 | 1.2988 | 1.2848 |
| PLBartForCausalLM | 32 | 1.0067 | 0.9416 | 1.2218 | 1.2467 |
| BlenderbotSmallForCausalLM | 64 | 1.0018 | 0.9261 | 1.2135 | 1.2264 |
| DistilBertForMaskedLM | 64 | 1.0002 | 0.9392 | 1.2126 | 1.2118 |
| MBartForCausalLM | 32 | 1.0036 | 0.9427 | 1.1666 | 1.1628 |
| TrOCRForCausalLM | 32 | 1.0017 | 0.9485 | 1.1621 | 1.1628 |
| DebertaForQuestionAnswering | 8 | 0.9861 | 0.8674 | 1.1368 | 1.211 |
| PegasusForCausalLM | 32 | 0.9991 | 0.9505 | 1.1354 | 1.1366 |
| BigBird | 1 | 0.978 | 0.0 | 0.0 | 0.0 |
| AllenaiLongformerBase | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Accuracy

~~~
+-----------------------------------------+----+-------------+-------------+-------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+----+-------------+-------------+-------------+------------------------+
| AlbertForMaskedLM | 1 | pass | pass | pass | pass |
| MBartForCausalLM | 1 | pass | pass | pass | pass |
| MegatronBertForCausalLM | 1 | pass | pass | pass | pass |
| MegatronBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| MobileBertForMaskedLM | 1 | pass | pass | pass | pass |
| MobileBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| OPTForCausalLM | 1 | pass | pass | pass | pass |
| PLBartForCausalLM | 1 | pass | pass | pass | pass |
| PegasusForCausalLM | 1 | pass | pass | pass | pass |
| PegasusForConditionalGeneration | 1 | pass | pass | pass | pass |
| RobertaForCausalLM | 1 | pass | pass | pass | pass |
| RobertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| Speech2Text2ForCausalLM | 1 | pass | pass | pass | pass |
| T5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| T5Small | 1 | pass | pass | pass | pass |
| TrOCRForCausalLM | 1 | pass | pass | pass | pass |
| XGLMForCausalLM | 1 | pass | pass | pass | pass |
| XLNetLMHeadModel | 1 | pass | pass | pass | pass |
| YituTechConvBert | 1 | pass | pass | pass | pass |
| AlbertForQuestionAnswering | 1 | pass | pass | pass | pass |
| MT5ForConditionalGeneration | 1 | pass | pass | pass | pass |
| M2M100ForConditionalGeneration | 1 | pass | pass | pass | pass |
| DebertaForMaskedLM | 1 | pass | pass | pass | pass |
| BartForCausalLM | 1 | pass | pass | pass | pass |
| BartForConditionalGeneration | 1 | pass | pass | pass | pass |
| BertForMaskedLM | 1 | pass | pass | pass | pass |
| BertForQuestionAnswering | 1 | pass | pass | pass | pass |
| BlenderbotSmallForCausalLM | 1 | pass | pass | pass | pass |
| BlenderbotSmallForConditionalGeneration | 1 | pass | pass | pass | pass |
| LayoutLMForSequenceClassification | 1 | pass | pass | pass | pass |
| CamemBert | 1 | pass | pass | pass | pass |
| DebertaForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistilBertForMaskedLM | 1 | pass | pass | pass | pass |
| DistilBertForQuestionAnswering | 1 | pass | pass | pass | pass |
| DistillGPT2 | 1 | pass | pass | pass | pass |
| ElectraForCausalLM | 1 | pass | pass | pass | pass |
| ElectraForQuestionAnswering | 1 | pass | pass | pass | pass |
| GPT2ForSequenceClassification | 1 | pass | pass | pass | pass |
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run |
| PLBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run |
| BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+-----------------------------------------+----+-------------+-------------+-------------+------------------------+
~~~

Compilation latency (sec)

~~~
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
| DebertaForMaskedLM | 4 | 5.3801 | 11.2863 | 105.4539 | 39.865 |
| DebertaForQuestionAnswering | 8 | 5.2502 | 11.0802 | 103.2721 | 39.8822 |
| MobileBertForMaskedLM | 32 | 10.1116 | 35.1002 | 84.8629 | 81.364 |
| MobileBertForQuestionAnswering | 64 | 10.3878 | 35.1481 | 82.8535 | 79.2318 |
| XGLMForCausalLM | 8 | 3.179 | 13.6261 | 81.5125 | 79.9322 |
| M2M100ForConditionalGeneration | 8 | 4.2771 | 15.8666 | 74.5191 | 70.9121 |
| MBartForConditionalGeneration | 16 | 4.0745 | 17.4895 | 60.9653 | 59.3313 |
| PegasusForConditionalGeneration | 16 | 3.8294 | 17.1739 | 60.6207 | 55.9846 |
| BartForConditionalGeneration | 2 | 4.0219 | 17.3592 | 59.9891 | 57.8295 |
| YituTechConvBert | 1 | 2.8404 | 11.0009 | 52.7953 | 48.6038 |
| MegatronBertForCausalLM | 16 | 4.1548 | 14.6527 | 48.7612 | 46.4789 |
| MegatronBertForQuestionAnswering | 16 | 3.9894 | 14.5425 | 47.1127 | 45.9904 |
| MT5ForConditionalGeneration | 8 | 4.0593 | 13.2671 | 44.9256 | 42.6721 |
| BlenderbotSmallForConditionalGeneration | 64 | 2.4997 | 11.5858 | 40.9287 | 39.0688 |
| T5Small | 1 | 2.6591 | 9.1279 | 33.6601 | 32.7428 |
| T5ForConditionalGeneration | 4 | 2.6646 | 9.0338 | 33.5673 | 32.4119 |
| PLBartForConditionalGeneration | 16 | 2.1074 | 8.7542 | 33.5127 | 33.565 |
| LayoutLMForSequenceClassification | 16 | 2.3135 | 7.7566 | 31.3464 | 29.3416 |
| ElectraForCausalLM | 32 | 2.0451 | 7.4345 | 30.7162 | 28.5186 |
| PegasusForCausalLM | 32 | 1.5579 | 6.588 | 26.5697 | 24.9646 |
| LayoutLMForMaskedLM | 16 | 2.4908 | 7.7934 | 26.472 | 24.7561 |
| MBartForCausalLM | 32 | 1.4904 | 6.6255 | 25.2115 | 23.8657 |
| RobertaForCausalLM | 64 | 1.8812 | 7.3097 | 24.9204 | 24.3817 |
| BertForMaskedLM | 64 | 1.8909 | 7.1967 | 24.4951 | 23.6636 |
| ElectraForQuestionAnswering | 64 | 2.001 | 7.3002 | 24.4523 | 23.0111 |
| OPTForCausalLM | 32 | 1.5718 | 7.2784 | 24.0511 | 22.5163 |
| TrOCRForCausalLM | 32 | 1.4793 | 6.6125 | 23.9797 | 23.0131 |
| BartForCausalLM | 4 | 1.5506 | 6.6132 | 23.7612 | 22.66 |
| BertForQuestionAnswering | 128 | 1.8734 | 7.2512 | 23.5406 | 22.859 |
| RobertaForQuestionAnswering | 128 | 1.9098 | 7.1241 | 22.7278 | 21.4792 |
| CamemBert | 1 | 1.9359 | 7.5018 | 21.8414 | 20.8427 |
| AlbertForMaskedLM | 4 | 1.7175 | 7.3031 | 21.1479 | 20.3657 |
| AlbertForQuestionAnswering | 4 | 1.8347 | 7.0632 | 20.5988 | 19.4607 |
| GPT2ForSequenceClassification | 4 | 1.8037 | 6.5065 | 19.9998 | 19.5073 |
| BlenderbotSmallForCausalLM | 64 | 1.0406 | 4.4795 | 17.6594 | 16.8046 |
| Speech2Text2ForCausalLM | 128 | 0.9075 | 3.4969 | 16.2453 | 14.7561 |
| PLBartForCausalLM | 32 | 0.8422 | 3.4917 | 15.1231 | 15.0981 |
| DistilBertForMaskedLM | 64 | 0.8394 | 3.5858 | 14.807 | 14.1055 |
| DistilBertForQuestionAnswering | 64 | 0.8397 | 3.7957 | 14.2802 | 13.5993 |
| DistillGPT2 | 1 | 0.9719 | 3.3796 | 14.0105 | 13.6432 |
| BigBird | 1 | 4.0268 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan |
+-----------------------------------------+-----+---------+-----------+----------+------------------------+
~~~

Peak Memory Compression Ratio

~~~
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
| GPT2ForSequenceClassification | 4 | 1.0001 | 0.9162 | 1.0783 | 1.1717 |
| AlbertForQuestionAnswering | 4 | 1.0 | 0.754 | 1.0323 | 1.5286 |
| BartForCausalLM | 4 | 1.0 | 0.8997 | 1.0218 | 1.0756 |
| AlbertForMaskedLM | 4 | 1.0 | 0.7431 | 1.0074 | 1.5007 |
| LayoutLMForSequenceClassification | 16 | 1.004 | 0.9325 | 0.9844 | 1.025 |
| BertForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.9837 | 1.0483 |
| RobertaForQuestionAnswering | 128 | 1.0008 | 0.952 | 0.9837 | 1.0483 |
| ElectraForQuestionAnswering | 64 | 1.0016 | 0.9538 | 0.9829 | 1.0613 |
| BartForConditionalGeneration | 2 | 1.0 | 0.9073 | 0.9691 | 1.1807 |
| T5ForConditionalGeneration | 4 | 0.9998 | 0.9527 | 0.9658 | 1.1446 |
| T5Small | 1 | 1.0 | 0.8935 | 0.9652 | 1.1096 |
| PegasusForCausalLM | 32 | 0.9749 | 0.9114 | 0.9327 | 0.9847 |
| PegasusForConditionalGeneration | 16 | 0.9985 | 0.9635 | 0.9159 | 1.0769 |
| LayoutLMForMaskedLM | 16 | 1.0 | 0.9238 | 0.9124 | 0.9464 |
| BlenderbotSmallForConditionalGeneration | 64 | 0.9999 | 0.8918 | 0.9037 | 1.0411 |
| MBartForCausalLM | 32 | 1.0 | 0.8924 | 0.9006 | 0.9641 |
| MegatronBertForQuestionAnswering | 16 | 1.0 | 0.8529 | 0.893 | 1.0053 |
| MegatronBertForCausalLM | 16 | 1.0001 | 0.8597 | 0.8919 | 1.0207 |
| DistilBertForQuestionAnswering | 64 | 1.0004 | 0.9216 | 0.89 | 0.9848 |
| BertForMaskedLM | 64 | 0.9996 | 0.899 | 0.8834 | 0.9285 |
| RobertaForCausalLM | 64 | 0.9999 | 0.8994 | 0.8828 | 0.9282 |
| TrOCRForCausalLM | 32 | 1.0 | 0.8921 | 0.8816 | 0.9425 |
| MBartForConditionalGeneration | 16 | 1.0 | 0.8555 | 0.8755 | 1.0595 |
| MT5ForConditionalGeneration | 8 | 0.919 | 0.83 | 0.875 | 0.919 |
| OPTForCausalLM | 32 | 1.0003 | 0.8678 | 0.8727 | 0.9449 |
| PLBartForConditionalGeneration | 16 | 0.9983 | 0.9 | 0.8523 | 0.9876 |
| DistilBertForMaskedLM | 64 | 1.0 | 0.86 | 0.8215 | 0.8801 |
| CamemBert | 1 | 0.999 | 0.8143 | 0.8065 | 0.9306 |
| XGLMForCausalLM | 8 | 0.9918 | 0.9234 | 0.8055 | 0.9516 |
| DistillGPT2 | 1 | 0.9975 | 0.8033 | 0.8048 | 0.9949 |
| Speech2Text2ForCausalLM | 128 | 0.9676 | 0.8427 | 0.8039 | 0.898 |
| PLBartForCausalLM | 32 | 1.0003 | 0.8444 | 0.7975 | 0.8675 |
| ElectraForCausalLM | 32 | 0.9977 | 0.848 | 0.7949 | 0.8607 |
| YituTechConvBert | 1 | 0.9718 | 0.8664 | 0.7909 | 0.9314 |
| BlenderbotSmallForCausalLM | 64 | 0.9998 | 0.8172 | 0.778 | 0.859 |
| M2M100ForConditionalGeneration | 8 | 0.9892 | 0.9674 | 0.752 | 0.9892 |
| MobileBertForMaskedLM | 32 | 0.9998 | 0.8864 | 0.5931 | 0.7994 |
| MobileBertForQuestionAnswering | 64 | 1.0153 | 0.9965 | 0.4995 | 0.635 |
| DebertaForMaskedLM | 4 | 0.9982 | 0.9825 | 0.409 | 1.026 |
| DebertaForQuestionAnswering | 8 | 0.9543 | 1.0481 | 0.3071 | 1.1616 |
| BigBird | 1 | 0.9748 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan |
+-----------------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Absolute latency (ms)

~~~
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
| AlbertForMaskedLM | 4 | 266.4648 | 301.2449 | 163.2613 | 163.9857 |
| AlbertForQuestionAnswering | 4 | 264.314 | 298.5267 | 160.8391 | 161.5659 |
| BartForConditionalGeneration | 2 | 135.7444 | 140.5032 | 95.6537 | 97.8556 |
| BlenderbotSmallForConditionalGeneration | 64 | 109.2364 | 127.0885 | 79.9387 | 79.588 |
| BartForCausalLM | 4 | 111.9369 | 115.5414 | 79.15 | 79.0943 |
| BertForQuestionAnswering | 128 | 110.4708 | 113.2358 | 76.9385 | 78.3261 |
| RobertaForQuestionAnswering | 128 | 110.9423 | 112.6007 | 76.8231 | 78.8463 |
| LayoutLMForMaskedLM | 16 | 111.9368 | 115.4 | 70.2275 | 70.8414 |
| MBartForConditionalGeneration | 16 | 103.2824 | 126.8209 | 66.9643 | 70.8351 |
| PegasusForConditionalGeneration | 16 | 104.1201 | 126.843 | 66.8 | 72.9854 |
| DebertaForQuestionAnswering | 8 | 76.1169 | 86.5159 | 66.1531 | 61.7785 |
| T5ForConditionalGeneration | 4 | 100.9954 | 112.8121 | 63.5187 | 64.378 |
| PegasusForCausalLM | 32 | 68.7242 | 72.7706 | 60.5768 | 60.3738 |
| MBartForCausalLM | 32 | 69.6191 | 74.0819 | 59.9933 | 59.9371 |
| TrOCRForCausalLM | 32 | 69.6037 | 75.1835 | 59.9421 | 59.9351 |
| BertForMaskedLM | 64 | 75.4725 | 78.9032 | 58.1885 | 58.7378 |
| RobertaForCausalLM | 64 | 80.2354 | 83.6752 | 57.4648 | 58.2262 |
| ElectraForQuestionAnswering | 64 | 114.7386 | 116.8161 | 56.3347 | 57.8761 |
| LayoutLMForSequenceClassification | 16 | 97.1061 | 99.1705 | 54.1191 | 55.5783 |
| MobileBertForQuestionAnswering | 64 | 190.5361 | 246.6218 | 53.3437 | 105.1289 |
| XGLMForCausalLM | 8 | 87.3977 | 107.6528 | 52.8369 | 63.9088 |
| M2M100ForConditionalGeneration | 8 | 124.6816 | 120.6299 | 50.6169 | 76.6523 |
| DebertaForMaskedLM | 4 | 75.1184 | 97.4156 | 50.3674 | 56.7563 |
| ElectraForCausalLM | 32 | 87.5247 | 93.7665 | 49.8239 | 49.7113 |
| BlenderbotSmallForCausalLM | 64 | 58.6216 | 63.6498 | 48.3584 | 48.0312 |
| MegatronBertForCausalLM | 16 | 87.7817 | 96.1121 | 47.167 | 57.5141 |
| MobileBertForMaskedLM | 32 | 214.0348 | 241.6149 | 43.5724 | 101.4571 |
| MegatronBertForQuestionAnswering | 16 | 79.9413 | 97.1358 | 43.4894 | 47.403 |
| GPT2ForSequenceClassification | 4 | 91.9111 | 93.5004 | 39.0465 | 39.8145 |
| T5Small | 1 | 63.1919 | 73.9268 | 38.9087 | 48.7533 |
| DistilBertForMaskedLM | 64 | 45.0861 | 48.1106 | 37.2482 | 37.3007 |
| OPTForCausalLM | 32 | 53.6738 | 58.4399 | 35.5267 | 35.821 |
| PLBartForCausalLM | 32 | 39.0895 | 41.7897 | 31.6408 | 31.7126 |
| PLBartForConditionalGeneration | 16 | 55.6642 | 66.8187 | 30.5678 | 34.4809 |
| MT5ForConditionalGeneration | 8 | 104.1116 | 122.9308 | 26.588 | 37.1221 |
| DistilBertForQuestionAnswering | 64 | 30.5677 | 33.1067 | 21.0854 | 21.7901 |
| Speech2Text2ForCausalLM | 128 | 30.3003 | 32.4641 | 20.5193 | 20.4287 |
| YituTechConvBert | 1 | 62.0851 | 74.0989 | 13.8072 | 39.9998 |
| CamemBert | 1 | 37.0307 | 46.364 | 11.158 | 22.7364 |
| DistillGPT2 | 1 | 20.2655 | 23.6782 | 8.0009 | 10.8269 |
| BigBird | 1 | 192.3145 | nan | nan | nan |
| AllenaiLongformerBase | 0 | nan | nan | nan | nan |
+-----------------------------------------+-----+----------+-----------+----------+------------------------+
~~~
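
The passrate figures in the summary (for example "93%, 40/43") can be derived from an accuracy table like the one in this suite. The sketch below parses a few accuracy rows and counts passing entries per compiler. The helpers `parse_rows` and `passrate` are hypothetical, and treating `pass_due_to_skip` as a pass is an assumption rather than something the dashboard states.

```python
# A few rows copied from the huggingface accuracy table.
table_text = """
+------+----+-------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+------+----+-------+-----------+----------+------------------------+
| LayoutLMForMaskedLM | 1 | pass | pass | pass | pass |
| MBartForConditionalGeneration | 1 | pass | pass | fail_to_run | fail_to_run |
| BigBird | 1 | pass | fail_to_run | fail_to_run | fail_to_run |
| AllenaiLongformerBase | 1 | fail_to_run | fail_to_run | fail_to_run | fail_to_run |
+------+----+-------+-----------+----------+------------------------+
"""


def parse_rows(text):
    """Turn '| a | b |' lines into dicts keyed by the header row."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders (+---+), fences and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def passrate(rows, compiler, passing=("pass", "pass_due_to_skip")):
    """Return (passed, total) for one compiler column.

    Assumption: pass_due_to_skip counts as passing.
    """
    passed = sum(1 for row in rows if row[compiler] in passing)
    return passed, len(rows)


rows = parse_rows(table_text)
for compiler in ("eager", "aot_eager", "inductor", "inductor_no_cudagraphs"):
    passed, total = passrate(rows, compiler)
    print(f"{compiler}: {100 * passed / total:.0f}%, {passed}/{total}")
```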

timm_models suite with amp precision

Performance speedup

~~~
+---------------------------------+-----+--------+-----------+----------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+-----+--------+-----------+----------+------------------------+
| regnety_002 | 128 | 0.9781 | 0.9404 | 2.1425 | 1.4351 |
| ghostnet_100 | 128 | 1.0033 | 0.9796 | 2.1277 | 1.7897 |
| xcit_large_24_p8_224 | 5 | 1.0008 | 0.0 | 2.1168 | 1.8655 |
| lcnet_050 | 128 | 0.9658 | 0.947 | 2.0285 | 1.6218 |
| tnt_s_patch16_224 | 128 | 0.9999 | 0.9969 | 1.9232 | 1.8934 |
| twins_pcpvt_base | 64 | 1.0062 | 0.93 | 1.756 | 1.64 |
| hrnet_w18 | 128 | 1.0034 | 1.0277 | 1.6901 | 1.4398 |
| res2net101_26w_4s | 64 | 1.0038 | 1.0123 | 1.6128 | 1.3283 |
| coat_lite_mini | 128 | 1.0 | 0.9885 | 1.5891 | 1.5719 |
| dla102 | 128 | 1.0 | 0.9958 | 1.5816 | 1.5483 |
| nfnet_l0 | 128 | 0.999 | 0.8101 | 1.558 | 1.4681 |
| volo_d1_224 | 64 | 0.9999 | 0.9938 | 1.5526 | 1.5209 |
| resnest101e | 64 | 1.0036 | 0.991 | 1.5479 | 1.5026 |
| gmlp_s16_224 | 128 | 0.9999 | 0.9956 | 1.5229 | 1.5014 |
| gluon_inception_v3 | 128 | 1.0 | 0.9962 | 1.5057 | 1.4717 |
| adv_inception_v3 | 128 | 0.9999 | 0.9964 | 1.5034 | 1.464 |
| inception_v3 | 128 | 0.9998 | 0.9965 | 1.5005 | 1.4662 |
| dm_nfnet_f0 | 128 | 0.9984 | 0.9993 | 1.5002 | 1.4296 |
| gmixer_24_224 | 128 | 0.9999 | 0.8807 | 1.4936 | 1.4814 |
| res2net50_14w_8s | 128 | 1.0001 | 0.9927 | 1.4852 | 1.4124 |
| swin_base_patch4_window7_224 | 64 | 0.9998 | 0.9588 | 1.4813 | 1.4135 |
| mobilenetv3_large_100 | 128 | 0.9531 | 0.9449 | 1.4485 | 1.4297 |
| selecsls42b | 128 | 0.9999 | 0.9956 | 1.443 | 1.4108 |
| res2next50 | 128 | 0.9994 | 0.9953 | 1.4175 | 1.3462 |
| mnasnet_100 | 128 | 0.9535 | 0.9431 | 1.416 | 1.4608 |
| cait_m36_384 | 4 | 1.0005 | 1.0096 | 1.4152 | 1.3657 |
| fbnetv3_b | 128 | 0.9526 | 0.9397 | 1.4041 | 1.3937 |
| mobilenetv2_100 | 128 | 0.951 | 0.9421 | 1.4007 | 1.4335 |
| crossvit_9_240 | 128 | 1.0001 | 0.9942 | 1.3954 | 1.3682 |
| convit_base | 64 | 1.0 | 0.9968 | 1.3906 | 1.3175 |
| ese_vovnet19b_dw | 128 | 0.9704 | 0.9642 | 1.3718 | 1.3793 |
| mobilevit_s | 64 | 0.9732 | 0.8144 | 1.3608 | 1.3593 |
| jx_nest_base | 32 | 1.0 | 0.9925 | 1.3602 | 1.3268 |
| fbnetc_100 | 128 | 0.9523 | 0.9398 | 1.3521 | 1.3732 |
| spnasnet_100 | 128 | 0.9461 | 0.936 | 1.3507 | 1.3272 |
| resmlp_12_224 | 128 | 1.0 | 0.9986 | 1.3303 | 1.2978 |
| poolformer_m36 | 64 | 0.9998 | 0.9983 | 1.326 | 1.2952 |
| tf_efficientnet_b0 | 128 | 0.9652 | 0.8074 | 1.3246 | 1.3554 |
| botnet26t_256 | 128 | 0.9783 | 0.9733 | 1.3236 | 1.3302 |
| pit_b_224 | 64 | 0.9998 | 0.9953 | 1.3156 | 1.3091 |
| pnasnet5large | 16 | 1.0051 | 1.0406 | 1.3115 | 1.2719 |
| cspdarknet53 | 64 | 0.9431 | 0.9343 | 1.3027 | 1.3242 |
| rexnet_100 | 128 | 0.9656 | 0.8497 | 1.2723 | 1.2774 |
| tinynet_a | 128 | 0.9723 | 0.8029 | 1.2714 | 1.3288 |
| eca_botnext26ts_256 | 128 | 0.9801 | 0.8115 | 1.2712 | 1.2678 |
| mixer_b16_224 | 128 | 0.9999 | 0.9976 | 1.2593 | 1.2499 |
| beit_base_patch16_224 | 64 | 1.0 | 0.9785 | 1.2465 | 1.2307 |
| deit_base_distilled_patch16_224 | 64 | 0.9997 | 0.9913 | 1.2391 | 1.222 |
| visformer_small | 128 | 0.9996 | 0.999 | 1.231 | 1.1753 |
| dpn107 | 32 | 0.9569 | 0.9281 | 1.2072 | 1.183 |
| sebotnet33ts_256 | 64 | 0.9657 | 0.8369 | 1.2037 | 1.1982 |
| tf_mixnet_l | 128 | 0.9785 | 0.9092 | 1.1794 | 1.1732 |
| mixnet_l | 128 | 0.9797 | 0.9055 | 1.1618 | 1.1555 |
| gluon_xception65 | 32 | 0.9996 | 0.99 | 1.159 | 1.1246 |
| vit_base_patch16_224 | 64 | 1.0 | 0.9936 | 1.1576 | 1.1465 |
| swsl_resnext101_32x16d | 32 | 0.9989 | 0.9815 | 1.1355 | 1.0556 |
| repvgg_a2 | 128 | 0.9426 | 0.9346 | 1.1034 | 1.1196 |
| gernet_l | 128 | 0.947 | 0.9378 | 1.0641 | 1.0776 |
| convmixer_768_32 | 32 | 0.9999 | 0.9982 | 1.056 | 1.0506 |
| convnext_base | 64 | 0.9995 | 0.9953 | 0.6631 | 0.6452 |
| eca_halonext26ts | 128 | 0.9813 | 0.8163 | 0.0 | 0.0 |
+---------------------------------+-----+--------+-----------+----------+------------------------+
~~~

Accuracy

~~~
+---------------------------------+----+-------------+---------------+---------------+------------------------+
| name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs |
+---------------------------------+----+-------------+---------------+---------------+------------------------+
| adv_inception_v3 | 2 | pass | pass | pass | pass |
| mobilevit_s | 2 | pass | pass | pass | pass |
| beit_base_patch16_224 | 2 | pass | pass | pass | pass |
| pnasnet5large | 2 | pass | pass | pass | pass |
| regnety_002 | 2 | pass | pass | pass | pass |
| repvgg_a2 | 2 | pass | pass | pass | pass |
| res2net101_26w_4s | 2 | pass | pass | pass | pass |
| res2net50_14w_8s | 2 | pass | pass | pass | pass |
| res2next50 | 2 | pass | pass | pass | pass |
| resmlp_12_224 | 2 | pass | pass | pass | pass |
| resnest101e | 2 | pass | pass | pass | pass |
| rexnet_100 | 2 | pass | pass | pass | pass |
| sebotnet33ts_256 | 2 | pass | pass | pass | pass |
| selecsls42b | 2 | pass | pass | pass | pass |
| swin_base_patch4_window7_224 | 2 | pass | pass | pass | pass |
| swsl_resnext101_32x16d | 2 | pass | pass | pass | pass |
| tf_efficientnet_b0 | 2 | pass | pass | pass | pass |
| tf_mixnet_l | 2 | pass | pass | pass | pass |
| tinynet_a | 2 | pass | pass | pass | pass |
| tnt_s_patch16_224 | 2 | pass | pass | pass | pass |
| twins_pcpvt_base | 2 | pass | pass | pass | pass |
| visformer_small | 2 | pass | pass | pass | pass |
| vit_base_patch16_224 | 2 | pass | pass | pass | pass |
| volo_d1_224 | 2 | pass | pass | pass | pass |
| xcit_large_24_p8_224 | 2 | pass | fail_to_run | pass | pass |
| cait_m36_384 | 2 | pass | fail_accuracy | pass | pass |
| coat_lite_mini | 2 | pass | fail_accuracy | pass | pass |
| nfnet_l0 | 2 | pass | pass | pass | pass |
| pit_b_224 | 2 | pass | pass | pass | pass |
| mobilenetv3_large_100 | 2 | pass | pass | pass | pass |
| fbnetc_100 | 2 |
pass | pass | pass | pass | | botnet26t_256 | 2 | pass | pass | pass | pass | | convmixer_768_32 | 2 | pass | pass | pass | pass | | convnext_base | 2 | pass | pass | pass | pass | | crossvit_9_240 | 2 | pass | pass | pass | pass | | cspdarknet53 | 2 | pass | pass | pass | pass | | deit_base_distilled_patch16_224 | 2 | pass | pass | pass | pass | | dla102 | 2 | pass | pass | pass | pass | | dm_nfnet_f0 | 2 | pass | pass | pass | pass | | dpn107 | 2 | pass | pass | pass | pass | | mobilenetv2_100 | 2 | pass | pass | pass | pass | | ese_vovnet19b_dw | 2 | pass | pass | pass | pass | | eca_botnext26ts_256 | 2 | pass | pass | pass | pass | | gernet_l | 2 | pass | pass | pass | pass | | jx_nest_base | 2 | pass | pass | pass | pass | | mnasnet_100 | 2 | pass | pass | pass | pass | | mixnet_l | 2 | pass | pass | pass | pass | | ghostnet_100 | 2 | pass | pass | pass | pass | | lcnet_050 | 2 | pass | pass | pass | pass | | mixer_b16_224 | 2 | pass | pass | pass | pass | | inception_v3 | 2 | pass | pass | pass | pass | | hrnet_w18 | 2 | pass | pass | pass | pass | | gmlp_s16_224 | 2 | pass | pass | pass | pass | | gmixer_24_224 | 2 | pass | pass | pass | pass | | gluon_inception_v3 | 2 | pass | pass | pass | pass | | convit_base | 2 | fail_to_run | fail_to_run | fail_to_run | fail_to_run | | eca_halonext26ts | 2 | pass | pass | fail_to_run | fail_accuracy | | fbnetv3_b | 2 | pass | pass | fail_accuracy | fail_accuracy | | gluon_xception65 | 2 | pass | pass | fail_accuracy | fail_accuracy | | poolformer_m36 | 2 | pass | pass | fail_accuracy | fail_accuracy | | spnasnet_100 | 2 | pass | pass | fail_accuracy | fail_accuracy | +---------------------------------+----+-------------+---------------+---------------+------------------------+ ~~~ Compilation latency (sec) ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | 
+---------------------------------+-----+--------+-----------+----------+------------------------+ | hrnet_w18 | 128 | 6.931 | 30.4857 | 150.2292 | 136.4794 | | twins_pcpvt_base | 64 | 2.9951 | 15.3979 | 130.834 | 129.5663 | | pnasnet5large | 16 | 5.6391 | 23.8783 | 92.764 | 87.4451 | | xcit_large_24_p8_224 | 5 | 3.5596 | nan | 92.0008 | 88.4883 | | cait_m36_384 | 4 | 3.815 | 19.5988 | 86.6496 | 82.3462 | | swin_base_patch4_window7_224 | 64 | 3.2903 | 13.3388 | 82.786 | 80.3175 | | resnest101e | 64 | 3.7295 | 16.5156 | 79.7768 | 72.6928 | | convnext_base | 64 | 1.5651 | 6.9611 | 77.0414 | 72.2102 | | mobilevit_s | 64 | 2.0202 | 7.6097 | 71.0608 | 67.871 | | jx_nest_base | 32 | 2.0348 | 9.2647 | 65.5891 | 63.0269 | | res2net101_26w_4s | 64 | 3.5565 | 16.9651 | 64.5817 | 60.2342 | | coat_lite_mini | 128 | 1.3165 | 5.4821 | 61.5139 | 59.5614 | | res2net50_14w_8s | 128 | 3.1746 | 14.6573 | 57.6172 | 53.9023 | | poolformer_m36 | 64 | 1.9082 | 7.4302 | 56.0511 | 52.1659 | | sebotnet33ts_256 | 64 | 1.9509 | 6.2361 | 48.1373 | 46.034 | | gmlp_s16_224 | 128 | 1.4987 | 7.4523 | 47.2443 | 44.1018 | | dpn107 | 32 | 4.306 | 13.9007 | 47.0541 | 43.9241 | | crossvit_9_240 | 128 | 1.872 | 8.655 | 45.8403 | 43.3586 | | fbnetv3_b | 128 | 3.531 | 11.7421 | 45.6063 | 42.7345 | | gluon_xception65 | 32 | 2.3146 | 11.0104 | 45.2269 | 42.718 | | volo_d1_224 | 64 | 1.4525 | 7.6563 | 45.068 | 42.3737 | | tnt_s_patch16_224 | 128 | 2.0252 | 11.3096 | 43.6365 | 40.114 | | gluon_inception_v3 | 128 | 1.8479 | 8.4126 | 39.904 | 36.691 | | eca_botnext26ts_256 | 128 | 1.5477 | 5.0427 | 39.792 | 39.2614 | | inception_v3 | 128 | 1.8263 | 8.4564 | 39.4032 | 36.5655 | | dla102 | 128 | 2.1101 | 9.6008 | 39.3235 | 36.4951 | | ghostnet_100 | 128 | 3.3877 | 9.9212 | 39.1227 | 36.6041 | | adv_inception_v3 | 128 | 1.8288 | 8.4327 | 38.6505 | 37.1302 | | gmixer_24_224 | 128 | 1.6172 | 8.3054 | 38.152 | 35.4031 | | tf_mixnet_l | 128 | 6.2003 | 12.9642 | 37.9262 | 36.0038 | | swsl_resnext101_32x16d | 32 | 2.196 
| 9.2607 | 37.2668 | 34.7827 | | mixnet_l | 128 | 5.7372 | 12.878 | 37.1291 | 35.036 | | botnet26t_256 | 128 | 1.5761 | 4.4983 | 35.0677 | 34.1297 | | dm_nfnet_f0 | 128 | 2.3046 | 7.4564 | 33.9742 | 32.2667 | | res2next50 | 128 | 1.7858 | 8.2631 | 32.7063 | 30.2447 | | convit_base | 64 | 1.3665 | 6.2292 | 31.9169 | 30.7947 | | tinynet_a | 128 | 2.3455 | 8.1442 | 31.7491 | 30.1005 | | rexnet_100 | 128 | 2.1214 | 7.4928 | 31.5534 | 29.7933 | | tf_efficientnet_b0 | 128 | 2.0551 | 7.0695 | 27.8237 | 25.4269 | | cspdarknet53 | 64 | 2.6122 | 7.5394 | 27.1407 | 25.0663 | | spnasnet_100 | 128 | 2.3143 | 6.7729 | 26.5412 | 24.7965 | | mixer_b16_224 | 128 | 0.8987 | 3.7968 | 26.3356 | 25.4143 | | fbnetc_100 | 128 | 2.3512 | 7.072 | 25.8561 | 24.325 | | convmixer_768_32 | 32 | 1.3946 | 6.5936 | 25.7018 | 24.5606 | | pit_b_224 | 64 | 1.248 | 5.4003 | 25.1364 | 23.8347 | | deit_base_distilled_patch16_224 | 64 | 1.0536 | 5.3764 | 25.1177 | 25.2995 | | visformer_small | 128 | 1.0378 | 4.2325 | 25.1067 | 23.9944 | | vit_base_patch16_224 | 64 | 1.1538 | 4.7124 | 24.8839 | 23.7608 | | nfnet_l0 | 128 | 2.0544 | 7.5252 | 24.7692 | 22.9685 | | resmlp_12_224 | 128 | 0.7995 | 3.1912 | 24.6682 | 22.7338 | | mobilenetv3_large_100 | 128 | 1.8934 | 5.835 | 23.9546 | 23.1319 | | beit_base_patch16_224 | 64 | 1.4003 | 5.8776 | 23.4187 | 21.9121 | | mobilenetv2_100 | 128 | 1.9196 | 5.658 | 22.5906 | 21.5039 | | repvgg_a2 | 128 | 2.1604 | 6.1376 | 22.2589 | 21.2079 | | mnasnet_100 | 128 | 1.8818 | 5.5318 | 21.8075 | 19.8951 | | regnety_002 | 128 | 1.7886 | 5.8636 | 21.8 | 20.2519 | | gernet_l | 128 | 2.1647 | 6.2133 | 20.999 | 19.9016 | | selecsls42b | 128 | 0.9436 | 3.8595 | 18.5606 | 17.3978 | | lcnet_050 | 128 | 1.1515 | 3.4232 | 15.2332 | 14.6341 | | ese_vovnet19b_dw | 128 | 1.1361 | 3.1755 | 14.4607 | 13.635 | | eca_halonext26ts | 128 | 1.6025 | 5.1343 | nan | nan | +---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Peak Memory Compression 
Ratio ~~~ +---------------------------------+-----+--------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+--------+-----------+----------+------------------------+ | tinynet_a | 128 | 0.9889 | 0.7884 | 1.3706 | 1.5063 | | gmixer_24_224 | 128 | 0.9926 | 0.9699 | 1.3138 | 1.3772 | | gmlp_s16_224 | 128 | 0.9937 | 0.9715 | 1.2842 | 1.2997 | | tf_efficientnet_b0 | 128 | 0.9882 | 0.7693 | 1.1886 | 1.3558 | | mobilevit_s | 64 | 0.9931 | 0.7669 | 1.1741 | 1.3111 | | pnasnet5large | 16 | 1.0575 | 0.9913 | 1.1605 | 1.2933 | | rexnet_100 | 128 | 0.9885 | 0.785 | 1.1474 | 1.3179 | | eca_botnext26ts_256 | 128 | 0.9886 | 0.77 | 1.1068 | 1.2643 | | poolformer_m36 | 64 | 0.9979 | 0.9432 | 1.1021 | 1.1167 | | resnest101e | 64 | 0.995 | 0.9889 | 1.0592 | 1.1461 | | mobilenetv2_100 | 128 | 0.9863 | 0.7642 | 1.0587 | 1.152 | | tnt_s_patch16_224 | 128 | 0.9945 | 0.9729 | 1.0576 | 1.1456 | | convit_base | 64 | 0.9966 | 0.8516 | 1.0441 | 1.1492 | | dm_nfnet_f0 | 128 | 0.969 | 0.898 | 1.0332 | 1.1293 | | nfnet_l0 | 128 | 0.9884 | 0.8173 | 1.0332 | 1.1822 | | volo_d1_224 | 64 | 0.9965 | 0.9475 | 1.0227 | 1.1355 | | beit_base_patch16_224 | 64 | 0.9952 | 0.9327 | 0.9889 | 1.0322 | | fbnetv3_b | 128 | 0.9872 | 0.7836 | 0.9862 | 1.0421 | | convmixer_768_32 | 32 | 0.9972 | 0.9788 | 0.9746 | 0.9788 | | visformer_small | 128 | 0.9899 | 0.9259 | 0.9622 | 1.0521 | | dla102 | 128 | 0.9694 | 0.912 | 0.9555 | 1.031 | | ghostnet_100 | 128 | 0.9756 | 0.87 | 0.9489 | 1.0707 | | twins_pcpvt_base | 64 | 0.9945 | 0.9232 | 0.9397 | 1.076 | | tf_mixnet_l | 128 | 0.991 | 0.8555 | 0.9363 | 1.0878 | | xcit_large_24_p8_224 | 5 | 0.9975 | nan | 0.932 | 0.9931 | | mobilenetv3_large_100 | 128 | 0.9772 | 0.84 | 0.9307 | 1.0268 | | cait_m36_384 | 4 | 0.9998 | 0.9141 | 0.9288 | 0.9735 | | ese_vovnet19b_dw | 128 | 0.9858 | 0.8566 | 0.9181 | 1.0684 | | pit_b_224 | 64 | 0.999 | 0.8053 | 0.9165 | 1.1168 | | 
swsl_resnext101_32x16d | 32 | 0.9989 | 0.879 | 0.9112 | 0.981 | | dpn107 | 32 | 0.997 | 0.9097 | 0.9069 | 0.9966 | | res2net101_26w_4s | 64 | 0.9937 | 0.9151 | 0.8977 | 0.973 | | inception_v3 | 128 | 0.9824 | 0.8621 | 0.8975 | 1.0248 | | gluon_inception_v3 | 128 | 0.9824 | 0.8621 | 0.8975 | 1.0248 | | adv_inception_v3 | 128 | 0.9824 | 0.8621 | 0.8975 | 1.0248 | | gluon_xception65 | 32 | 0.9955 | 0.8859 | 0.8975 | 0.9763 | | fbnetc_100 | 128 | 0.98 | 0.8491 | 0.8973 | 0.9876 | | hrnet_w18 | 128 | 0.9914 | 0.9176 | 0.8969 | 1.0032 | | mixer_b16_224 | 128 | 0.992 | 0.9574 | 0.8927 | 0.963 | | selecsls42b | 128 | 0.9789 | 0.876 | 0.8926 | 0.9897 | | vit_base_patch16_224 | 64 | 0.9955 | 0.9342 | 0.8877 | 0.8929 | | deit_base_distilled_patch16_224 | 64 | 0.9944 | 0.9332 | 0.8872 | 0.8923 | | spnasnet_100 | 128 | 0.9788 | 0.8801 | 0.8795 | 0.9819 | | res2net50_14w_8s | 128 | 0.9908 | 0.9072 | 0.877 | 0.9738 | | res2next50 | 128 | 0.9913 | 0.91 | 0.8719 | 0.9671 | | mnasnet_100 | 128 | 0.9765 | 0.8701 | 0.871 | 0.9804 | | mixnet_l | 128 | 0.9902 | 0.8441 | 0.8701 | 1.0089 | | gernet_l | 128 | 0.9794 | 0.8503 | 0.8619 | 0.9858 | | cspdarknet53 | 64 | 0.9915 | 0.8405 | 0.8607 | 1.0102 | | botnet26t_256 | 128 | 0.9849 | 0.864 | 0.8503 | 0.9434 | | lcnet_050 | 128 | 0.9433 | 0.7566 | 0.8449 | 0.9432 | | regnety_002 | 128 | 0.9504 | 0.7948 | 0.8371 | 1.0078 | | convnext_base | 64 | 1.003 | 0.9263 | 0.806 | 0.9865 | | resmlp_12_224 | 128 | 0.9827 | 0.9508 | 0.7981 | 0.8121 | | sebotnet33ts_256 | 64 | 0.9928 | 0.7073 | 0.745 | 0.8294 | | coat_lite_mini | 128 | 1.0338 | 0.9202 | 0.7194 | 1.0197 | | crossvit_9_240 | 128 | 0.9854 | 0.8707 | 0.7141 | 0.9624 | | jx_nest_base | 32 | 0.9983 | 0.8927 | 0.6644 | 0.8514 | | swin_base_patch4_window7_224 | 64 | 0.9966 | 0.9203 | 0.6295 | 0.7419 | | repvgg_a2 | 128 | 0.9767 | 0.7822 | 0.5534 | 0.8298 | | eca_halonext26ts | 128 | 0.9886 | 0.7747 | nan | nan | 
+---------------------------------+-----+--------+-----------+----------+------------------------+ ~~~ Absolute latency (ms) ~~~ +---------------------------------+-----+----------+-----------+----------+------------------------+ | name | bs | eager | aot_eager | inductor | inductor_no_cudagraphs | +---------------------------------+-----+----------+-----------+----------+------------------------+ | convmixer_768_32 | 32 | 296.486 | 296.8557 | 280.7942 | 282.1234 | | tnt_s_patch16_224 | 128 | 363.6214 | 364.7147 | 189.0574 | 191.9649 | | hrnet_w18 | 128 | 297.7562 | 289.8731 | 188.7442 | 221.1564 | | convnext_base | 64 | 121.4143 | 121.6429 | 183.0963 | 187.5732 | | pnasnet5large | 16 | 229.4869 | 221.4024 | 168.9324 | 173.7203 | | tf_mixnet_l | 128 | 195.1447 | 210.0817 | 162.2243 | 162.9431 | | mixnet_l | 128 | 186.6718 | 201.9991 | 157.43 | 158.2623 | | convit_base | 64 | 181.2822 | 181.7059 | 130.2669 | 137.5216 | | pit_b_224 | 64 | 154.8196 | 155.4562 | 117.5385 | 118.1064 | | cait_m36_384 | 4 | 165.9859 | 164.6621 | 117.3606 | 121.8195 | | dla102 | 128 | 178.2148 | 179.1855 | 112.827 | 115.1884 | | poolformer_m36 | 64 | 148.8974 | 149.0187 | 112.0757 | 114.8132 | | beit_base_patch16_224 | 64 | 134.9152 | 137.7806 | 108.2304 | 109.6662 | | resnest101e | 64 | 167.9436 | 165.3773 | 108.0546 | 113.5857 | | adv_inception_v3 | 128 | 160.9935 | 161.5713 | 107.1038 | 109.8873 | | inception_v3 | 128 | 160.6292 | 161.0654 | 107.0841 | 109.5081 | | gluon_inception_v3 | 128 | 160.974 | 161.5206 | 107.0629 | 109.3358 | | vit_base_patch16_224 | 64 | 120.4637 | 121.1913 | 104.0355 | 104.985 | | swsl_resnext101_32x16d | 32 | 117.7744 | 120.0279 | 103.9766 | 111.3971 | | res2net50_14w_8s | 128 | 145.4328 | 146.8044 | 99.6889 | 104.0853 | | swin_base_patch4_window7_224 | 64 | 147.0818 | 153.2668 | 99.4041 | 104.0568 | | res2next50 | 128 | 138.6325 | 138.6725 | 97.7916 | 102.4728 | | mixer_b16_224 | 128 | 118.3458 | 118.6056 | 94.006 | 94.6202 | | dpn107 | 32 | 114.1541 | 
115.883 | 93.7816 | 91.7772 | | gmlp_s16_224 | 128 | 136.292 | 136.5303 | 89.4771 | 90.6161 | | jx_nest_base | 32 | 118.8976 | 119.7334 | 87.3509 | 89.5918 | | dm_nfnet_f0 | 128 | 131.6929 | 131.5566 | 87.1907 | 91.5997 | | volo_d1_224 | 64 | 134.5478 | 134.9864 | 86.5848 | 88.2192 | | eca_botnext26ts_256 | 128 | 112.1036 | 135.4767 | 86.3948 | 86.6135 | | gluon_xception65 | 32 | 97.8576 | 98.5746 | 84.3137 | 86.6696 | | fbnetv3_b | 128 | 120.8277 | 122.5649 | 83.0267 | 84.5648 | | gmixer_24_224 | 128 | 119.7908 | 136.0844 | 80.2727 | 80.8411 | | visformer_small | 128 | 98.1431 | 97.9784 | 79.8902 | 83.4777 | | botnet26t_256 | 128 | 106.0373 | 106.5229 | 78.4519 | 77.8833 | | crossvit_9_240 | 128 | 109.2776 | 109.8997 | 78.2862 | 79.7036 | | res2net101_26w_4s | 64 | 121.7017 | 129.0133 | 77.7325 | 95.1401 | | twins_pcpvt_base | 64 | 125.2206 | 143.6159 | 76.4556 | 81.9545 | | deit_base_distilled_patch16_224 | 64 | 94.1628 | 94.926 | 75.9289 | 76.9224 | | coat_lite_mini | 128 | 115.747 | 117.2487 | 72.9963 | 73.6657 | | gernet_l | 128 | 79.6333 | 80.5474 | 70.9857 | 70.0816 | | cspdarknet53 | 64 | 95.9161 | 96.6293 | 69.3552 | 68.1578 | | rexnet_100 | 128 | 90.8942 | 103.0498 | 68.8717 | 68.7081 | | repvgg_a2 | 128 | 79.6315 | 80.322 | 68.2501 | 67.1488 | | nfnet_l0 | 128 | 106.2833 | 131.0196 | 68.098 | 72.2292 | | sebotnet33ts_256 | 64 | 83.2625 | 96.032 | 66.8136 | 66.9768 | | tf_efficientnet_b0 | 128 | 90.5795 | 108.2595 | 65.9276 | 64.3744 | | mobilevit_s | 64 | 89.9782 | 107.4419 | 64.2514 | 64.3669 | | xcit_large_24_p8_224 | 5 | 128.6823 | nan | 62.0273 | 73.0838 | | fbnetc_100 | 128 | 87.9137 | 88.9833 | 61.9827 | 60.9264 | | tinynet_a | 128 | 75.7975 | 90.8109 | 58.0368 | 60.6362 | | spnasnet_100 | 128 | 76.555 | 77.3926 | 53.6136 | 54.5766 | | resmlp_12_224 | 128 | 68.1068 | 68.3201 | 51.2553 | 52.5835 | | ese_vovnet19b_dw | 128 | 67.7937 | 68.2858 | 47.9989 | 47.7794 | | mnasnet_100 | 128 | 69.989 | 70.7781 | 47.2192 | 45.7042 | | ghostnet_100 | 128 | 
95.9114 | 97.6094 | 46.0617 | 54.3073 | | mobilenetv2_100 | 128 | 67.4917 | 68.2319 | 45.9347 | 44.8565 | | selecsls42b | 128 | 62.7781 | 62.9914 | 43.545 | 44.503 | | mobilenetv3_large_100 | 128 | 66.0257 | 66.6214 | 43.415 | 44.0017 | | regnety_002 | 128 | 57.9244 | 60.1452 | 25.5235 | 37.5032 | | lcnet_050 | 128 | 34.0898 | 34.6161 | 16.3715 | 20.7263 | | eca_halonext26ts | 128 | 115.8373 | 139.2614 | nan | nan | +---------------------------------+-----+----------+-----------+----------+------------------------+ ~~~
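The summary at the top of this dashboard reports a geometric mean speedup per backend after removing the models that fail accuracy checks. The following is a minimal sketch of that aggregation, not the dashboard's actual code: the function name `geomean_speedup` and the dictionary layout are assumptions made for illustration, and only five `inductor` rows from the timm_models tables above are used as sample input.

```python
import math

def geomean_speedup(speedups, accuracy):
    """Geometric mean of per-model speedups, keeping only models whose
    accuracy status is "pass" (hypothetical helper, not dashboard code)."""
    kept = [
        value
        for name, value in speedups.items()
        if accuracy.get(name) == "pass" and value > 0
    ]
    if not kept:
        return float("nan")
    # exp(mean(log(x))) is the geometric mean and avoids overflow on long lists.
    return math.exp(sum(math.log(v) for v in kept) / len(kept))

# Sample of the inductor column from the tables above.
inductor_speedup = {
    "regnety_002": 2.1425,
    "ghostnet_100": 2.1277,
    "convnext_base": 0.6631,
    "fbnetv3_b": 1.4041,         # fail_accuracy, so excluded
    "eca_halonext26ts": 0.0,     # fail_to_run, so excluded
}
inductor_accuracy = {
    "regnety_002": "pass",
    "ghostnet_100": "pass",
    "convnext_base": "pass",
    "fbnetv3_b": "fail_accuracy",
    "eca_halonext26ts": "fail_to_run",
}

result = geomean_speedup(inductor_speedup, inductor_accuracy)
print(f"inductor geomean over passing models: {result:.4f}x")
```

Filtering on the accuracy status rather than on the speedup value alone matters here: `fbnetv3_b` has a non-zero measured speedup but still fails the accuracy check, so it would otherwise inflate the mean.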

Performance graphs

../test-dynamo-runner-logs-12/huggingface_amp.png : ![](https://i.imgur.com/oS2MDqP.png)

../test-dynamo-runner-logs-12/timm_models_amp.png : ![](https://i.imgur.com/IdNOAN8.png)

../test-dynamo-runner-logs-12/torchbench_amp.png : ![](https://i.imgur.com/llM1b2H.png)

Build Summary

### Run name ###
day_327_23_11_22_performance_amp_817

### Commit hashes ###
pytorch commit: 902e4e3926a9333178510f032580e4acd56c40da
pytorch commit date: 2022-11-23 19:05:14+00:00
functorch Absent
torchbench commit: 63d4037c8738908f3edfb3f7af69888378f57929
torchbench commit date: 2022-11-03 11:18:02-07:00

### TorchDynamo config flags ###
torch._dynamo.config.HAS_REFS_PRIMS = True
torch._dynamo.config.capture_scalar_outputs = False
torch._dynamo.config.dead_code_elimination = True
torch._dynamo.config.dynamic_propagation = True
torch._dynamo.config.dynamic_shapes = False
torch._dynamo.config.enforce_cond_guards_match = True
torch._dynamo.config.error_on_nested_fx_trace = True
torch._dynamo.config.fake_tensor_propagation = True
torch._dynamo.config.guard_nn_modules = False
torch._dynamo.config.normalize_ir = False
torch._dynamo.config.optimize_ddp = False
torch._dynamo.config.print_graph_breaks = False
torch._dynamo.config.raise_on_ctx_manager_usage = True
torch._dynamo.config.raise_on_unsafe_aot_autograd = False
torch._dynamo.config.replay_record_enabled = False
torch._dynamo.config.specialize_int_float = True
torch._dynamo.config.suppress_errors = False
torch._dynamo.config.verbose = False
torch._dynamo.config.verify_correctness = False

### Torch version ###
torch: 1.14.0.dev20221114+cu116

### Environment variables ###
TORCH_CUDA_ARCH_LIST = 8.0
CUDA_HOME = /usr/local/cuda-11.6
USE_LLVM = /usr/lib/llvm-10

### GPU details ###
CUDNN VERSION: 8302
Number CUDA Devices: 2
Device Name: NVIDIA A100-SXM4-40GB
Device Memory [GB]: 42.314694656
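When comparing two dashboard runs, it can help to diff the TorchDynamo config flags recorded in their build summaries. The sketch below is one possible way to do that with plain text processing; the helper names `parse_dynamo_flags` and `changed_flags` are hypothetical, it does not import or touch `torch`, and it assumes every flag value is a literal `True` or `False`, as in the summary above.

```python
import re

# Matches entries such as "torch._dynamo.config.dynamic_shapes = False".
# Assumption: all flag values are the literals True or False.
FLAG_PATTERN = re.compile(r"torch\._dynamo\.config\.(\w+)\s*=\s*(True|False)")

def parse_dynamo_flags(summary_text):
    """Return {flag_name: bool} for every config flag found in the text."""
    return {name: value == "True" for name, value in FLAG_PATTERN.findall(summary_text)}

def changed_flags(old, new):
    """Return {flag_name: (old_value, new_value)} for flags that differ."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}

# Three flags copied from the build summary above.
summary = (
    "torch._dynamo.config.dynamic_shapes = False "
    "torch._dynamo.config.verbose = False "
    "torch._dynamo.config.dead_code_elimination = True"
)
flags = parse_dynamo_flags(summary)

# Hypothetical second run with dynamic shapes enabled, for illustration only.
other_run = dict(flags, dynamic_shapes=True)
print(changed_flags(flags, other_run))
```

Because the pattern searches anywhere in the text, it works whether the flags appear one per line or run together on a single line.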

pytorch / torchdynamo

Test Issue for Dashboard Improvements #1831
