pytorch / FBGEMM

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

FbgemmF16 is not always faster than "PrePacked" OpenBLAS SGEMM #41

Closed: lifeiteng closed this issue 2 years ago

lifeiteng commented 5 years ago

Environment:

MacBook Pro (15-inch, 2017)
2.9 GHz Intel Core i7

Change to OpenBLAS:

Test results (cblas_sgemm, no transpose); a note after the results sketches how the Gflops and GBytes columns appear to be computed:

   OPENBLAS_PrePacked_FP32 m =     1 n =  3072 k =  1024 Gflops =  17.4617 GBytes =  17.5072
    FBGEMM_N6fbgemm5__f16E m =     1 n =  3072 k =  1024 Gflops =  12.0319 GBytes =  12.0632

   OPENBLAS_PrePacked_FP32 m =     2 n =  3072 k =  1024 Gflops =  30.1207 GBytes =  15.1388
    FBGEMM_N6fbgemm5__f16E m =     2 n =  3072 k =  1024 Gflops =  20.4900 GBytes =  10.2984

   OPENBLAS_PrePacked_FP32 m =     3 n =  3072 k =  1024 Gflops =  29.9095 GBytes =  10.0477
    FBGEMM_N6fbgemm5__f16E m =     3 n =  3072 k =  1024 Gflops =  32.2804 GBytes =  10.8442

   OPENBLAS_PrePacked_FP32 m =     4 n =  3072 k =  1024 Gflops =  63.1276 GBytes =  15.9463
    FBGEMM_N6fbgemm5__f16E m =     4 n =  3072 k =  1024 Gflops =  43.8964 GBytes =  11.0884

   OPENBLAS_PrePacked_FP32 m =     5 n =  3072 k =  1024 Gflops =  49.5468 GBytes =  10.0384
    FBGEMM_N6fbgemm5__f16E m =     5 n =  3072 k =  1024 Gflops =  54.3773 GBytes =  11.0171

   OPENBLAS_PrePacked_FP32 m =     6 n =  3072 k =  1024 Gflops =  58.9594 GBytes =   9.9801
    FBGEMM_N6fbgemm5__f16E m =     6 n =  3072 k =  1024 Gflops =  61.0574 GBytes =  10.3352

   OPENBLAS_PrePacked_FP32 m =     7 n =  3072 k =  1024 Gflops =  47.5032 GBytes =   6.9099
    FBGEMM_N6fbgemm5__f16E m =     7 n =  3072 k =  1024 Gflops =  66.0421 GBytes =   9.6066

   OPENBLAS_PrePacked_FP32 m =     8 n =  3072 k =  1024 Gflops =  77.7143 GBytes =   9.9167
    FBGEMM_N6fbgemm5__f16E m =     8 n =  3072 k =  1024 Gflops =  69.4565 GBytes =   8.8629

   OPENBLAS_PrePacked_FP32 m =     1 n =  2688 k =   896 Gflops =  18.7028 GBytes =  18.7584
    FBGEMM_N6fbgemm5__f16E m =     1 n =  2688 k =   896 Gflops =  11.3259 GBytes =  11.3596

   OPENBLAS_PrePacked_FP32 m =     2 n =  2688 k =   896 Gflops =  33.0037 GBytes =  16.6001
    FBGEMM_N6fbgemm5__f16E m =     2 n =  2688 k =   896 Gflops =  22.9513 GBytes =  11.5439

   OPENBLAS_PrePacked_FP32 m =     3 n =  2688 k =   896 Gflops =  28.6379 GBytes =   9.6312
    FBGEMM_N6fbgemm5__f16E m =     3 n =  2688 k =   896 Gflops =  32.1949 GBytes =  10.8275

   OPENBLAS_PrePacked_FP32 m =     4 n =  2688 k =   896 Gflops =  57.9563 GBytes =  14.6616
    FBGEMM_N6fbgemm5__f16E m =     4 n =  2688 k =   896 Gflops =  42.1195 GBytes =  10.6552

   OPENBLAS_PrePacked_FP32 m =     5 n =  2688 k =   896 Gflops =  49.6843 GBytes =  10.0847
    FBGEMM_N6fbgemm5__f16E m =     5 n =  2688 k =   896 Gflops =  50.0093 GBytes =  10.1507

   OPENBLAS_PrePacked_FP32 m =     6 n =  2688 k =   896 Gflops =  58.1400 GBytes =   9.8630
    FBGEMM_N6fbgemm5__f16E m =     6 n =  2688 k =   896 Gflops =  63.4359 GBytes =  10.7614

   OPENBLAS_PrePacked_FP32 m =     7 n =  2688 k =   896 Gflops =  51.4508 GBytes =   7.5032
    FBGEMM_N6fbgemm5__f16E m =     7 n =  2688 k =   896 Gflops =  63.0543 GBytes =   9.1954

   OPENBLAS_PrePacked_FP32 m =     8 n =  2688 k =   896 Gflops =  77.9591 GBytes =   9.9769
    FBGEMM_N6fbgemm5__f16E m =     8 n =  2688 k =   896 Gflops =  71.0457 GBytes =   9.0922
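For reference, the Gflops and GBytes columns above are consistent with the standard GEMM accounting sketched below. This is an inference from the reported numbers, not the benchmark's actual code: both rows appear to charge the B matrix at 2 bytes per element (its fp16-packed size), while A and C are counted at 4 bytes per element.

```cpp
// Sketch of how each benchmark line's Gflops / GBytes figures can be derived
// from a measured runtime t (seconds). The function name and the byte formula
// are inferred from the table, not taken from the benchmark source.
#include <cstdio>

void report(const char* name, int m, int n, int k, double t) {
  double flops = 2.0 * m * n * k;          // each output element does k multiply-adds
  double bytes = 4.0 * m * k               // read A (fp32)
               + 2.0 * k * n               // read packed B (2 bytes/element, fp16 layout)
               + 4.0 * m * n;              // write C (fp32)
  std::printf("%s m = %d n = %d k = %d Gflops = %8.4f GBytes = %8.4f\n",
              name, m, n, k, flops / t / 1e9, bytes / t / 1e9);
}
```

With m this small (1 to 8), the byte term is dominated by the k*n weights, so the GEMM is essentially bandwidth-bound on loading B, which is the regime FbgemmF16 targets.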
dskhudia commented 5 years ago

Hi @lifeiteng

Thanks a lot for reporting this. Do you mind sharing reproduction instructions, i.e., is it just the OpenBLAS changes you mentioned? I want to benchmark it on our servers.

Please keep in mind that FbgemmF16 is designed to save on the memory bandwidth used to load the pre-packed B matrix (the weight matrix during inference). Computations still happen in fp32, after the fp16 values are converted to fp32 in the inner kernel. Also, FbgemmF16 is currently tuned for server-class CPUs (think bigger caches).

Thanks,
Daya
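To make the comparison concrete, here is a minimal sketch of the FbgemmF16 path described above: the fp32 weight matrix is packed (and converted to fp16) once up front, and only the compute call is timed. It assumes the fbgemm::PackedGemmMatrixFP16 / fbgemm::cblas_gemm_compute interface from FbgemmFP16.h and is an illustration, not the exact benchmark code.

```cpp
#include "fbgemm/FbgemmFP16.h"

using namespace fbgemm;

// C (m x n) = A (m x k) * B (k x n), with B held in FbgemmF16's packed fp16 form.
void fp16_gemm_example(int m, int n, int k,
                       const float* A, const float* B, float* C) {
  // One-time pre-packing of the weight matrix: fp32 -> fp16 plus a blocked
  // layout. During inference this cost is amortized over many calls.
  PackedGemmMatrixFP16 Bp(matrix_op_t::NoTranspose, k, n, /*alpha=*/1.0f, B);

  // The timed call: fp16 weights are converted back to fp32 in the inner
  // kernel, so the win comes from reading half the bytes for B, not from
  // fp16 arithmetic.
  cblas_gemm_compute(matrix_op_t::NoTranspose, m, A, Bp, /*beta=*/0.0f, C);
}
```

In a benchmark, the packing step would sit outside the timed loop, matching the "PrePacked" OpenBLAS SGEMM baseline above.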

lifeiteng commented 5 years ago

Yes, it's just the OpenBLAS changes mentioned: comment out the two lines as described above.

mjanderson09 commented 2 years ago

Closing due to no recent activity.