intel / ai-reference-models

Intel® AI Reference Models: contains Intel optimizations for running deep learning workloads on Intel® Xeon® Scalable processors and Intel® Data Center GPUs
Apache License 2.0

Distilbert run on CPU #181

Open swamysaranam opened 3 months ago

swamysaranam commented 3 months ago

I have followed the instructions provided in https://github.com/intel/models/blob/master/quickstart/language_modeling/pytorch/distilbert_base/inference/cpu/README.md

I am using the quickstart/language_modeling/pytorch/distilbert_base/inference/cpu/run_multi_instance_realtime.sh script to measure latency and profile the run.

I enabled the following environment variables to profile the run:

export DNNL_VERBOSE=1
export DNNL_VERBOSE_TIMESTAMP=1

I notice that the log contains layer_normalization and eltwise entries with forward_training, but there is no log entry for matmul/inner product or softmax in the latency*.log file.

Pasting the partial dump for reference:

onednn_verbose,1718391711600.335938,primitive,exec,cpu,layer_normalization,jit:uni,forward_training,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0 stats_f32::blocked:ab::f0 scale_f32::blocked:a::f0 shift_f32::blocked:a::f0,attr-scratchpad:user ,flags:CH,1x384x768,0.0361328
onednn_verbose,1718391711605.524902,primitive,exec,cpu,layer_normalization,jit:uni,forward_training,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0 stats_f32::blocked:ab::f0 scale_f32::blocked:a::f0 shift_f32::blocked:a::f0,attr-scratchpad:user ,flags:CH,1x384x768,0.0251465
onednn_verbose,1718391711609.137939,primitive,exec,cpu,eltwise,jit:avx2,forward_training,data_f32::blocked:abc::f0 diff_undef::undef:::,attr-scratchpad:user ,alg:eltwise_gelu_erf alpha:0 beta:0,1x384x3072,0.191162
onednn_verbose,1718391711613.104980,primitive,exec,cpu,layer_normalization,jit:uni,forward_training,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0 stats_f32::blocked:ab::f0 scale_f32::blocked:a::f0 shift_f32::blocked:a::f0,attr-scratchpad:user ,flags:CH,1x384x768,0.0349121
onednn_verbose,1718391711618.121094,primitive,exec,cpu,layer_normalization,jit:uni,forward_training,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0 stats_f32::blocked:ab::f0 scale_f32::blocked:a::f0 shift_f32::blocked:a::f0,attr-scratchpad:user ,flags:CH,1x384x768,0.0168457
onednn_verbose,1718391711621.819092,primitive,exec,cpu,eltwise,jit:avx2,forward_training,data_f32::blocked:abc::f0 diff_undef::undef:::,attr-scratchpad:user ,alg:eltwise_gelu_erf alpha:0 beta:0,1x384x3072,0.210938
onednn_verbose,1718391711625.866943,primitive,exec,cpu,layer_normalization,jit:uni,forward_training,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0 stats_f32::blocked:ab::f0 scale_f32::blocked:a::f0 shift_f32::blocked:a::f0,attr-scratchpad:user ,flags:CH,1x384x768,0.0371094
onednn_verbose,1718391711630.948975,primitive,exec,cpu,layer_normalization,jit:uni,forward_training,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0 stats_f32::blocked:ab::f0 scale_f32::blocked:a::f0 shift_f32::blocked:a::f0,attr-scratchpad:user ,flags:CH,1x384x768,0.0161133
onednn_verbose,1718391711634.527100,primitive,exec,cpu,eltwise,jit:avx2,forward_training,data_f32::blocked:abc::f0 diff_undef::undef:::,attr-scratchpad:user ,alg:eltwise_gelu_erf alpha:0 beta:0,1x384x3072,0.24292
onednn_verbose,1718391711638.688965,primitive,exec,cpu,layer_normalization,jit:uni,forward_training,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0 stats_f32::blocked:ab::f0 scale_f32::blocked:a::f0 shift_f32::blocked:a::f0,attr-scratchpad:user ,flags:CH,1x384x768,0.0380859
100%|██████████| 100/100 [00:14<00:00, 6.51it/s]
0%| | 0/100 [00:00<?, ?it/s]
96%|█████████▌| 96/100 [00:00<00:00, 952.32it/s]
100%|██████████| 100/100 [00:00<00:00, 954.63it/s]
100%|██████████| 100/100 [00:15<00:00, 6.60it/s]
** eval metrics **
eval_exact_match = 85.0
eval_f1 = 90.5
eval_samples = 100
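
To check which primitive kinds a verbose log actually contains, rather than scanning it by eye, the entries can be tallied with a short script. The following is only a minimal sketch. It assumes the raw log has one entry per line and uses the field layout visible in the dump above (primitive kind in the sixth comma-separated field, execution time in milliseconds in the last one); other oneDNN versions may lay the line out differently. The names `summarize_verbose`, `missing_primitives`, and `EXPECTED` are hypothetical and are not part of the quickstart scripts or of oneDNN.

```python
from collections import defaultdict

# Primitive kinds one might look for in a transformer forward pass.
# This set is an illustrative assumption, not something defined by the
# quickstart scripts or by oneDNN.
EXPECTED = {"matmul", "inner_product", "softmax", "layer_normalization", "eltwise"}


def summarize_verbose(lines):
    """Return {primitive_kind: (call_count, total_ms)} for exec entries.

    `lines` is any iterable of strings, for example an open log file.
    Lines that are not oneDNN verbose exec entries (progress bars,
    eval metrics, and so on) are ignored.
    """
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        fields = line.strip().split(",")
        # Assumed layout: onednn_verbose,<timestamp>,primitive,exec,<engine>,<kind>,...,<ms>
        if len(fields) < 7 or fields[0] != "onednn_verbose":
            continue
        if fields[2:4] != ["primitive", "exec"]:
            continue
        try:
            elapsed_ms = float(fields[-1])
        except ValueError:
            continue  # trailing field is not a number, so skip the line
        kind = fields[5]
        stats[kind][0] += 1
        stats[kind][1] += elapsed_ms
    return {kind: (count, total) for kind, (count, total) in stats.items()}


def missing_primitives(stats, expected=EXPECTED):
    """Return the expected primitive kinds that never appear in `stats`."""
    return sorted(set(expected) - set(stats))
```

Passing an open log file to `summarize_verbose` and then calling `missing_primitives` on the result lists the kinds that were never logged. An empty tally for matmul only shows that no such entry was written to that file; it does not by itself say which library executed the operation.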

Kindly let me know if I am using the correct scripts. If so, could someone please help me generate the full log?

P.S. I see similar behavior for bert_base as well.

Thanks, Swamy.