[Closed] jerin-scalers-ai closed this issue 5 months ago.
The inference time in bfloat16 depends on the hardware you are using, as well as on the PyTorch version. I recommend using float16 for inference.
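For reference, a minimal sketch of loading the model in float16 instead of bfloat16, as suggested above (the checkpoint name and generation settings are assumptions, not taken from the thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; substitute the model you are benchmarking.
model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype selects the precision the weights are loaded in.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
```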
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
- Transformers: 4.35.2
- Torch: 2.1.1 (CPU build)
- CPU: Intel Xeon 4th Gen processor
Who can help?
@ArthurZucker Hi, I was comparing the performance of the Llama 2 7b chat hf model at different precisions. I observed a significant degradation in performance (inference time) with bfloat16 compared to the fp32 model on an Intel CPU. bf16 is supposed to give better performance than fp32. Please refer to the table below for details:
Information
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
Reproduction
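The original reproduction script is not shown in the thread; the sketch below is a hypothetical way to reproduce the comparison, timing generation at fp32 and bf16 (model name, prompt, and token count are assumptions):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Explain bfloat16 in one sentence."

for dtype in (torch.float32, torch.bfloat16):
    # Reload the model at each precision under test.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")

    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=64)
    elapsed = time.perf_counter() - start
    print(f"{dtype}: {elapsed:.2f}s")
```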
Expected behavior
bf16 is supposed to give better performance than fp32, since 4th Gen Intel Xeon processors include AMX instructions that accelerate bf16 matrix operations.