Open blincoln-bf opened 2 weeks ago
Hi @blincoln-bf
Could you provide the model of your CPU? Generally, CPUs like the 13th Gen Intel(R) Core(TM) i5-13600K do not have specialized instruction sets for BF16 and FP16 data formats. This results in the CPU needing to convert data types to FP32 for computation and then back to BF16 or FP16. This back-and-forth process consumes a significant amount of time.
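For example, on a Linux system you can check whether the CPU advertises any of the relevant instruction-set extensions. A rough sketch (the flag names are the common /proc/cpuinfo spellings, e.g. avx512_bf16 on recent x86 CPUs):

```python
# Rough, Linux-only check for native BF16/FP16 instruction support,
# based on the CPU flags reported in /proc/cpuinfo.
def native_half_precision_flags(path: str = "/proc/cpuinfo") -> list[str]:
    flags: set[str] = set()
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
    interesting = {"avx512_bf16", "avx512_fp16", "amx_bf16", "amx_fp16"}
    return sorted(flags & interesting)

print(native_half_precision_flags() or "no native BF16/FP16 instructions reported")
```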
I think your approach to benchmarking performance using different torch_dtypes is thorough and provides valuable insight into the performance differences on a CPU. The significant slowdown with torch.float16 across different models reinforces the need for warnings when this dtype is used on CPUs.
You might consider sharing your findings on relevant forums or with the maintainers of the Transformers library, as this could help other users avoid similar pitfalls. Adding an informative warning message when torch.float16 is used on a CPU is a practical and user-friendly solution.
Overall, the methodology and detailed performance metrics make a strong case for more awareness around dtype performance implications.
Hi @Kevin0624.
The system where I ran that benchmark script has an AMD Ryzen 9 7950X3D (16 cores) and 128 GiB of RAM, in addition to an RTX 4090.
Regardless of the reason, it seems like warning the user, either in the documentation or at runtime if they've specified a very inefficient torch_dtype for the device, would be a good idea, at least in the case of float16, since the .half() approach is suggested in so many Transformers tutorials. As I mentioned, that script just demonstrates text generation, where the difference is about an order of magnitude. For more complex work, the difference can be truly unbelievable. The most dramatic example I've found so far is with gpt-j-6b, where an operation in my project takes 25+ hours on the CPU with torch_dtype = float16, versus three minutes if the model is loaded with torch_dtype = bfloat16. For gpt-neox-20b, it's 10+ hours versus 20 minutes. For Phi-3-medium-128k-instruct, it's four hours versus 2 minutes, 30 seconds. This is with no other changes to the code, only that one flag during the model load operation.
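In other words, the only difference between the slow and fast runs is the dtype passed at load time; roughly something like this (the gpt-j-6b model id is shown purely as an example):

```python
import torch
from transformers import AutoModelForCausalLM

# Identical load calls; only torch_dtype differs.
model_slow = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b", torch_dtype=torch.float16)
model_fast = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b", torch_dtype=torch.bfloat16)
```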
However, you are correct in that it's very device-dependent. I ran the same benchmark on an M1 MacBook (using Qwen2-0.5B-Instruct, because that MacBook only has 16 GiB of RAM), and got very different results:
Finished processing prompt 0 at 2024-11-12T11:45:44 - 2.875962 seconds elapsed in total for this prompt.
Finished processing prompt 1 at 2024-11-12T11:45:46 - 1.427849 seconds elapsed in total for this prompt.
Finished processing prompt 2 at 2024-11-12T11:45:49 - 3.508065 seconds elapsed in total for this prompt.
Finished test with torch_dtype = None at 2024-11-12T11:45:49 - 8.919457 seconds elapsed for the entire test cycle using this dtype.
Finished processing prompt 0 at 2024-11-12T11:45:52 - 2.56628 seconds elapsed in total for this prompt.
Finished processing prompt 1 at 2024-11-12T11:45:54 - 1.469897 seconds elapsed in total for this prompt.
Finished processing prompt 2 at 2024-11-12T11:45:57 - 3.528596 seconds elapsed in total for this prompt.
Finished test with torch_dtype = torch.float32 at 2024-11-12T11:45:57 - 7.982476 seconds elapsed for the entire test cycle using this dtype.
Finished processing prompt 0 at 2024-11-12T11:45:59 - 0.924022 seconds elapsed in total for this prompt.
Finished processing prompt 1 at 2024-11-12T11:45:59 - 0.484926 seconds elapsed in total for this prompt.
Finished processing prompt 2 at 2024-11-12T11:46:00 - 0.912076 seconds elapsed in total for this prompt.
Finished test with torch_dtype = torch.float16 at 2024-11-12T11:46:00 - 2.669201 seconds elapsed for the entire test cycle using this dtype.
Finished processing prompt 0 at 2024-11-12T11:46:01 - 0.842285 seconds elapsed in total for this prompt.
Finished processing prompt 1 at 2024-11-12T11:46:02 - 0.600835 seconds elapsed in total for this prompt.
Finished processing prompt 2 at 2024-11-12T11:46:03 - 1.044211 seconds elapsed in total for this prompt.
Finished test with torch_dtype = torch.bfloat16 at 2024-11-12T11:46:03 - 2.711551 seconds elapsed for the entire test cycle using this dtype.
I'll run that script with Qwen2-0.5B-Instruct on a Windows laptop with an Intel processor and 16 GiB of RAM a little later today for comparison.
Script output for Qwen2-0.5B-Instruct on a Windows laptop with an Intel Core i5-10310U and 16 GiB of RAM. On this device, bfloat16 performance is almost as slow as float16, so I'm even more glad I went with AMD and Linux for my ML research system :).
Finished processing prompt 0 at 2024-11-12T12:45:03 - 9.454424 seconds elapsed in total for this prompt.
Finished processing prompt 1 at 2024-11-12T12:45:07 - 4.606776 seconds elapsed in total for this prompt.
Finished processing prompt 2 at 2024-11-12T12:45:18 - 10.964283 seconds elapsed in total for this prompt.
Finished test with torch_dtype = None at 2024-11-12T12:45:18 - 29.24945 seconds elapsed for the entire test cycle using this dtype.
Finished processing prompt 0 at 2024-11-12T12:45:29 - 8.341307 seconds elapsed in total for this prompt.
Finished processing prompt 1 at 2024-11-12T12:45:33 - 4.570537 seconds elapsed in total for this prompt.
Finished processing prompt 2 at 2024-11-12T12:45:44 - 10.71891 seconds elapsed in total for this prompt.
Finished test with torch_dtype = torch.float32 at 2024-11-12T12:45:44 - 25.648423 seconds elapsed for the entire test cycle using this dtype.
Finished processing prompt 0 at 2024-11-12T12:46:45 - 59.440162 seconds elapsed in total for this prompt.
Finished processing prompt 1 at 2024-11-12T12:47:37 - 52.715362 seconds elapsed in total for this prompt.
Finished processing prompt 2 at 2024-11-12T12:48:47 - 69.450205 seconds elapsed in total for this prompt.
Finished test with torch_dtype = torch.float16 at 2024-11-12T12:48:47 - 182.775335 seconds elapsed for the entire test cycle using this dtype.
Finished processing prompt 0 at 2024-11-12T12:49:44 - 56.260634 seconds elapsed in total for this prompt.
Finished processing prompt 1 at 2024-11-12T12:50:29 - 45.242265 seconds elapsed in total for this prompt.
Finished processing prompt 2 at 2024-11-12T12:51:29 - 59.411733 seconds elapsed in total for this prompt.
Finished test with torch_dtype = torch.bfloat16 at 2024-11-12T12:51:29 - 161.52774 seconds elapsed for the entire test cycle using this dtype.
In case you'd like to see some statistics that document the effect on a larger codebase, I ran through benchmarks on four different systems (one of which dual boots Linux and Windows) and documented them here:
TLDR:
- On the Intel CPUs I've tested so far, float32 is the only good option, although I'd like to test on some different Intel CPUs to verify.
- On the AMD CPU, float32 and bfloat16 are both great choices, but float16 is significantly slower than either of them.
System Info
Transformers versions: 4.44.2, 4.46.2
PyTorch versions: 2.4.0, 2.5.1
Python version: 3.11.2
Who can help?
No response
Information

Tasks

Reproduction
While troubleshooting very weird behaviour with GPT-NeoX when processed on a CPU device, I discovered that Transformers will load a model with torch_dtype = torch.float16 and process it on the CPU without any apparent warning or other message, but its performance is very slow compared to float32 (5+ times slower) or bfloat16 (10+ times slower). Given the amount of documentation online that suggests using .half() or torch_dtype = torch.float16 to conserve memory, can I suggest adding a warning message when a model loaded in this format is processed on a CPU device? I know float16 support for CPU at all is relatively new, but given the lack of information anywhere about the massive performance hit it currently incurs, I assumed CPU processing as a whole in Transformers was essentially unusable for real-world work (especially training / gradient operations). In reality, CPU processing is surprisingly fast when set to float32 or bfloat16 format.

Here's a quick benchmark script based on the example usage for GPT-NeoX that loads the model, then generates text using three prompts. It performs this test for torch_dtype = None, torch.float32, torch.float16, and torch.bfloat16:
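A minimal sketch of what that benchmark looks like, assuming the EleutherAI/gpt-neox-20b example model, three placeholder prompts, and simple wall-clock timing (the exact prompts, generation settings, and output formatting in the original script differ):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"  # assumption: the model from the GPT-NeoX example usage
prompts = [
    "GPTNeoX20B is a 20B-parameter autoregressive Transformer model developed by EleutherAI.",
    "The quick brown fox jumps over",
    "Once upon a time,",
]

for dtype in (None, torch.float32, torch.float16, torch.bfloat16):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype).to("cpu")
    cycle_start = time.monotonic()
    for i, prompt in enumerate(prompts):
        prompt_start = time.monotonic()
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            model.generate(input_ids, do_sample=True, temperature=0.9, max_length=128)
        print(f"Finished processing prompt {i} - "
              f"{time.monotonic() - prompt_start:.6f} seconds elapsed in total for this prompt.")
    print(f"Finished test with torch_dtype = {dtype} - "
          f"{time.monotonic() - cycle_start:.6f} seconds elapsed for the entire test cycle using this dtype.")
```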
Excerpt of the output with just the relevant statistics:
As you can see, float16 performance scored about 7 times worse than float32 for this run, and about 14 times worse than bfloat16, with simple text generation taking almost ten minutes in float16 format. For training/gradient operations, the effect is even more of a problem. Operations that take a few minutes in the other formats can take hours in float16 format (in the case of the GPT-NeoX issue, 10+ hours for a call to forward). I don't have a good minimal test case for that, though.

This is not limited to GPT-NeoX. For example, here's the same script, but modified to use Phi-3-mini-128k instead:
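Only the model identifier changes; something along these lines (the Hugging Face id shown is an assumption, so substitute whichever Phi-3-mini-128k checkpoint you actually use):

```python
# Same benchmark loop as above; only the model identifier changes.
model_name = "microsoft/Phi-3-mini-128k-instruct"  # assumed model id
```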
Relevant output for Phi-3-mini-128k:
In this case, float16 is about 5 times slower than float32, and about 10 times slower than bfloat16 overall. There seems to be some kind of fixed overhead causing the issue, because the processing times for both Phi-3-mini-128k and GPT-NeoX in float16 form are virtually identical, even though they vary by several times in the other formats.

I assume the discrepancy is at least somewhat of a known issue to the Transformers developers, but I only discovered it myself when trying to debug a different problem. Adding a runtime warning, and maybe an explicit warning in the documentation, seems like it would be a good idea.
Expected behavior
If CPU processing is performed using a very inefficient format that is also commonly suggested as a way to reduce the memory footprint, I would expect Transformers to issue a warning.
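For illustration only, a hypothetical check along these lines could produce such a warning (this is not existing Transformers code; the helper name and message wording are made up):

```python
import logging

import torch

logger = logging.getLogger(__name__)

def warn_if_fp16_on_cpu(model: torch.nn.Module) -> None:
    # Hypothetical helper: warn when a float16 model ends up entirely on CPU,
    # since many CPUs have to round-trip through fp32 and run far slower than
    # they would with float32 or bfloat16 weights.
    params = list(model.parameters())
    if params and all(p.device.type == "cpu" for p in params) and any(
        p.dtype == torch.float16 for p in params
    ):
        logger.warning(
            "This model is loaded in torch.float16 on a CPU device. float16 on CPU is often "
            "several times slower than torch.float32 or torch.bfloat16; consider reloading "
            "the model with a different torch_dtype."
        )
```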