pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

Add bf16 kernel support #3488

Open lucylq opened 2 weeks ago

Exporting to bf16 works; kernel support is now required. Running generation with a bf16 llama3 PTE aborts on index.Tensor_out:

(.venv) (base) [lfq@devvm20128.prn0 /data/users/lfq/torchchat (lfq.export-bf16)]$ python3 torchchat.py generate llama3 --device cpu --pte-path llama3.pte --prompt "Hello my name is"
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Using device=cpu Intel Core Processor (Broadwell)
Loading model...
Time to load model: 0.11 seconds
I 00:00:00.000905 executorch:program.cpp:129] InternalConsistency verification requested but not available
E 00:00:51.744419 executorch:method.cpp:936] Overriding output data pointer allocated by memory plan is not allowed.
I 00:00:51.744460 executorch:pybindings.cpp:196] Cannot set_output_data_ptr(): this likely means the outputs were MemoryPlanned inspect the error code to know for sure, but likely this is not an issue. 0x2
F 00:00:51.747880 executorch:op_index.cpp:87] In function operator()(), assert failed (false): Unhandled dtype BFloat16 for index.Tensor_out
Aborted (core dumped)
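In eager PyTorch on CPU, aten::index.Tensor already handles BFloat16, which suggests the fix is confined to the dtype dispatch in the ExecuTorch portable kernel. A quick eager check (illustrative values, assuming a recent PyTorch build):

```python
import torch

# Same underlying op as index.Tensor_out's functional form; eager CPU
# dispatch already covers bfloat16, unlike the portable ExecuTorch kernel.
x = torch.arange(12, dtype=torch.float32).reshape(3, 4).to(torch.bfloat16)
idx = torch.tensor([2, 0])
out = torch.ops.aten.index.Tensor(x, [idx])

assert out.dtype == torch.bfloat16
assert out[0].tolist() == x[2].tolist()  # rows gathered in index order
```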