Wikipedia has a nifty table listing the platforms that currently support AVX-512 with BF16: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX-512
That is, Intel Cooper Lake & Sapphire Rapids, and AMD Zen 4 & 5.
On platforms that do not support BF16 (e.g. Ice Lake), TGI seems to still work when the BF16 type is specified, just slightly slower (presumably due to a conversion step). On Linux, support can be checked by looking for the avx512_bf16 flag in /proc/cpuinfo.
We can add a note to the docs reminding users to disable BF16 on such machines.
The model used for ChatQnA supports BFLOAT16, in addition to TGI's default 32-bit float type: https://huggingface.co/Intel/neural-chat-7b-v3-3
TGI memory usage halves from 30GB to 15GB (and also its perf increases somewhat) if one tells it to use BFLOAT16:
```diff
--- a/ChatQnA/kubernetes/manifests/tgi_service.yaml
+++ b/ChatQnA/kubernetes/manifests/tgi_service.yaml
@@ -28,6 +29,8 @@ spec:
       args:
         - --model-id
         - $(LLM_MODEL_ID)
+        - --dtype
+        - bfloat16
         #- "/data/Llama-2-7b-hf"
         # - "/data/Mistral-7B-Instruct-v0.2"
         # - --quantize
```
However, only newer Xeons support BFLOAT16. Therefore, if the user's cluster has heterogeneous nodes, the TGI service needs a node selector that schedules it on a node with BFLOAT16 support.
This can be automated by using node-feature-discovery and its CPU feature labeling: https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/features.html#cpu
It would be good to add some documentation and examples (e.g. comment lines in YAML) for this.
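For example, once NFD is running, the TGI pod spec could be pinned to BF16-capable nodes with a nodeSelector along these lines (a minimal sketch; surrounding deployment fields are elided, and the label comes from NFD's cpuid feature source):

```yaml
# Sketch: pod spec fragment scheduling TGI only on nodes where
# node-feature-discovery has detected the AVX512BF16 CPU feature.
spec:
  nodeSelector:
    feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"
```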
Hi @eero-t, the node-feature-discovery plugin can help select nodes by CPU features, by labeling each node with the features its CPU supports. However, it needs to run its own pods in the cluster.
We pushed a PR that provides a recipe for labeling nodes and setting up TGI with bfloat16, see https://github.com/opea-project/GenAIExamples/pull/795
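For reference, NFD itself can be deployed from its upstream kustomize overlay, roughly like this (a sketch; the version ref is illustrative and should be pinned to whatever release you actually want):

```yaml
# Sketch: kustomization.yaml deploying node-feature-discovery from the
# upstream overlay (the ?ref= version is illustrative).
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.16.0
```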
Example manifests are generated from the Infra project's Helm charts. Shouldn't there rather be Helm support for enabling it? See:
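For example, the chart could expose something like the following in its values.yaml (key names here are hypothetical, not the chart's actual schema):

```yaml
# Hypothetical values.yaml keys, illustrative only; not the actual
# GenAIInfra tgi chart API.
tgi:
  extraCmdArgs:
    - --dtype
    - bfloat16
  nodeSelector:
    feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"
```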
That's on our plan.
We added BF16 instructions to the Docker README.
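For example, a compose service passing the flag could look like this (a sketch; the service name, image tag, and port mapping are illustrative):

```yaml
# Sketch: docker compose service running TGI with bfloat16
# (service name and image tag are illustrative).
services:
  tgi-service:
    image: ghcr.io/huggingface/text-generation-inference:latest
    command: --model-id Intel/neural-chat-7b-v3-3 --dtype bfloat16
    ports:
      - "8080:80"  # TGI listens on port 80 inside the container
```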