opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples such as ChatQnA and Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

Document / support for using BFLOAT16 with (Xeon) TGI service #330

Closed eero-t closed 2 months ago

eero-t commented 5 months ago

The model used for ChatQnA supports BFLOAT16, in addition to TGI's default 32-bit float type: https://huggingface.co/Intel/neural-chat-7b-v3-3

TGI memory usage halves from 30GB to 15GB (and also its perf increases somewhat) if one tells it to use BFLOAT16:

--- a/ChatQnA/kubernetes/manifests/tgi_service.yaml
+++ b/ChatQnA/kubernetes/manifests/tgi_service.yaml
@@ -28,6 +29,8 @@ spec:
         args:
         - --model-id
         - $(LLM_MODEL_ID)
+        - --dtype
+        - bfloat16
         #- "/data/Llama-2-7b-hf"
         # - "/data/Mistral-7B-Instruct-v0.2"
         # - --quantize

However, only newer Xeons support BFLOAT16. Therefore, if the user's cluster has heterogeneous nodes, the TGI service needs a node selector that schedules it on a node with BFLOAT16 support.

This can be automated by using node-feature-discovery and its CPU feature labeling: https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/features.html#cpu

It would be good to add some documentation and examples (e.g. comment lines in YAML) for this.
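
For illustration, a minimal sketch of such a nodeSelector in the TGI deployment, assuming node-feature-discovery is installed and publishing its default cpu-cpuid labels (label name taken from the NFD documentation linked above):

# Sketch only: pin the TGI pod to nodes whose CPU advertises AVX512_BF16,
# using the label published by node-feature-discovery.
spec:
  template:
    spec:
      nodeSelector:
        feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"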

eero-t commented 5 months ago

Wikipedia has a nifty table listing the platforms that currently support AVX-512 with BF16: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX-512

That is: Intel Cooper Lake & Sapphire Rapids, and AMD Zen 4 & 5.

On platforms that do not support BF16 (e.g. Ice Lake), TGI still seems to work when the BF16 type is specified, just slightly slower (due to a conversion step?).

kevinintel commented 3 months ago

We can add info to the docs to remind users to disable BF16 on machines that do not support it.

lkk12014402 commented 2 months ago

Hi @eero-t, the node-feature-discovery plugin can help select a node (CPU) by labeling nodes with their CPU features, but it needs to create a pod.

We pushed a PR that provides a recipe for labeling nodes and setting up TGI with BFLOAT16, see https://github.com/opea-project/GenAIExamples/pull/795

eero-t commented 2 months ago

The example manifests are generated from the Infra project's Helm charts. Shouldn't there rather be Helm support for enabling it? See:

lianhao commented 2 months ago

That's in our plan.
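
For reference, a purely hypothetical values.yaml override sketching how such Helm support might be exposed; the key names (extraCmdArgs, nodeSelector) are assumptions, not the actual chart interface:

# Hypothetical values.yaml snippet for the tgi chart; key names are illustrative only.
extraCmdArgs:
  - --dtype
  - bfloat16
nodeSelector:
  feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"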

kevinintel commented 2 months ago

We added BF16 instructions to the Docker README.
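
As a rough sketch of what such a README note could show, assuming a Docker Compose setup (the service name and image tag below are illustrative, not taken from the actual compose files):

# Illustrative compose override: pass --dtype bfloat16 to TGI on BF16-capable Xeons.
services:
  tgi-service:
    image: ghcr.io/huggingface/text-generation-inference:2.1.0   # tag is an assumption
    command: --model-id ${LLM_MODEL_ID} --dtype bfloat16         # omit --dtype on CPUs without AVX512_BF16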