askervin opened 1 week ago
Some potentially relevant options from the TEI README (https://github.com/huggingface/text-embeddings-inference/blob/main/README.md) that could be used when TEI containers are running on CPUs:
--tokenization-workers <TOKENIZATION_WORKERS>
Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation.
Default to the number of CPU cores on the machine
[env: TOKENIZATION_WORKERS=]
...
--max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
The maximum amount of concurrent requests for this particular deployment.
Having a low limit will refuse clients requests instead of having them wait for too long and is usually good
to handle backpressure correctly
[env: MAX_CONCURRENT_REQUESTS=]
[default: 512]
Currently only the --auto-truncate option is used.
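For reference, a minimal sketch of how the other options could be set through their environment-variable forms on a running deployment (the namespace and deployment name are assumed from this thread; adjust to your cluster):

# hypothetical: set the README options via their env var forms on the TEI deployment
kubectl set env -n akervine deployment/chatqna-tei \
  TOKENIZATION_WORKERS=8 \
  MAX_CONCURRENT_REQUESTS=512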
This bug blocks a proper ChatQnA platform optimization demo on Xeon.
@eero-t, thanks for the pointers.
Setting --tokenization-workers 8 dropped the thread count from 139 to 82 on my test system (128 vCPUs), but it did not affect pinning. --max-concurrent-requests had no effect whatsoever.
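For reference, one way to check the thread count and per-thread pinning from the node (the PID of the TEI process is a placeholder, found e.g. with pgrep):

# count the threads of the TEI process
ls /proc/<pid>/task | wc -l
# print the effective CPU affinity of each thread
for tid in /proc/<pid>/task/*; do
  taskset -pc "${tid##*/}"
done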
The output of kubectl logs -n akervine chatqna-tei-... when run with limited tokenization workers looks like this:
...
2024-09-09T06:21:46.741775Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 8 tokenization workers
2024-09-09T06:21:46.773151Z INFO text_embeddings_router: router/src/lib.rs:241: Starting model backend
2024-09-09T06:21:46.786104Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 26, index: 2, mask: {3, 67, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T06:21:46.786114Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 25, index: 1, mask: {2, 66, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T06:21:46.786115Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 24, index: 0, mask: {1, 65, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
...
while the corresponding lines without the tokenization-workers limit are:
...
2024-09-09T06:32:24.471354Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 64 tokenization workers
2024-09-09T06:32:24.708141Z INFO text_embeddings_router: router/src/lib.rs:241: Starting model backend
2024-09-09T06:32:24.720606Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 79, index: 0, mask: {1, 65, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T06:32:24.720644Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 80, index: 1, mask: {2, 66, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
...
It looks like there are two pools of threads: tokenization workers and the model backend. By default both contain as many threads as there are physical CPU cores in the whole system. Each model backend thread tries to set its CPU affinity to both hyperthreads of one physical core in the system (in the output above, "mask: {1, 65}" names the two hyperthreads of the same core). Obviously only a few succeed: those that happen to request affinity to CPUs that are inside the container's allowed CPU set. The threads in the tokenization worker pool get no CPU affinity at all.
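The error code 22 (EINVAL) in the warnings is consistent with sched_setaffinity() rejecting a mask that contains none of the CPUs in the container's cgroup cpuset. A minimal sketch that reproduces this outside TEI, assuming cgroup v2 mounted at /sys/fs/cgroup and root privileges:

# hyperthread siblings of core 1, e.g. "1,65" on a machine like the one above
cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list

# create a cpuset-limited cgroup and move this shell into it
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/affinity-demo
echo 0-3 > /sys/fs/cgroup/affinity-demo/cpuset.cpus
echo $$ > /sys/fs/cgroup/affinity-demo/cgroup.procs

taskset -pc 2 $$    # CPU inside the allowed set: succeeds
taskset -pc 64 $$   # CPU outside cpuset.cpus: fails with "Invalid argument" (EINVAL)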
@yinghu5, @yongfengdu, I think this issue is not limited to Kubernetes. The same problem is expected when using docker with --cpuset-cpus.
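For instance, the same warnings should show up with something like this (the model id is only an example):

# hypothetical docker repro: restrict the container to 8 CPUs on a large machine
# and watch the logs for pthread_setaffinity_np warnings
docker run --cpuset-cpus 0-7 ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 \
  --model-id BAAI/bge-base-en-v1.5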
Opened two more precisely targeted bug reports against text-embeddings-inference, because the issue above has been written as a feature request. This is a bug that, for comparison, does not exist in text-generation-inference.
Links to issues:
Tokenizer threads: https://github.com/huggingface/text-embeddings-inference/issues/404
Model backend threads: https://github.com/huggingface/text-embeddings-inference/issues/405
Did you find any env/parameter settings that can work around this? If there is a workaround, we can implement it in the Helm chart before the fixes land upstream. This is what was mentioned in that issue, but I'm not sure if it works:
Priority
P2-High
OS type
Ubuntu
Hardware type
Xeon-SPR
Installation method
Deploy method
Running nodes
Single Node
What's the version?
Observed with the latest chatqna.yaml (git 67394b88), where the tei and teirerank containers use the image:
ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
Description
When managing CPU affinity on a node (with NRI resource policies or the Kubernetes CPU manager) and creating ChatQnA/kubernetes/manifests/xeon/chatqna.yaml, the tei and teirerank containers do not handle their internal threading and thread-CPU affinities properly.
They seem to create a thread for every CPU in the system, whereas they should create a thread only for every CPU allowed for the container.
In the logs it looks like this:
And at the level of the system's per-thread CPU affinities it looks like this:
That is, only a few threads got correct CPU pinning; the rest (far too many of them) run on all CPUs allowed for the container. This destroys the performance of tei and teirerank on CPU.
From the log it looks like the ort library tries to create a thread and set its affinity for every CPU in the system, while it should not try to use any CPUs other than the allowed ones (limited by cgroups cpuset.cpus). I cannot say whether the root cause is in the ort library itself or in how it is used here.
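A quick way to see the mismatch from inside the container (the cgroup path assumes cgroup v2):

nproc --all                               # CPUs in the whole system
grep Cpus_allowed_list /proc/self/status  # CPUs actually allowed for this process
cat /sys/fs/cgroup/cpuset.cpus.effective  # the same, from the cgroup's point of view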
Reproduce steps
1. Install the balloons NRI policy to manage CPUs.
2. Replace the default balloons configuration with one that runs tei/teirerank on dedicated CPUs.
3. Deploy the chatqna yaml.
4. Follow the logs from chatqna-tei and chatqna-teirerank, as sketched below.
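A sketch of step 4, assuming the deployment names used earlier in this thread:

kubectl logs -n akervine -f deploy/chatqna-tei
kubectl logs -n akervine -f deploy/chatqna-teirerank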
Raw log
No response