opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples such as ChatQnA, Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

image opea/tei-gaudi:v0.7 doesn't work on latest gaudi sw stack v1.16.2 #426

Closed lianhao closed 2 months ago

lianhao commented 3 months ago

Using the image opea/tei-gaudi:v0.7 on a Gaudi-enabled k8s cluster doesn't work. When the pod is launched, it fails during startup with the following error:

Traceback (most recent call last):
  File "/usr/local/bin/python-text-embeddings-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 716, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/src/backends/python/server/text_embeddings_server/cli.py", line 50, in serve
    server.serve(model_path, dtype, uds_path)
  File "/usr/src/backends/python/server/text_embeddings_server/server.py", line 79, in serve
    asyncio.run(serve_inner(model_path, dtype))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/src/backends/python/server/text_embeddings_server/server.py", line 48, in serve_inner
    model = get_model(model_path, dtype)
  File "/usr/src/backends/python/server/text_embeddings_server/models/__init__.py", line 66, in get_model
    return DefaultModel(model_path, device, dtype)
  File "/usr/src/backends/python/server/text_embeddings_server/models/default_model.py", line 23, in __init__
    model = AutoModel.from_pretrained(model_path).to(dtype).to(device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2556, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 173, in wrapped_to
    result = self.original_to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1155, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1153, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in __torch_function__
    return super().__torch_function__(func, types, new_args, kwargs)
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

We're using the model `BAAI/bge-base-en-v1.5`.

However, if I manually build the opea/tei-gaudi image from the tei-gaudi tag `synapse_1.16`, it seems to work.
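The manual build can be sketched roughly as follows. Treat this as a sketch, not the exact recipe: the `Dockerfile-hpu` filename and the local image tag are assumptions based on the tei-gaudi repo layout, so check its README before running.

```shell
# Sketch: build a local tei-gaudi image from the synapse_1.16 tag.
# Dockerfile-hpu and the output tag below are assumptions; verify
# against the tei-gaudi repository before use.
git clone --branch synapse_1.16 https://github.com/huggingface/tei-gaudi.git
cd tei-gaudi
docker build -f Dockerfile-hpu -t opea/tei-gaudi:synapse_1.16 .
```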

We should release a new opea/tei-gaudi image.
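The likely root cause is that the SynapseAI userspace libraries baked into the v0.7 image predate the 1.16.x host driver, and the two must agree on major.minor. A minimal sketch of that compatibility rule (the helper names here are hypothetical, not part of any OPEA or Habana API):

```python
import re

def synapse_major_minor(version: str):
    """Extract the major.minor pair from a version string,
    e.g. '1.16.2-f195ec4' -> ('1', '16')."""
    m = re.match(r"(\d+)\.(\d+)", version)
    return m.groups() if m else None

def synapse_compatible(container_version: str, driver_version: str) -> bool:
    """Userspace libraries in the container must match the host driver
    on major.minor; a mismatch typically surfaces as
    'synStatus=26 ... Device acquire failed' at startup."""
    cv = synapse_major_minor(container_version)
    return cv is not None and cv == synapse_major_minor(driver_version)

print(synapse_compatible("1.15.1", "1.16.2-f195ec4"))  # False: old image on new driver
print(synapse_compatible("1.16.0", "1.16.2-f195ec4"))  # True: rebuilt from synapse_1.16
```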

My test environment is:

Host environment: Ubuntu 22.04 with kernel 5.15.0-92-generic
K8S version: v1.29.5
containerd version: 1.7.19
Gaudi SW stack:

$ sudo hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version:                           hl-1.16.2-rc-fw-50.1.2.0          |
| Driver Version:                                     1.16.2-f195ec4          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:1a:00.0     N/A |                   0  |
| N/A   31C   N/A    83W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   28C   N/A    90W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   28C   N/A    86W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:cc:00.0     N/A |                   0  |
| N/A   30C   N/A    87W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
| N/A   28C   N/A    82W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:cd:00.0     N/A |                   0  |
| N/A   29C   N/A    97W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:43:00.0     N/A |                   0  |
| N/A   28C   N/A    84W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:44:00.0     N/A |                   0  |
| N/A   27C   N/A    74W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+
srinarayan-srikanthan commented 2 months ago

This error generally means it is unable to recognize the devices, and from the hl-smi output it seems no process is running. Can you try the latest image, please? Alternatively, can you provide the steps you are following so I can reproduce the issue?

lianhao commented 2 months ago

The recently released upstream image seems to work on Gaudi SW v1.16.x. Maybe we should advise users to use the upstream image instead of opea/tei-gaudi? I believe we created opea/tei-gaudi only because no upstream image had been published at the time.
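Pointing users at the upstream image would look roughly like the sketch below. The image tag and runtime flags are assumptions based on the usual Hugging Face Gaudi container instructions; verify them against the tei-gaudi README before documenting.

```shell
# Sketch: run the upstream tei-gaudi image directly on a Gaudi host.
# The :latest tag and the flags below are assumptions; check the
# tei-gaudi README for the exact invocation.
docker run -p 8080:80 --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --ipc=host \
  ghcr.io/huggingface/tei-gaudi:latest \
  --model-id BAAI/bge-base-en-v1.5
```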

srinarayan-srikanthan commented 2 months ago

Yes, I also ran it with the latest image on 1.16. That image is specifically for the 1.16.x release: https://github.com/huggingface/tei-gaudi/pkgs/container/tei-gaudi/241185933?tag=synapse_1.16. We could perhaps also include validated configurations for the examples.