meta-llama / llama-stack-apps

Agentic components of the Llama Stack APIs

Error: Failed to initialize the TMA descriptor 801 #30

Open · anret opened this issue 1 month ago

anret commented 1 month ago

Good day everyone, I am trying to run the llama agentic system on an RTX 4090 with FP8 quantization for the inference model and meta-llama/Llama-Guard-3-8B-INT8 for Llama Guard. With a sufficiently small max_seq_len everything fits into 24 GB of VRAM and I can start the inference server and the chat app. However, as soon as I send a message in the chat I get the following error: "Error: Failed to initialize the TMA descriptor 801".
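
For reference, here is my rough back-of-the-envelope VRAM budget (my own estimate, not taken from the logs), which is why max_seq_len has to stay small:

# Rough VRAM budget for the setup above (estimates only; real usage differs).
params_per_model = 8e9                          # both models are ~8B parameters
inference_fp8_gb = params_per_model * 1 / 1e9   # FP8  -> ~1 byte per weight
guard_int8_gb    = params_per_model * 1 / 1e9   # INT8 -> ~1 byte per weight
weights_gb = inference_fp8_gb + guard_int8_gb   # ~16 GB just for weights
print(f"~{weights_gb:.0f} GB of weights, ~{24 - weights_gb:.0f} GB left for "
      "KV cache, activations and the CUDA context on a 24 GB card")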

(venv) trainer@pc-aiml:~/.llama$ llama inference start --disable-ipv6
/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/utils.py:43: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  initialize(config_path=relative_path)
Loading config from : /home/trainer/.llama/configs/inference.yaml
Yaml config:
------------------------
inference_config:
  impl_config:
    impl_type: inline
    checkpoint_config:
      checkpoint:
        checkpoint_type: pytorch
        checkpoint_dir: /home/trainer/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original
        tokenizer_path: /home/trainer/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model
        model_parallel_size: 1
        quantization_format: bf16
    quantization:
      type: fp8
    torch_seed: null
    max_seq_len: 2048
    max_batch_size: 1

------------------------
Listening on 0.0.0.0:5000
INFO:     Started server process [20033]
INFO:     Waiting for application startup.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/__init__.py:955: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:432.)
  _C._set_default_tensor_type(t)
Using efficient FP8 operators in FBGEMM.
Quantizing fp8 weights from bf16...
Loaded in 7.05 seconds
Finished model load YES READY
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO:     127.0.0.1:55838 - "POST /inference/chat_completion HTTP/1.1" 200 OK
TMA Desc Addr:   0x7ffdd6221440
format         0
dim            3
gmem_address   0x7eb74f4bde00
globalDim      (4096,53,1,1,1)
globalStrides  (1,4096,0,0,0)
boxDim         (128,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 801
TMA Desc Addr:   0x7ffdd6221440
format         0
dim            3
gmem_address   0x7eb3ea000000
globalDim      (4096,14336,1,1,1)
globalStrides  (1,4096,0,0,0)
boxDim         (128,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 801
TMA Desc Addr:   0x7ffdd6221440
format         9
dim            3
gmem_address   0x7eb3e9c00000
globalDim      (14336,53,1,1,1)
globalStrides  (2,28672,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 801
TMA Desc Addr:   0x7ffdd6221440
format         9
dim            3
gmem_address   0x7eb3e9c00000
globalDim      (14336,53,1,1,1)
globalStrides  (2,28672,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 801
[debug] got exception cutlass cannot initialize
Traceback (most recent call last):
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 80, in retrieve_requests
    for obj in out:
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/generation.py", line 287, in chat_completion
    yield from self.generate(
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
    response = gen.send(None)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/generation.py", line 205, in generate
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_models/llama3_1/api/model.py", line 321, in forward
    h = layer(h, start_pos, freqs_cis, mask)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_models/llama3_1/api/model.py", line 268, in forward
    out = h + self.feed_forward(self.ffn_norm(h))
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/loader.py", line 43, in swiglu_wrapper
    out = ffn_swiglu(x, self.w1.weight, self.w3.weight, self.w2.weight)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/fp8_impls.py", line 62, in ffn_swiglu
    return ffn_swiglu_fp8_dynamic(
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/fp8_impls.py", line 165, in ffn_swiglu_fp8_dynamic
    x1 = fc_fp8_dynamic(
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/fp8_impls.py", line 146, in fc_fp8_dynamic
    y = torch.ops.fbgemm.f8f8bf16_rowwise(
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
RuntimeError: cutlass cannot initialize
[debug] got exception cutlass cannot initialize
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7ebc8462f0d0

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
  |     return await self.app(scope, receive, send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
  |     raise exc
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
  |     raise exc
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
  |     await route.handle(scope, receive, send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
  |     await self.app(scope, receive, send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
  |     raise exc
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 75, in app
  |     await response(scope, receive, send)
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 258, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
    |     await func()
    |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 250, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/server.py", line 84, in sse_generator
    |     async for event in event_gen:
    |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/server.py", line 94, in event_gen
    |     async for event in InferenceApiInstance.chat_completion(exec_request):
    |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/inference.py", line 58, in chat_completion
    |     for token_result in self.generator.chat_completion(
    |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/model_parallel.py", line 104, in chat_completion
    |     yield from gen
    |   File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 255, in run_inference
    |     raise obj
    | RuntimeError: cutlass cannot initialize
    +------------------------------------
^CW0729 16:50:42.785000 139352429928448 torch/distributed/elastic/agent/server/api.py:688] Received Signals.SIGINT death signal, shutting down workers
W0729 16:50:42.785000 139352429928448 torch/distributed/elastic/multiprocessing/api.py:734] Closing process 20066 via signal SIGINT
Exception ignored in: <function Context.__del__ at 0x7ebc85d2e950>
Traceback (most recent call last):
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/zmq/sugar/context.py", line 142, in __del__
    self.destroy()
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/zmq/sugar/context.py", line 324, in destroy
    self.term()
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/zmq/sugar/context.py", line 266, in term
    super().term()
  File "_zmq.py", line 545, in zmq.backend.cython._zmq.Context.term
  File "_zmq.py", line 141, in zmq.backend.cython._zmq._check_rc
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 20066 got signal: 2
INFO:     Shutting down
Process ForkProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 175, in launch_dist_group
    elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 835, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 20064 got signal: 2
INFO:     Waiting for application shutdown.
shutting down
INFO:     Application shutdown complete.
INFO:     Finished server process [20033]
SIGINT or CTRL-C detected. Exiting gracefully (2, <frame at 0x7ebc8644bc40, file '/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/server.py', line 328, code capture_signals>)
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/trainer/.llama/venv/bin/llama", line 8, in <module>
    sys.exit(main())
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/cli/llama.py", line 54, in main
    parser.run(args)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/cli/llama.py", line 48, in run
    args.func(args)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/cli/inference/start.py", line 53, in _run_inference_start_cmd
    inference_server_init(
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/server.py", line 115, in main
    uvicorn.run(app, host=listen_host, port=port)
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/main.py", line 577, in run
    server.run()
  File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/server.py", line 65, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/usr/lib/python3.10/asyncio/runners.py", line 48, in run
    loop.run_until_complete(loop.shutdown_asyncgens())
  File "uvloop/loop.pyx", line 1515, in uvloop.loop.Loop.run_until_complete
RuntimeError: Event loop stopped before Future completed.
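
The traceback points at torch.ops.fbgemm.f8f8bf16_rowwise inside fc_fp8_dynamic. If it helps, here is a minimal standalone sketch I would use to check whether that kernel runs at all on this card; the shapes are taken from the TMA dump above, while the import path and argument order are my assumptions and may differ across fbgemm_gpu versions:

import torch
import fbgemm_gpu.experimental.gen_ai  # registers torch.ops.fbgemm.* (module path may vary by version)

# Shapes from the TMA dumps above: K=4096, N=14336, M=53 (one FFN projection).
M, N, K = 53, 14336, 4096
x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
w = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)

xq = x.to(torch.float8_e4m3fn)  # naive cast with unit scales, just to exercise the kernel
wq = w.to(torch.float8_e4m3fn)
x_scale = torch.ones(M, device="cuda", dtype=torch.float32)
w_scale = torch.ones(N, device="cuda", dtype=torch.float32)

# Argument order (XQ, WQ, x_scale, w_scale) is assumed from the fc_fp8_dynamic call site.
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)
print(y.shape, y.dtype)  # expecting (53, 14336) in bfloat16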

I would appreciate any help or suggestions. Thank you in advance.

huanpengchu commented 1 month ago

I have the same problem

ashwinb commented 1 month ago

This tends to be a symptom of things going OOM (out of memory) when you make the request. To isolate it, can you disable safety first, run inference only, and see if that succeeds?
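
If it helps with checking the OOM theory, one option is to watch free VRAM from a separate Python shell while you send the chat message (just a sketch):

import time
import torch

# torch.cuda.mem_get_info() reports free/total bytes for the whole device,
# so it also reflects what the inference server process is holding.
while True:
    free, total = torch.cuda.mem_get_info()
    print(f"free {free / 1e9:5.2f} GB / total {total / 1e9:5.2f} GB")
    time.sleep(1)

Watching nvidia-smi works just as well if you'd rather not start another CUDA context on the card.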

Also, note that the agentic-system and toolchain interfaces / contracts have now changed. Please take a look at the updates (specifically, using the llama distribution start CLI command, etc.).