meta-llama / llama-stack

Model components of the Llama Stack APIs
MIT License

I am puzzled as to why the stack needs to connect to the address [::ffff:0.0.2.208] #194

Open Itime-ren opened 1 week ago

Itime-ren commented 1 week ago

I downloaded the 1B model from Hugging Face and encountered an error while running it. The configuration process follows; I am puzzled as to why it needs to connect to the address [::ffff:0.0.2.208]:48461.

(llamastack-stack-3.2-1B) root@720:~/.llama/checkpoints# llama stack configure  stack-3.2-1B
Could not find stack-3.2-1B. Trying conda build name instead...
Configuring API `inference`...
=== Configuring provider `meta-reference` for API inference...
Enter value for model (default: Llama3.1-8B-Instruct) (required): Llama3.2-1B-Instruct
Do you want to configure quantization? (y/n): n
Enter value for torch_seed (optional):
Enter value for max_seq_len (default: 4096) (required):
Enter value for max_batch_size (default: 1) (required):

Configuring API `safety`...
=== Configuring provider `meta-reference` for API safety...
Do you want to configure llama_guard_shield? (y/n): n
Enter value for enable_prompt_guard (default: False) (optional):

Configuring API `agents`...
=== Configuring provider `meta-reference` for API agents...
Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite):

Configuring SqliteKVStoreConfig:
Enter value for namespace (optional):
Enter value for db_path (default: /root/.llama/runtime/kvstore.db) (required):

Configuring API `memory`...
=== Configuring provider `meta-reference` for API memory...
> Please enter the supported memory bank type your provider has for memory: vector

Configuring API `telemetry`...
=== Configuring provider `meta-reference` for API telemetry...

> YAML configuration has been written to `/root/.llama/builds/conda/stack-3.2-1B-run.yaml`.
You can now run `llama stack run stack-3.2-1B --port PORT`

(llamastack-stack-3.2-1B) root@720:~/.llama/checkpoints# llama stack run stack-3.2-1B --port 5000 --disable-ipv6
Resolved 8 providers in topological order
  Api.models: routing_table
  Api.inference: router
  Api.shields: routing_table
  Api.safety: router
  Api.memory_banks: routing_table
  Api.memory: router
  Api.agents: meta-reference
  Api.telemetry: meta-reference

E1006 02:22:05.730000 139705575060544 torch/distributed/elastic/multiprocessing/api.py:702]     store, rank, world_size = next(rendezvous_iterator)
E1006 02:22:05.730000 139705575060544 torch/distributed/elastic/multiprocessing/api.py:702]   File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
E1006 02:22:05.730000 139705575060544 torch/distributed/elastic/multiprocessing/api.py:702]     store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
E1006 02:22:05.730000 139705575060544 torch/distributed/elastic/multiprocessing/api.py:702]   File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
E1006 02:22:05.730000 139705575060544 torch/distributed/elastic/multiprocessing/api.py:702]     return TCPStore(
E1006 02:22:05.730000 139705575060544 torch/distributed/elastic/multiprocessing/api.py:702] torch.distributed.DistNetworkError: The client socket has failed to connect to any network address of (720, 48461). The client socket has failed to connect to 0.0.2.208:48461 (errno: 110 - Connection timed out).
E1006 02:22:05.730000 139705575060544 torch/distributed/elastic/multiprocessing/api.py:702]
Process ForkProcess-1:
Traceback (most recent call last):
  File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/parallel_utils.py", line 175, in launch_dist_group
    elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
  File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
worker_process_entrypoint FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-06_02:22:05
  host      : 720
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3509)
  error_file: /tmp/torchelastic_silhk6ot/39495240-fe63-4883-ba91-9be482efe3ba_hewowew5/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
      return f(*args, **kwargs)
    File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/parallel_utils.py", line 131, in worker_process_entrypoint
      model = init_model_cb()
    File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/model_parallel.py", line 50, in init_model_cb
      llama = Llama.build(config)
    File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/generation.py", line 90, in build
      torch.distributed.init_process_group("nccl")
    File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
      return func(*args, **kwargs)
    File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
      func_return = func(*args, **kwargs)
    File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
      store, rank, world_size = next(rendezvous_iterator)
    File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
      store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
    File "/home/miniconda3/envs/llamastack-stack-3.2-1B/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
      return TCPStore(
  torch.distributed.DistNetworkError: The client socket has failed to connect to any network address of (720, 48461). The client socket has failed to connect to 0.0.2.208:48461 (errno: 110 - Connection timed out).
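A plausible source for the odd address, given that the failing machine is literally named 720 (see "host : 720" in the failure report above): resolvers that accept the legacy numbers-and-dots notation parse an all-digit hostname as a single 32-bit IPv4 value, and 720 = 0x000002D0 = 0.0.2.208; [::ffff:0.0.2.208] is just the IPv4-mapped IPv6 form of that address. A minimal check outside llama-stack (a sketch, assuming the hostname really is 720):

# Resolve the bare name "720" the way gethostbyname/inet_aton does;
# a single number is read as a 32-bit IPv4 address: 720 -> 0.0.2.208.
python3 -c "import socket; print(socket.gethostbyname('720'))"
# expected: 0.0.2.208

If that reproduces, torch.distributed's rendezvous is simply resolving the machine's hostname as a numeric address, which would explain the connection timeout.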
russellb commented 1 week ago

@Itime-ren Try adding --disable-ipv6 to your llama stack run command.

Itime-ren commented 1 week ago

(llamastack-stack-3.2-1B) root@720:~/.llama/checkpoints# llama stack run stack-3.2-1B --port 5000 --disable-ipv6

It doesn't work; there is no difference.
raghotham commented 1 week ago

Can you share the full run.yaml? Seems to be in this location /root/.llama/builds/conda/stack-3.2-1B-run.yaml

Itime-ren commented 1 week ago

> Can you share the full run.yaml? Seems to be in this location /root/.llama/builds/conda/stack-3.2-1B-run.yaml

I have read the source code and saw that the default port is 5000. I ran "llama stack run stack-3.2-1B --port 5000 --disable-ipv6". Port 5000 was not occupied on my system, but a port was still assigned dynamically and IPv6 was still enabled by default. The detailed system output is as described above.
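Worth noting, as a reading of the traceback rather than a confirmed fact: the port in the error, 48461, does not look like the API server port at all, but like the rendezvous port torchelastic picks for the model-parallel worker group, so --port 5000 would not affect it. One hypothetical way to tell the two listeners apart while the server starts:

# List listening TCP sockets with owning processes; the llama-stack API
# should sit on 5000, while the rendezvous store uses an ephemeral port.
ss -tlnp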

/root/.llama/builds/conda/stack-3.2-1B-run.yaml is as follows:

version: v1
built_at: '2024-10-06T02:16:39.007013'
image_name: stack-3.2-1B
docker_image: null
conda_env: stack-3.2-1B
apis_to_serve:
- memory
- inference
- safety
- models
- agents
- memory_banks
- shields
api_providers:
  inference:
    providers:
    - meta-reference
  safety:
    providers:
    - meta-reference
  agents:
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: /root/.llama/runtime/kvstore.db
  memory:
    providers:
    - meta-reference
  telemetry:
    provider_type: meta-reference
    config: {}
routing_table:
  inference:
  - provider_type: meta-reference
    config:
      model: Llama3.2-1B-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
    routing_key: Llama3.2-1B-Instruct
  safety:
  - provider_type: meta-reference
    config:
      llama_guard_shield: null
      enable_prompt_guard: false
    routing_key:
    - llama_guard
    - code_scanner_guard
    - injection_shield
    - jailbreak_shield
  memory:
  - provider_type: meta-reference
    config: {}
    routing_key: vector
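If the numeric hostname is indeed what torch.distributed is resolving, one possible workaround (a sketch, untested; llama-dev is a placeholder name) is to make 720 resolve to a reachable address, or to rename the host so it can no longer be parsed as a number:

# Hypothetical workaround: pin the all-digit hostname to loopback so the
# rendezvous TCPStore connects locally instead of to 0.0.2.208 ...
echo "127.0.0.1 720" >> /etc/hosts
# ... or give the machine a non-numeric hostname:
hostnamectl set-hostname llama-dev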