meta-llama / llama-stack

Model components of the Llama Stack APIs

llama stack run failed with AssertionError: Could not find checkpoint dir #125

Closed - kun432 closed this 3 days ago

kun432 commented 3 days ago

Using pyenv + venv + Docker, llama stack run failed, and it seems it cannot find the model directory:

$ llama stack run my-local-stack
+ '[' -n '' ']'
+ '[' -z '' ']'
+ docker run -it -p 5000:5000 -v /home/kun432/.llama/builds/docker/my-local-stack-run.yaml:/app/config.yaml llamastack-my-local-stack python -m llama_stack.distribution.server.server --yaml_config /app/config.yaml --port 5000
router_api Api.inference
router_api Api.safety
router_api Api.memory
Resolved 8 providers in topological order
  Api.models: routing_table
  Api.inference: router
  Api.shields: routing_table
  Api.safety: router
  Api.memory_banks: routing_table
  Api.memory: router
  Api.agents: meta-reference
  Api.telemetry: meta-reference

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 507, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 428, in main
    impls, specs = asyncio.run(resolve_impls_with_routing(config))
  File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 406, in resolve_impls_with_routing
    impl = await instantiate_provider(spec, deps, configs[api])
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/utils/dynamic.py", line 53, in instantiate_provider
    impl = await instantiate_provider(
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/utils/dynamic.py", line 71, in instantiate_provider
    impl = await fn(*args)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/__init__.py", line 18, in get_provider_impl
    await impl.initialize()
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/inference.py", line 49, in initialize
    self.generator = LlamaModelParallelGenerator(self.config)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/model_parallel.py", line 70, in __init__
    checkpoint_dir = model_checkpoint_dir(self.model)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/generation.py", line 54, in model_checkpoint_dir
    assert checkpoint_dir.exists(), (
AssertionError: Could not find checkpoint dir: /root/.llama/checkpoints/Llama3.1-8B-Instruct/original.Please download model using `llama download Llama3.1-8B-Instruct`
++ error_handler 55
++ echo 'Error occurred in script at line: 55'
Error occurred in script at line: 55
++ exit 1

The models have already been downloaded like this:

$ llama download --source huggingface --model-id Llama3.1-8B-Instruct --hf-token XXXXXXXXXX
$ llama download --source huggingface --model-id Llama-Guard-3-8B --hf-token XXXXXXXXXX
$ llama download --source huggingface --model-id Prompt-Guard-86M --hf-token XXXXXXXXXX
$ ls  ~/.llama/checkpoints
Llama-Guard-3-8B  Llama3.1-8B-Instruct  Prompt-Guard-86M

The error message seems to come from inside Docker, and I guess the checkpoint dir cannot be found from inside the container. Did I miss something?
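
One way to confirm this would be to compare the path on the host with the same path inside the container (assuming the image lets you run an arbitrary command in place of the server invocation):

$ ls ~/.llama/checkpoints/Llama3.1-8B-Instruct
$ docker run --rm llamastack-my-local-stack ls /root/.llama/checkpoints/Llama3.1-8B-Instruct

The first should succeed while the second should fail, since nothing mounts the host's ~/.llama into the container.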

yanxi0830 commented 3 days ago

This is because you need to mount the checkpoint directory when spinning up Docker. Setting $LLAMA_CHECKPOINT_DIR and then running llama stack run should work.

export LLAMA_CHECKPOINT_DIR=~/.llama

This will mount the checkpoint directory when spinning up the Docker container with the command:

docker run -it -p 5000:5000 -v $LLAMA_CHECKPOINT_DIR:/root/.llama -v /home/kun432/.llama/builds/docker/my-local-stack-run.yaml:/app/config.yaml llamastack-my-local-stack python -m llama_stack.distribution.server.server --yaml_config /app/config.yaml --port 5000
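
To double-check that the mount works, you could list the checkpoints from inside the container before starting the server (again assuming the image accepts an arbitrary command):

$ export LLAMA_CHECKPOINT_DIR=~/.llama
$ docker run --rm -v $LLAMA_CHECKPOINT_DIR:/root/.llama llamastack-my-local-stack ls /root/.llama/checkpoints

This should list the same Llama-Guard-3-8B, Llama3.1-8B-Instruct and Prompt-Guard-86M directories as on the host.
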
kun432 commented 3 days ago

thanks! it works.

Also, this should be in the documentation.