predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

Quickstart example not working #489

Open jmorenobl opened 4 months ago

jmorenobl commented 4 months ago

System Info

Pre-built Docker image on g4dn.xlarge with Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.2.0 (Amazon Linux 2)

Reproduction

Execute the example on the home page:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:main --model-id $model

Expected behavior

A server starts successfully, serving Mistral.
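
For reference, once the server is up it should answer generation requests on port 8080. A minimal smoke test, adapted from the quickstart in the README (the prompt here is illustrative):

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "[INST] What is LoRAX? [/INST]", "parameters": {"max_new_tokens": 64}}' \
    -H 'Content-Type: application/json'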

rusenask commented 4 months ago

What's the error you're seeing? And the logs?

jmorenobl commented 4 months ago

Just by executing this:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:main --model-id $model

I get the following error:

2024-05-27T11:40:55.235184Z  INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", default_adapter_source: None, adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, embedding_model: None, num_shard: None, quantize: None, compile: false, speculative_tokens: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "252cfb445bd6", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-05-27T11:40:55.235284Z  INFO download: lorax_launcher: Starting download process.
2024-05-27T11:40:58.738448Z ERROR download: lorax_launcher: Download encountered an error: 
Traceback (most recent call last):

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()

  File "/opt/conda/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)

requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 124, in download_weights
    _download_weights(model_id, revision, extension, auto_convert, source, api_token)

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/weights.py", line 447, in download_weights
    model_source.weight_files()

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/sources/hub.py", line 179, in weight_files
    return weight_files(self.model_id, self.revision, extension, self.api_token)

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/sources/hub.py", line 69, in weight_files
    filenames = weight_hub_files(model_id, revision, extension, api_token)

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/sources/hub.py", line 34, in weight_hub_files
    info = api.model_info(model_id, revision=revision)

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1922, in model_info
    hf_raise_for_status(r)

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
    raise GatedRepoError(message, response) from e

huggingface_hub.utils._errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-6654714a-7fc4308c45580b7328298ca4;876ae9f0-b94e-4384-a4a9-fd3139261aa7)

Cannot access gated repo for url https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.1.
Access to model mistralai/Mistral-7B-Instruct-v0.1 is restricted. You must be authenticated to access it.

Error: DownloadError
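
The 401 makes sense: the launcher is making an anonymous request against a gated repo. One way to verify access outside the container is to hit the same Hub API endpoint from the traceback directly (a sketch, not a LoRAX command):

curl -s -o /dev/null -w "%{http_code}\n" \
    -H "Authorization: Bearer $HUGGING_FACE_HUB_TOKEN" \
    https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.1
# 200 = access granted; 401 = unauthenticated; 403 = token valid but access not granted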

If I add my token and execute it this way:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v $volume:/data \
    ghcr.io/predibase/lorax:main --model-id $model

I get the following error:

2024-05-27T11:45:16.081927Z  INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", default_adapter_source: None, adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, embedding_model: None, num_shard: None, quantize: None, compile: false, speculative_tokens: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "8b4a73dd40ee", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-05-27T11:45:16.082040Z  INFO download: lorax_launcher: Starting download process.
Error: DownloadError
2024-05-27T11:45:18.784725Z ERROR download: lorax_launcher: Download encountered an error: 
Traceback (most recent call last):

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()

  File "/opt/conda/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 124, in download_weights
    _download_weights(model_id, revision, extension, auto_convert, source, api_token)

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/weights.py", line 447, in download_weights
    model_source.weight_files()

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/sources/hub.py", line 179, in weight_files
    return weight_files(self.model_id, self.revision, extension, self.api_token)

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/sources/hub.py", line 69, in weight_files
    filenames = weight_hub_files(model_id, revision, extension, api_token)

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/sources/hub.py", line 34, in weight_hub_files
    info = api.model_info(model_id, revision=revision)

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1922, in model_info
    hf_raise_for_status(r)

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
    raise GatedRepoError(message, response) from e

huggingface_hub.utils._errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-6654724e-68506e57290cd75960e9177c;090b4284-8d6f-441f-916f-2fa3e5ab57c3)

Cannot access gated repo for url https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.1.
Access to model mistralai/Mistral-7B-Instruct-v0.1 is restricted and you are not in the authorized list. Visit https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 to ask for access.

That's fine, since I don't have access to that model, but I would use a different model for the quickstart example. I tried Phi-3, since it's not a gated model, but it didn't work either:

$ export model=microsoft/Phi-3-small-8k-instruct
$ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:main --model-id $model
2024-05-27T11:52:11.862867Z  INFO lorax_launcher: Args { model_id: "microsoft/Phi-3-small-8k-instruct", adapter_id: None, source: "hub", default_adapter_source: None, adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, embedding_model: None, num_shard: None, quantize: None, compile: false, speculative_tokens: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "99cd7f793e22", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-05-27T11:52:11.862989Z  INFO download: lorax_launcher: Starting download process.
2024-05-27T11:52:14.407194Z  INFO lorax_launcher: hub.py:121 Download file: model-00001-of-00004.safetensors

2024-05-27T11:52:31.144230Z  INFO lorax_launcher: hub.py:130 Downloaded /data/models--microsoft--Phi-3-small-8k-instruct/snapshots/1adb635233ffce9e13385862a4111606d4382762/model-00001-of-00004.safetensors in 0:00:16.

2024-05-27T11:52:31.144324Z  INFO lorax_launcher: hub.py:150 Download: [1/4] -- ETA: 0:00:48

2024-05-27T11:52:31.156145Z  INFO lorax_launcher: hub.py:121 Download file: model-00002-of-00004.safetensors

2024-05-27T11:53:19.520065Z  INFO lorax_launcher: hub.py:130 Downloaded /data/models--microsoft--Phi-3-small-8k-instruct/snapshots/1adb635233ffce9e13385862a4111606d4382762/model-00002-of-00004.safetensors in 0:00:48.

2024-05-27T11:53:19.520245Z  INFO lorax_launcher: hub.py:150 Download: [2/4] -- ETA: 0:01:05

2024-05-27T11:53:19.520607Z  INFO lorax_launcher: hub.py:121 Download file: model-00003-of-00004.safetensors

2024-05-27T11:54:38.927900Z  INFO lorax_launcher: hub.py:130 Downloaded /data/models--microsoft--Phi-3-small-8k-instruct/snapshots/1adb635233ffce9e13385862a4111606d4382762/model-00003-of-00004.safetensors in 0:01:19.

2024-05-27T11:54:38.928020Z  INFO lorax_launcher: hub.py:150 Download: [3/4] -- ETA: 0:00:48

2024-05-27T11:54:38.928307Z  INFO lorax_launcher: hub.py:121 Download file: model-00004-of-00004.safetensors

2024-05-27T11:54:46.533325Z  INFO lorax_launcher: hub.py:130 Downloaded /data/models--microsoft--Phi-3-small-8k-instruct/snapshots/1adb635233ffce9e13385862a4111606d4382762/model-00004-of-00004.safetensors in 0:00:07.

2024-05-27T11:54:46.533440Z  INFO lorax_launcher: hub.py:150 Download: [4/4] -- ETA: 0

2024-05-27T11:54:47.004372Z  INFO download: lorax_launcher: Successfully downloaded weights.
2024-05-27T11:54:47.004674Z  INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-05-27T11:54:52.749530Z ERROR lorax_launcher: server.py:265 Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 318, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 252, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 328, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type phi3small

2024-05-27T11:54:53.511126Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 318, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 252, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 328, in get_model
    raise ValueError(f"Unsupported model type {model_type}")

ValueError: Unsupported model type phi3small
 rank=0
2024-05-27T11:54:53.609061Z ERROR lorax_launcher: Shard 0 failed to start
2024-05-27T11:54:53.609080Z  INFO lorax_launcher: Shutting down shards
Error: ShardCannotStart
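
So the weights download fine, but get_model() in lorax_server/models/__init__.py (per the traceback) rejects the phi3small architecture. Before launching, one can check a model's architecture by reading model_type from its config.json on the Hub (a sketch, not a LoRAX feature):

curl -s https://huggingface.co/microsoft/Phi-3-small-8k-instruct/resolve/main/config.json \
    | grep '"model_type"'
# -> "model_type": "phi3small", which get_model() does not support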

Is there any model I could use to start playing around with LoRAX?

peterschmidt85 commented 4 months ago

@jmorenobl You just need to log in to the Hugging Face Hub, open the model page, accept the EULA, and then run LoRAX, passing your own HUGGING_FACE_HUB_TOKEN env variable.
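
For completeness, assuming access to the gated repo has already been requested and granted on the model page, the full sequence would look something like this (the token value is a placeholder):

# 1. Accept the terms at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
# 2. Create a read token at https://huggingface.co/settings/tokens
export HUGGING_FACE_HUB_TOKEN=hf_xxx   # placeholder, use your own token

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v $volume:/data \
    ghcr.io/predibase/lorax:main --model-id $model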