huchenlei / ComfyUI_omost

ComfyUI implementation of Omost
Apache License 2.0

[Enhancement] Add external LLM service support #25

Closed · mymusise closed this 3 weeks ago

mymusise commented 3 weeks ago

Hi, this modification adds the capability to use external LLM services, such as deploying the LLM with TGI (Text Generation Inference) to accelerate inference. In my tests, this gives a 6x speed improvement on an H100, and on an A10G the average response time is only 50 seconds.

For example:

First, we can deploy the LLM using TGI through Docker:

port=8080
modelID=lllyasviel/omost-llama-3-8b
memoryRate=0.9 # The model needs roughly 20 GB of VRAM; adjust this fraction to the VRAM of the deployment machine
volume=$HOME/.cache/huggingface/hub # Host directory for the model cache files

docker run --gpus all -p $port:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id $modelID --max-total-tokens 9216 --cuda-memory-fraction $memoryRate
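
Before wiring anything into ComfyUI, you can confirm the container is actually serving. As far as I know, TGI itself exposes /health and /info endpoints (these come from TGI, not from this PR), so a quick check looks like:

# Liveness check: prints 200 once the shard is up
curl -s -o /dev/null -w "%{http_code}\n" 127.0.0.1:8080/health

# Model metadata: model id, max_total_tokens, and other runtime settings
curl -s 127.0.0.1:8080/info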

Then, test whether the LLM service has started successfully:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Omost?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
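
If the service is up, TGI answers with a JSON object whose generated_text field contains the completion, e.g. {"generated_text":" ..."} (the exact text varies from run to run); a connection error here means the container is not ready yet.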

Next, add an Omost LLM HTTP Server node and enter the service address of the LLM.
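
Assuming the port mapping from the Docker command above, the address to enter should be the base URL of the TGI server, e.g. http://127.0.0.1:8080 when ComfyUI runs on the same machine as the container, or http://<server-ip>:8080 for a remote deployment.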

mymusise commented 3 weeks ago

Sorry guys, there was a small issue with the previous instructions (now fixed). If the Docker model download gets stuck, you can try adding a proxy. Once the service starts successfully, a Connected line will appear in the log.

Here is the complete log of a successful start:

2024-06-08T07:52:14.960070Z INFO text_generation_launcher: Args { model_id: "lllyasviel/omost-llama-3-8b", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: Some(9216), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "984a7c9927a7", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.9, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-06-08T07:52:14.963934Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-06-08T07:52:15.034587Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-06-08T07:52:15.034603Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-06-08T07:52:15.034605Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-08T07:52:15.034670Z INFO download: text_generation_launcher: Starting download process.
2024-06-08T07:52:22.777037Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-08T07:52:23.944873Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-08T07:52:23.945022Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-08T07:52:33.954188Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
...
2024-06-08T07:54:24.058199Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-08T07:54:30.667978Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-08T07:54:30.764514Z INFO shard-manager: text_generation_launcher: Shard ready in 126.819003126s rank=0
2024-06-08T07:54:30.849606Z INFO text_generation_launcher: Starting Webserver
2024-06-08T07:54:30.944583Z INFO text_generation_router: router/src/main.rs:195: Using the Hugging Face API
2024-06-08T07:54:30.949210Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-08T07:54:31.115560Z INFO text_generation_router: router/src/main.rs:474: Serving revision 596a5a55a1dea599afd7b379ca16687c52c7 of model lllyasviel/omost-llama-3-8b
2024-06-08T07:54:31.384286Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|begin_of_text|>' was expected to have ID '128000' but was given ID 'None'
2024-06-08T07:54:31.384319Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|end_of_text|>' was expected to have ID '128001' but was given ID 'None'
... (the same warning repeats for each remaining special token) ...
2024-06-08T07:54:31.385116Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved_special_token_250|>' was expected to have ID '128255' but was given ID 'None'
2024-06-08T07:54:31.388631Z INFO text_generation_router: router/src/main.rs:289: Using config Some(Llama)
2024-06-08T07:54:31.403782Z INFO text_generation_router: router/src/main.rs:317: Warming up model
2024-06-08T07:54:33.856453Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-06-08T07:54:34.979301Z INFO text_generation_router: router/src/main.rs:354: Setting max batch total tokens to 20576
2024-06-08T07:54:34.979321Z INFO text_generation_router: router/src/main.rs:355: Connected
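
If you script the deployment, you do not have to watch for the Connected line by hand; a minimal sketch that polls TGI's /health endpoint (which returns HTTP 200 once the shard is ready) before starting any jobs:

# Block until TGI reports healthy, then continue
until curl -sf 127.0.0.1:8080/health > /dev/null; do
    echo "waiting for TGI to become ready..."
    sleep 5
done
echo "TGI is ready"

With the ~127s shard startup shown in the log above, expect this loop to run for a couple of minutes on a cold start.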

Alternatively, if you have previously used AutoModelForCausalLM.from_pretrained to load the model, the model files should already be cached on the host, and you can mount that cache directory into the Docker container.

For example, on my Linux machine, the default Hugging Face cache path is /home/ubuntu/.cache/huggingface/hub.

port=8080
modelID=lllyasviel/omost-llama-3-8b
memoryRate=0.9
volume=/home/ubuntu/.cache/huggingface/hub

docker run --gpus all -p $port:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id $modelID --max-total-tokens 9216 --cuda-memory-fraction $memoryRate
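
To verify that the mount will be picked up, you can check that the weights are really in that directory; the Hugging Face hub cache stores each repo under a models--<org>--<name> folder:

# Should list: models--lllyasviel--omost-llama-3-8b
ls $HOME/.cache/huggingface/hub | grep omost

If the folder is present, the "Files are already present on the host. Skipping download." line should appear in the TGI log instead of a fresh download.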

Then, we can get the response with curl:

(base) ➜ curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Omost?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
{"generated_text":" Deep Omost is a comprehensive, non-invasive, and evidence-based treatment approach that targets the root"}%

mymusise commented 3 weeks ago

> I feel like this feature needs to be better documented.

Sure! Let me detail it.

Hi @huchenlei, I have updated the usage in the README. Please try again, and feel free to let me know if there are any issues.

huchenlei commented 3 weeks ago

Thanks for addressing that! I will give it a test tomorrow.