huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Supplying docker container parameters but they are not being read #540

Closed ApoorveK closed 1 year ago

ApoorveK commented 1 year ago

System Info

Currently trying to run the server as a Docker container with CPU support (with --disable-custom-kernels) and the default model (bigscience/bloom-560m). The server works smoothly as a single Docker container and is accessible through the text-generation Python package as described in the documentation. The plan was to deploy a Docker swarm with multiple instances of the server running different models (a kind of centralised LLM server).

Information

Tasks

Reproduction

Currently trying to deploy the Docker swarm with text-generation-inference as a service, using the following Docker Compose YAML file: dockerSwarm.txt (attached as .txt; rename the extension to .yml).

And using the following commands to start the Docker stack:

docker swarm init --advertise-addr 127.0.0.1
docker stack deploy -c dockerSwarm.yml llm_server

and getting the following logs from the Docker service with this command:

docker service logs llm_server_llm_bloom -f

Output:

llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    | error: unexpected argument '--model-id bigscience/bloom-560m' found
llm_server_llm_bloom.1.kjojjuj9jczc@e2e-100-17    | error: unexpected argument '--model-id bigscience/bloom-560m' found
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    | 
llm_server_llm_bloom.1.kjojjuj9jczc@e2e-100-17    | 
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    |   tip: a similar argument exists: '--model-id'
llm_server_llm_bloom.1.kjojjuj9jczc@e2e-100-17    |   tip: a similar argument exists: '--model-id'
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    | 
llm_server_llm_bloom.1.kjojjuj9jczc@e2e-100-17    | 
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    | Usage: text-generation-launcher <--model-id <MODEL_ID>|--revision <REVISION>|--sharded <SHARDED>|--num-shard <NUM_SHARD>|--quantize <QUANTIZE>|--trust-remote-code|--max-concurrent-requests <MAX_CONCURRENT_REQUESTS>|--max-best-of <MAX_BEST_OF>|--max-stop-sequences <MAX_STOP_SEQUENCES>|--max-input-length <MAX_INPUT_LENGTH>|--max-total-tokens <MAX_TOTAL_TOKENS>|--max-batch-size <MAX_BATCH_SIZE>|--waiting-served-ratio <WAITING_SERVED_RATIO>|--max-batch-total-tokens <MAX_BATCH_TOTAL_TOKENS>|--max-waiting-tokens <MAX_WAITING_TOKENS>|--port <PORT>|--shard-uds-path <SHARD_UDS_PATH>|--master-addr <MASTER_ADDR>|--master-port <MASTER_PORT>|--huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>|--weights-cache-override <WEIGHTS_CACHE_OVERRIDE>|--disable-custom-kernels|--json-output|--otlp-endpoint <OTLP_ENDPOINT>|--cors-allow-origin <CORS_ALLOW_ORIGIN>|--watermark-gamma <WATERMARK_GAMMA>|--watermark-delta <WATERMARK_DELTA>|--env>
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    |

Expected behavior

Docker swarm should start the Docker services, deploying the 2 text-generation-inference servers defined in the compose file.

Narsil commented 1 year ago

Your yaml file doesn't properly split the arguments.

"--model-id", "bigscience/bloom-560m" is what you want to send to you process.

Also this is NOT an officially supported command, please tick the correct box next time :)

ApoorveK commented 1 year ago

Thank you @Narsil for the suggestion, I will keep it in mind. Can you also suggest a way to set shm-size for the services? I am unable to set it with the following Docker swarm file.

version: "3.8"

services:
  llm_bloom:
    # build:
    #   context: .
    #   args:
    #     model: bigscience/bloom-560m
    #     num_shard: 2
    image: ghcr.io/huggingface/text-generation-inference:0.8
    ports:
      - 8089:80
    volumes:
      - type: volume
        source: mydata
        target: /data
    command: ["--shm-size","1g","--model-id","bigscience/bloom-560m","--disable-custom-kernels","--num_shard","2","--max-concurrent-requests","128"]
    deploy:
      replicas: 1

  llm_bloom_quantized:
    # build:
    #   context: .
    #   args:
    #     model: bigscience/bloom-560m
    #     num_shard: 2
    image: ghcr.io/huggingface/text-generation-inference:0.8
    ports:
      - 8099:80
    volumes:
      - type: volume
        source: mydata
        target: /data
    command: ["--shm-size","1g","--model-id","bigscience/bloom-560m","--disable-custom-kernels","--num_shard","2","--max-concurrent-requests","128","--quantize","bitsandbytes"]
      # [possible values: bitsandbytes, gptq]
    deploy:
      replicas: 1

volumes:
  mydata:
# volumes:
#       - type: tmpfs
#         target: /dev/shm
#         tmpfs:
#            size: 4096000000 # (this means 4GB)
  # shm:
  #   driver: local
  #   driver_opts:
  #     type: tmpfs
  #     mount_options:
  #       - size="4G"
  # - type: tmpfs
  #   target: /dev/shm
  #   tmpfs:
  #       size: 4096000000 # (this means 4GB)
Narsil commented 1 year ago

Sorry I never used swarm.
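
Editor's note (a hedged sketch, not part of the thread): --shm-size is a docker run option rather than a text-generation-launcher flag, so passing it inside command would trigger the same "unexpected argument" error, since the launcher only accepts the flags listed in its usage output above. Swarm services do not, as far as I know, expose an equivalent of docker run --shm-size; the commented-out attempts in the compose file above point at the usual workaround, a tmpfs mount on /dev/shm with an explicit size (long volume syntax, compose file format 3.6+). A minimal sketch for one of the services, with an illustrative 1 GB size:

version: "3.8"

services:
  llm_bloom:
    image: ghcr.io/huggingface/text-generation-inference:0.8
    ports:
      - 8089:80
    volumes:
      - type: volume
        source: mydata
        target: /data
      # tmpfs mounted at /dev/shm stands in for --shm-size; the size value is illustrative
      - type: tmpfs
        target: /dev/shm
        tmpfs:
          size: 1073741824   # 1 GB, in bytes
    # --shm-size removed from the launcher arguments; note the flag is --num-shard (hyphen), per the usage output above
    command: ["--model-id", "bigscience/bloom-560m", "--disable-custom-kernels", "--num-shard", "2", "--max-concurrent-requests", "128"]
    deploy:
      replicas: 1

volumes:
  mydata:

The same tmpfs mount would need to be repeated for the llm_bloom_quantized service.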