huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Supplying docker container parameters but they are not being read #540

Closed ApoorveK closed 1 year ago

ApoorveK commented 1 year ago

System Info

Currently trying to run the server as a Docker container with CPU support (with --disable-custom-kernels) and the default model (bigscience/bloom-560m). The server works smoothly as a single Docker container and is accessible through the text-generation Python package as described in the documentation. The plan was to deploy a Docker swarm with multiple instances of the server running different models (a kind of centralised LLM server).

Information

Tasks

Reproduction

Currently trying to deploy the Docker swarm with text-generation-inference as a service, using the following Docker Compose YAML file: dockerSwarm.txt (attached as .txt; rename the extension to .yml).

And using the following commands to start the Docker stack:

docker swarm init --advertise-addr 127.0.0.1
docker stack deploy -c dockerSwarm.yml llm_server

and getting the following logs from the Docker service with this command:

docker service logs llm_server_llm_bloom -f

Output:

llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    | error: unexpected argument '--model-id bigscience/bloom-560m' found
llm_server_llm_bloom.1.kjojjuj9jczc@e2e-100-17    | error: unexpected argument '--model-id bigscience/bloom-560m' found
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    | 
llm_server_llm_bloom.1.kjojjuj9jczc@e2e-100-17    | 
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    |   tip: a similar argument exists: '--model-id'
llm_server_llm_bloom.1.kjojjuj9jczc@e2e-100-17    |   tip: a similar argument exists: '--model-id'
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    | 
llm_server_llm_bloom.1.kjojjuj9jczc@e2e-100-17    | 
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    | Usage: text-generation-launcher <--model-id <MODEL_ID>|--revision <REVISION>|--sharded <SHARDED>|--num-shard <NUM_SHARD>|--quantize <QUANTIZE>|--trust-remote-code|--max-concurrent-requests <MAX_CONCURRENT_REQUESTS>|--max-best-of <MAX_BEST_OF>|--max-stop-sequences <MAX_STOP_SEQUENCES>|--max-input-length <MAX_INPUT_LENGTH>|--max-total-tokens <MAX_TOTAL_TOKENS>|--max-batch-size <MAX_BATCH_SIZE>|--waiting-served-ratio <WAITING_SERVED_RATIO>|--max-batch-total-tokens <MAX_BATCH_TOTAL_TOKENS>|--max-waiting-tokens <MAX_WAITING_TOKENS>|--port <PORT>|--shard-uds-path <SHARD_UDS_PATH>|--master-addr <MASTER_ADDR>|--master-port <MASTER_PORT>|--huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>|--weights-cache-override <WEIGHTS_CACHE_OVERRIDE>|--disable-custom-kernels|--json-output|--otlp-endpoint <OTLP_ENDPOINT>|--cors-allow-origin <CORS_ALLOW_ORIGIN>|--watermark-gamma <WATERMARK_GAMMA>|--watermark-delta <WATERMARK_DELTA>|--env>
llm_server_llm_bloom.1.ophf8tfzsxk4@e2e-100-17    |

Expected behavior

Docker swarm should start the Docker services, deploying the 2 text-generation-inference servers defined in the compose file.

Narsil commented 1 year ago

Your yaml file doesn't properly split the arguments.

"--model-id", "bigscience/bloom-560m" is what you want to send to you process.

Also this is NOT an officially supported command, please tick the correct box next time :)

ApoorveK commented 1 year ago

Thank you @Narsil for the suggestion, I will keep it in mind. Can you also suggest a way to set shm-size for the services? I am unable to set it with the following Docker swarm file.

version: "3.8"

services:
  llm_bloom:
    # build:
    #   context: .
    #   args:
    #     model: bigscience/bloom-560m
    #     num_shard: 2
    image: ghcr.io/huggingface/text-generation-inference:0.8
    ports:
      - 8089:80
    volumes:
      - type: volume
        source: mydata
        target: /data
    command: ["--shm-size","1g","--model-id","bigscience/bloom-560m","--disable-custom-kernels","--num_shard","2","--max-concurrent-requests","128"]
    deploy:
      replicas: 1

  llm_bloom_quantized:
    # build:
    #   context: .
    #   args:
    #     model: bigscience/bloom-560m
    #     num_shard: 2
    image: ghcr.io/huggingface/text-generation-inference:0.8
    ports:
      - 8099:80
    volumes:
      - type: volume
        source: mydata
        target: /data
    command: ["--shm-size","1g","--model-id","bigscience/bloom-560m","--disable-custom-kernels","--num_shard","2","--max-concurrent-requests","128","--quantize","bitsandbytes"]
      # [possible values: bitsandbytes, gptq]
    deploy:
      replicas: 1

volumes:
  mydata:
# volumes:
#       - type: tmpfs
#         target: /dev/shm
#         tmpfs:
#            size: 4096000000 # (this means 4GB)
  # shm:
  #   driver: local
  #   driver_opts:
  #     type: tmpfs
  #     mount_options:
  #       - size="4G"
  # - type: tmpfs
  #   target: /dev/shm
  #   tmpfs:
  #       size: 4096000000 # (this means 4GB)
Narsil commented 1 year ago

Sorry I never used swarm.
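
Editor's note (a hedged sketch, not part of the thread): --shm-size is a docker run option rather than a text-generation-launcher flag, so passing it inside command would trigger the same "unexpected argument" error, since the launcher only accepts the flags listed in its usage output above. Swarm services do not, as far as I know, expose an equivalent of docker run --shm-size; the commented-out attempts in the compose file above point at the usual workaround, a tmpfs mount on /dev/shm with an explicit size (long volume syntax, compose file format 3.6+). A minimal sketch for one of the services, with an illustrative 1 GB size:

version: "3.8"

services:
  llm_bloom:
    image: ghcr.io/huggingface/text-generation-inference:0.8
    ports:
      - 8089:80
    volumes:
      - type: volume
        source: mydata
        target: /data
      # tmpfs mounted at /dev/shm stands in for --shm-size; the size value is illustrative
      - type: tmpfs
        target: /dev/shm
        tmpfs:
          size: 1073741824   # 1 GB, in bytes
    # --shm-size removed from the launcher arguments; note the flag is --num-shard (hyphen), per the usage output above
    command: ["--model-id", "bigscience/bloom-560m", "--disable-custom-kernels", "--num-shard", "2", "--max-concurrent-requests", "128"]
    deploy:
      replicas: 1

volumes:
  mydata:

The same tmpfs mount would need to be repeated for the llm_bloom_quantized service.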