linto-ai / llm-gateway

Rolling summarization using LLM

LLM Gateway

The gateway can be started locally after installing the requirements:

python3 -m app --api_base=http://localhost:9000/v1

Tests would run with

python3 -m tests

but there are none yet. The chunker seems to work well, at least.

Next, head to host:port/docs for the Swagger UI.
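
For a quick smoke test outside Swagger, the sketch below lists the loaded services and posts a summarization request from Python. The /services route is the one mentioned in the Docker notes below; the gateway port, the per-service request path and the payload shape are assumptions, so adapt them to what host:port/docs actually exposes.

import requests

GATEWAY = "http://localhost:8000"  # assumption: adjust to the host:port your gateway listens on

# List the services loaded from the mounted manifests (see the /services note below)
print(requests.get(f"{GATEWAY}/services").json())

# Hypothetical request against a "cra" summary service (see the manifest example below);
# the real path and schema are described in the Swagger UI at /docs.
payload = {
    "flavor": "mixtral",
    "turns": [
        {"speaker": "A", "text": "Bonjour, on commence la réunion."},
        {"speaker": "B", "text": "Premier point : le budget."},
    ],
}
resp = requests.post(f"{GATEWAY}/services/cra", json=payload)
print(resp.status_code, resp.json())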

Docker Compose & Swarm env

docker compose up will start vLLM with Vigostral and llm-gateway on the same network.

Note: any modification to a servicename.json manifest triggers a hot reload of the /services route. The prompt template (servicename.txt) is reloaded upon each usage request.

Note: mount your service manifests folder (./services here) as /usr/src/services.

Note: always use a string value for OPENAI_API_TOKEN.

Available Envs

vLLM backend locally

docker run --gpus=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 9000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model TheBloke/Vigostral-7B-Chat-AWQ \
  --quantization awq
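
Once the container is up, the OpenAI-compatible API should answer on port 9000, which matches the --api_base passed to the gateway above. A minimal sanity check in Python (assuming no API key is configured on the vLLM side):

import requests

VLLM_BASE = "http://localhost:9000/v1"  # same base URL as the gateway's --api_base

# List the models served by vLLM (OpenAI-compatible /models endpoint)
models = requests.get(f"{VLLM_BASE}/models").json()
print([m["id"] for m in models["data"]])

# Send a tiny chat completion to the Vigostral model started above
resp = requests.post(
    f"{VLLM_BASE}/chat/completions",
    json={
        "model": "TheBloke/Vigostral-7B-Chat-AWQ",
        "messages": [{"role": "user", "content": "Réponds par un seul mot : bonjour."}],
        "max_tokens": 16,
        "temperature": 0.1,
    },
)
print(resp.json()["choices"][0]["message"]["content"])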

vLLM backend on the server

docker service create \
  --name vllm-service \
  --network net_llm_services \
  --mount type=bind,source=/home/linagora/shared_mount/models/,target=/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model TheBloke/Instruct_Mixtral-8x7B-v0.1_Dolly15K-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.5

Note on app/services

A new service is declared with a servicename.json manifest of parameters, paired with a servicename.txt file that holds the prompt template.

See the inline notes in the example manifest below:

{
    "type": "summary",
    "fields": 2,
    "name": "cra", // name of the service (route)
    "description": {
        "fr": "Compte Rendu Analytique"
    },
    "backend": "vLLM", // only one supported, we can add more easily
    "flavor": [
        {
            "name":"mixtral", // the name of the flavor to use in request
            "modelName": "TheBloke/Instruct_Mixtral-8x7B-v0.1_Dolly15K-AWQ", // Ensure you have this running on vLLM server or it will crash
            "totalContextLength": 32768, // Max Context = Prompt + User Prompt + generated Tokens
            "maxGenerationLength": 2048, // Limits the output from the model. Keep this fairly high.
            "tokenizerClass": "LlamaTokenizer",
            "createNewTurnAfter": 178, // Forces the chunker to create a new "virtual turns" whenever a turn reaches this number of tokens.
            "summaryTurns": 2, // 2 previously summarized turns will get injected to the template
            "maxNewTurns": 6, // 6 turns at max will get processed. Shall failback to less if we reach high token count (close to maxContextSize)
            "temperature": 0.1, // 0-1 : creativity, shall be close to zero as we want accurate sumpmaries
            "top_p": 0.8 // 0-1 : i.e. 0.5: only considers words that together add up to at least 50% of the total probability, leaving out the less likely ones. i.e 0.9 0.9: This includes a lot more words in the choice, allowing for more variety and originality.
        }
    ]
}
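
To see how these numbers fit together, here is a rough sketch of the context budget arithmetic. It is only an illustration of the relationship between the manifest fields, not the gateway's actual chunker, and the template size is a made-up figure.

# Illustrative arithmetic only; not the gateway's chunking code.
TOTAL_CONTEXT = 32768     # totalContextLength
MAX_GENERATION = 2048     # maxGenerationLength, reserved for the model's output
TEMPLATE_TOKENS = 700     # hypothetical token count of the prompt template (cra.txt)
TURN_TOKENS = 178         # createNewTurnAfter: upper bound on a single (virtual) turn
SUMMARY_TURNS = 2         # previously summarized turns injected into the template
MAX_NEW_TURNS = 6         # new turns per request, reduced when the budget gets tight

# Tokens left for conversation content once output and template are reserved
budget = TOTAL_CONTEXT - MAX_GENERATION - TEMPLATE_TOKENS

# Worst-case cost of one request: injected summaries plus new turns, each capped at TURN_TOKENS
worst_case = (SUMMARY_TURNS + MAX_NEW_TURNS) * TURN_TOKENS
print(f"budget={budget} tokens, worst case input={worst_case} tokens, fits={worst_case <= budget}")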