rh-aiservices-bu / llm-on-openshift

Resources, demos, recipes,... to work with LLMs on OpenShift with OpenShift AI or Open Data Hub.
Apache License 2.0

Bump HFTGI version #25

Closed rcarrata closed 9 months ago

rcarrata commented 10 months ago

HFTGI v1.2.0 has been released, so this bumps the version of the container image (https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference/153122556?tag=1.2.0).

Tested with ROSA and RHODS 2.4 and it works as expected:

$ oc get deploy -n llms hf-text-generation-inference-server -o jsonpath='{.spec.template.spec.containers[0].image}'
ghcr.io/huggingface/text-generation-inference:1.2.0
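The bump itself is just the image tag on the Deployment's container. A minimal sketch of the relevant manifest fragment, with the name and namespace taken from the `oc` check above (the container name is assumed):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hf-text-generation-inference-server
  namespace: llms
spec:
  template:
    spec:
      containers:
        - name: server  # container name assumed for illustration
          # Bumped from the previous HFTGI tag to 1.2.0
          image: ghcr.io/huggingface/text-generation-inference:1.2.0
```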

For inference, the default Flan-T5-XL model was used:

curl https://hf-text-generation-inference-server-llms.apps.rosa-xxxx.xxxx.p1.openshiftapps.com/generate \
    -X POST \
    -d '{"inputs":"What is the capital from Italy?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
{"generated_text":"rome"}
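The same request can be scripted. A small Python sketch of a client for the `/generate` endpoint, using only the standard library; the route host is a placeholder, and `build_payload`/`generate` are illustrative helper names, not part of TGI:

```python
import json
import urllib.request

# Placeholder route host; replace with your cluster's actual route.
BASE_URL = "https://hf-text-generation-inference-server-llms.apps.example.com"


def build_payload(prompt: str, max_new_tokens: int = 20) -> bytes:
    """Build the JSON body expected by the TGI /generate endpoint."""
    return json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    ).encode("utf-8")


def generate(prompt: str, max_new_tokens: int = 20) -> str:
    """POST to /generate and return the generated_text field."""
    req = urllib.request.Request(
        f"{BASE_URL}/generate",
        data=build_payload(prompt, max_new_tokens),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]
```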

The model info for these tests is available via the /info endpoint:

curl https://hf-text-generation-inference-server-llms.apps.rosa-xxx.xxx.p1.openshiftapps.com/info \
    -X GET | jq -r .
{
  "model_id": "google/flan-t5-xl",
  "model_sha": "7d6315df2c2fb742f0f5b556879d730926ca9001",
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text2text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 16000,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "1.2.0",
  "sha": "ccd5725a0c0b2ef151d317c86d1f52ad038bbae9",
  "docker_label": "sha-ccd5725"
}
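The `/info` response can also double as a scripted sanity check that the running server matches the bumped tag. A sketch in Python; `version_matches` is an illustrative helper, and only the `version` field shown in the output above is assumed:

```python
import json


def version_matches(info_json: bytes, expected: str = "1.2.0") -> bool:
    """Return True if the TGI /info response reports the expected version."""
    info = json.loads(info_json)
    return info.get("version") == expected
```

In practice the bytes would come from fetching the route's `/info` endpoint, as in the curl call above.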