Tested with ROSA and RHODS 2.4 and it works as expected:
$ oc get deploy -n llms hf-text-generation-inference-server -o jsonpath='{.spec.template.spec.containers[0].image}'
ghcr.io/huggingface/text-generation-inference:1.2.0
For inference, the default Flan-T5-xl model was used:
curl https://hf-text-generation-inference-server-llms.apps.rosa-xxxx.xxxx.p1.openshiftapps.com/generate \
-X POST \
-d '{"inputs":"What is the capital from Italy?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
{"generated_text":"rome"}
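The same request can be issued from a client instead of curl. A minimal sketch in Python, assuming only the request/response shape shown above (the route URL is the cluster-specific one from the curl example and is an assumption here; the helper names are hypothetical):

```python
import json

# Hypothetical helper: build the JSON body expected by TGI's /generate
# endpoint, mirroring the curl example above.
def build_generate_payload(inputs: str, max_new_tokens: int = 20) -> str:
    payload = {
        "inputs": inputs,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return json.dumps(payload)

# Hypothetical helper: pull the text out of a /generate response body,
# e.g. '{"generated_text":"rome"}' from the test above.
def extract_generated_text(response_body: str) -> str:
    return json.loads(response_body)["generated_text"]
```

Sending `build_generate_payload(...)` with `Content-Type: application/json` (via `requests.post` or similar) reproduces the curl call; `extract_generated_text` then replaces the manual read of the response.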
The model info for these tests is available via the /info endpoint:
curl https://hf-text-generation-inference-server-llms.apps.rosa-xxx.xxx.p1.openshiftapps.com/info \
-X GET | jq -r .
{
"model_id": "google/flan-t5-xl",
"model_sha": "7d6315df2c2fb742f0f5b556879d730926ca9001",
"model_dtype": "torch.float16",
"model_device_type": "cuda",
"model_pipeline_tag": "text2text-generation",
"max_concurrent_requests": 128,
"max_best_of": 2,
"max_stop_sequences": 4,
"max_input_length": 1024,
"max_total_tokens": 2048,
"waiting_served_ratio": 1.2,
"max_batch_total_tokens": 16000,
"max_waiting_tokens": 20,
"validation_workers": 2,
"version": "1.2.0",
"sha": "ccd5725a0c0b2ef151d317c86d1f52ad038bbae9",
"docker_label": "sha-ccd5725"
}
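The limits reported by /info (`max_input_length: 1024`, `max_total_tokens: 2048`) bound what a client may request: the prompt must fit in `max_input_length`, and prompt tokens plus `max_new_tokens` must not exceed `max_total_tokens`. A small sketch of that check, assuming token counts are already known (a real client would tokenize the prompt with the model's tokenizer; the function name is hypothetical):

```python
# Limits as reported by the /info response above.
INFO = {
    "max_input_length": 1024,
    "max_total_tokens": 2048,
}

def request_within_limits(input_tokens: int, max_new_tokens: int,
                          info: dict = INFO) -> bool:
    """Return True if a /generate request fits TGI's configured limits."""
    # The prompt alone must fit the input window.
    if input_tokens > info["max_input_length"]:
        return False
    # Prompt plus generated tokens must fit the total budget.
    return input_tokens + max_new_tokens <= info["max_total_tokens"]
```

For example, a 100-token prompt with `max_new_tokens: 20` (as in the curl call above) is well within both limits, while a 1024-token prompt asking for 1025 new tokens exceeds `max_total_tokens` and would be rejected.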
Since HF TGI v1.2.0 has been released, this bumps the version of the released container image (https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference/153122556?tag=1.2.0).