Closed jsimao71 closed 1 month ago
Add the option --disable-custom-kernels:

docker run --name hf-server -d --shm-size 1g -p 80:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.7 --model-id $model --num-shard $num_shard --disable-custom-kernels

See the help text for --disable-custom-kernels: for some models (like BLOOM), text-generation-inference implements custom CUDA kernels to speed up inference. Those kernels were only tested on A100; use this flag to disable them if you're running on different hardware and encounter issues.
System Info
Deploying the server as a Docker image on a machine without a GPU. Invoking the generation endpoint produces the error:
{"error":"Request failed during generation: Server error: attention_scores_2d must be a CUDA tensor","error_type":"generation"}
Deployed as a Docker image (tested with several models):
model=${1:-bigscience/bloom-560m}
num_shard=2
volume=/data/hf
docker run --name hf-server -d --shm-size 1g -p 80:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.7 --model-id $model --num-shard $num_shard
curl 127.0.0.1:80/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' -H 'Content-Type: application/json'

Response:
{"error":"Request failed during generation: Server error: attention_scores_2d must be a CUDA tensor","error_type":"generation"}
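The failure mode above can be recognized programmatically from the response body; a minimal sketch (the payload string is copied from the response above, and the detection heuristic is an assumption, not part of the TGI API):

```python
import json

# Error body returned by the /generate endpoint (copied from the report above)
payload = ('{"error":"Request failed during generation: Server error: '
           'attention_scores_2d must be a CUDA tensor","error_type":"generation"}')

body = json.loads(payload)
# Heuristic (assumption): a CUDA-tensor error from a generation request on a
# CPU-only host suggests custom CUDA kernels are active and should be turned
# off with --disable-custom-kernels.
is_cuda_kernel_error = (body.get("error_type") == "generation"
                        and "CUDA tensor" in body.get("error", ""))
print(is_cuda_kernel_error)
```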
Logs:
Information
Tasks
Reproduction
Deploy server:
model=bigscience/bloom-560m
num_shard=2
volume=/data/hf
docker run --name hf-server -d --shm-size 1g -p 80:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.7 --model-id $model --num-shard $num_shard
Call endpoint: curl 127.0.0.1:80/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' -H 'Content-Type: application/json'
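The same call can be built without curl; a minimal sketch using Python's standard library (the host, port, and /generate path are taken from the docker run and curl commands above, and sending the request of course requires the server to be running):

```python
import json
import urllib.request

# Build the same request body as the curl reproduction above
payload = {"inputs": "What is Deep Learning?",
           "parameters": {"max_new_tokens": 17}}
req = urllib.request.Request(
    "http://127.0.0.1:80/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# The request object can be inspected without a running server;
# urllib.request.urlopen(req) would actually send it.
print(req.data.decode("utf-8"))
```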
Expected behavior
Expect a 200 OK status and a JSON response containing the generated text.