huggingface / tgi-gaudi

Large Language Model Text Generation Inference on Habana Gaudi
http://hf.co/docs/text-generation-inference
Apache License 2.0

Misleading documentation #174

Open 12010486 opened 3 months ago

12010486 commented 3 months ago

Hi everyone,

I can see there has been a recent effort to add more documentation to TGI, and I appreciate it. However, some sections are misleading. For example, docs/source/conceptual/quantization.md describes quantization with GPTQ and quantization with bitsandbytes, but to the best of my knowledge neither works on Gaudi2 (we tested bitsandbytes, and CUDA calls are hardcoded in it).

My ask would be to prune the bits that are not relevant for Gaudi.

12010486 commented 3 months ago

I can also contribute, if you find that relevant. We interact with customers, so we might bring in a different perspective.

regisss commented 2 months ago

Maybe we should make it clearer in the README that not all TGI features are supported on Gaudi, and that the documentation for this fork is the README.

endomorphosis commented 2 months ago

I came here to chime in that the documentation is wrong.

This example crashes during warmup:

docker run -p 8080:80 \
  --runtime=habana \
  -v $volume:/data \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  -e HUGGING_FACE_HUB_TOKEN=$hf_token \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e PREFILL_BATCH_BUCKET_SIZE=1 \
  -e BATCH_BUCKET_SIZE=256 \
  -e PAD_SEQUENCE_TO_MULTIPLE_OF=128 \
  --cap-add=sys_nice \
  --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.1 \
  --model-id $model \
  --max-batch-prefill-tokens 8242 \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --max-batch-size 256 \
  --max-concurrent-requests 400 \
  --sharded true \
  --num-shard 8
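For context, the bucketing variables in the command above (PAD_SEQUENCE_TO_MULTIPLE_OF, BATCH_BUCKET_SIZE) round sequence lengths and batch sizes up to fixed bucket boundaries, and warmup pre-compiles graphs for those shapes. A minimal sketch of that round-up behavior (my own illustration, not actual tgi-gaudi code; the helper name `round_up` is hypothetical):

```python
# Sketch of the round-up bucketing implied by PAD_SEQUENCE_TO_MULTIPLE_OF
# and BATCH_BUCKET_SIZE. Illustration only, not tgi-gaudi source code.

def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple

# With PAD_SEQUENCE_TO_MULTIPLE_OF=128, a 100-token prompt is padded to 128
# tokens and a 200-token prompt to 256, so the server only needs graphs for
# a small, fixed set of input shapes.
print(round_up(100, 128))  # 128
print(round_up(200, 128))  # 256
```

Large bucket values such as BATCH_BUCKET_SIZE=256 mean warmup tries to allocate for the biggest bucket up front, which is one place a crash like the one above can surface.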

regisss commented 2 months ago

@endomorphosis Can you please point me to where you found this example in the documentation? I can't find it.