huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
MIT License

Add dtype management in inference endpoints #117

Closed: clefourrier closed this issue 2 months ago

clefourrier commented 3 months ago

For dtype = float32/bfloat16/float16, we need to change the image creation to

                image = {
                    "health_route": "/health",
                    "env": {
                        # Documentation: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher
                        "MAX_BATCH_PREFILL_TOKENS": "2048",
                        "MAX_INPUT_LENGTH": "2047",
                        "MAX_TOTAL_TOKENS": "2048",
                        "MODEL_ID": "/repository",
                    },
                    "url": "ghcr.io/huggingface/text-generation-inference:1.1.0",
                }
                if config.model_dtype is not None:
                    image["env"]["DTYPE"] = str(config.model_dtype) 
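A minimal self-contained sketch of the snippet above, assuming a stand-in `EndpointConfig` dataclass (not lighteval's actual config class) with a `model_dtype` field:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointConfig:
    # Stand-in for the real config object; only the field used here is modeled.
    model_dtype: Optional[str] = None  # e.g. "float32", "bfloat16", "float16"


def build_image(config: EndpointConfig) -> dict:
    """Build the TGI custom image spec, forwarding dtype via the DTYPE env var."""
    image = {
        "health_route": "/health",
        "env": {
            # Documentation: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher
            "MAX_BATCH_PREFILL_TOKENS": "2048",
            "MAX_INPUT_LENGTH": "2047",
            "MAX_TOTAL_TOKENS": "2048",
            "MODEL_ID": "/repository",
        },
        "url": "ghcr.io/huggingface/text-generation-inference:1.1.0",
    }
    # Only set DTYPE when the user requested one; TGI picks its default otherwise.
    if config.model_dtype is not None:
        image["env"]["DTYPE"] = str(config.model_dtype)
    return image
```

With `model_dtype=None` the `DTYPE` key is simply absent, so TGI falls back to its default dtype selection.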

For quantization, it's the `--quantize` launcher flag with the `bitsandbytes` variants.
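By analogy with the dtype handling, quantization could be forwarded through the `QUANTIZE` environment variable (the TGI launcher exposes an env-var counterpart for each flag). This is a hedged sketch; the quantization values listed are the `bitsandbytes` variants from the launcher docs, and the helper name is ours, not lighteval's:

```python
from typing import Optional

def apply_quantization(image: dict, quantization: Optional[str]) -> dict:
    """Sketch: set TGI's QUANTIZE env var on an image spec, if requested.

    `quantization` would be e.g. "bitsandbytes", "bitsandbytes-nf4",
    or "bitsandbytes-fp4" (the --quantize bitsandbytes variants).
    """
    if quantization is not None:
        image["env"]["QUANTIZE"] = quantization
    return image
```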

Full options are here.

clefourrier commented 2 months ago

Closed by #124