huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Could not import SGMV kernel from Punica, falling back to loop. #2465

Open ksajan opened 2 weeks ago

ksajan commented 2 weeks ago

System Info

text-generation-launcher --env:

2024-08-28T05:17:36.254761Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: 21187c27c90acbec7f912b8af4feaec154de960f
Docker label: N/A
nvidia-smi:
N/A
xpu-smi:
N/A
2024-08-28T05:17:36.254797Z  INFO text_generation_launcher: Args {
    model_id: "bigscience/bloom-560m",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "0.0.0.0",
    port: 3000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
}

No GPU; using the CPU version.

Reproduction

  1. Installed Rust and created a virtual env with Python 3.9
  2. Installed Protoc
  3. Cloned the GitHub repo
  4. Ran the commands (full sequence sketched below):
    cd text-generation-inference/
    BUILD_EXTENSIONS=True make install-cpu
  5. Then tried the example of running TGI locally with the falcon-7b model, but after the download it fails to load with the error: Could not import SGMV kernel from Punica, falling back to loop.
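
A consolidated sketch of the steps above (the Protoc snippet mirrors the repo README; exact versions and paths are assumptions, not a verified setup):

    # 1. Install Rust via rustup
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

    # 2. Create and activate a Python 3.9 virtual env
    python3.9 -m venv .venv && source .venv/bin/activate

    # 3. Install Protoc (version pinned as in the README)
    PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
    curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
    sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
    sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'

    # 4. Clone the repo and build the CPU version
    git clone https://github.com/huggingface/text-generation-inference.git
    cd text-generation-inference/
    BUILD_EXTENSIONS=True make install-cpu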

Expected behavior

It should download the model and serve it without any errors.

ErikKaum commented 2 weeks ago

Hi @ksajan πŸ‘‹

Thanks for filing the issue. I think the problem is that you're running on a CPU, and falcon-7b in TGI is only supported with kernels that require a GPU.

If you want to run TGI locally on CPU to test, I'd recommend choosing a smaller model that doesn't rely on special kernels. If your requirements call for something like falcon-7b, then unfortunately you'll need a GPU machine.
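
For example, a minimal CPU launch with the small model from your --env output above (the port is taken from there too) would be something like:

    text-generation-launcher --model-id bigscience/bloom-560m --port 3000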

Let me know if I can help in any other way πŸ™Œ

ksajan commented 2 weeks ago

@ErikKaum I tried running lmsys/vicuna-7b-v1.3 as well, which I can run using llama_cpp. I was actually trying to train the Medusa head described in the TGI documentation, but I was unable to run this in Google Colab with a GPU either; it failed with a similar error.

ErikKaum commented 1 week ago

Yeah so the llama.cpp version probably uses different kernels that don't require GPUs.
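
A quick sanity check (just a sketch; it assumes PyTorch is installed in the same environment TGI runs in) is whether that environment sees a GPU at all. Your --env output above shows nvidia-smi: N/A, which points the same way:

    nvidia-smi
    python -c "import torch; print(torch.cuda.is_available())"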

When you built this for a GPU, did you use BUILD_EXTENSIONS=True make install-cpu or BUILD_EXTENSIONS=True make?

I'd nonetheless recommend using the Docker image to avoid building from source; it's usually a lot less hassle πŸ‘
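
If you do go the Docker route, the standard invocation from the TGI docs looks roughly like this (the image tag and model here are placeholders, adjust to your setup):

    model=lmsys/vicuna-7b-v1.3
    volume=$PWD/data  # mount a volume so weights aren't re-downloaded on every run
    docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
        ghcr.io/huggingface/text-generation-inference:latest --model-id $model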