triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

Memory usage is doubled when loading a fp16 model into bf16 #164

Open skyser2003 opened 1 year ago

skyser2003 commented 1 year ago

Description

Model: Gpt-NeoX
GPU: A100
Tritonserver version: 22.12

Hello, I'm not sure whether this is a FasterTransformer issue or a backend issue, but I'm reporting it here.

As the title says, my model was originally trained in fp16 on Hugging Face, and I converted it to the FasterTransformer weight format.

This is the command I used for the conversion, and the size of the resulting folder:

python huggingface_gptneox_convert.py -o {output_dir} -i {hf_model_dir} -infer_gpu_num 1 -model_name neox_model -weight_data_type fp16
$ du -h -d 1
25G     ./1-gpu
25G     .

As the output shows, the converted FasterTransformer weight folder is 25 GB, and the original Hugging Face model is also 25 GB.
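
For reference, a rough back-of-the-envelope check (my own arithmetic, assuming the weights are stored as raw fp16 values) that the sizes line up:

# 25 GiB on disk at 2 bytes per fp16 value is roughly 13B parameters.
# The same parameters in fp32 (4 bytes each) would be ~50 GiB, which is
# suspiciously close to the memory usage reported below for bf16.
bytes_on_disk = 25 * 1024**3
print(bytes_on_disk / 2 / 1e9)      # ~13.4 billion parameters
print(bytes_on_disk * 2 / 1024**3)  # ~50 GiB if held in fp32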

The problem occurs when I load the model with tritonserver and fastertransformer_backend. When I load it as fp16, it loads fine:

I0906 06:12:34.269131 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation    : free: 78.56 GB, total: 79.15 GB, used:  0.60 GB
I0906 06:12:56.704958 83 libfastertransformer.cc:448] After Loading Weights:
after allocation    : free: 54.54 GB, total: 79.15 GB, used: 24.61 GB

But when I load it as bf16, it suddenly takes up twice as much memory:

I0906 06:10:11.016121 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation    : free: 78.56 GB, total: 79.15 GB, used:  0.60 GB
I0906 06:11:07.674020 83 libfastertransformer.cc:448] After Loading Weights:
after allocation    : free: 30.52 GB, total: 79.15 GB, used: 48.63 GB

I guess taking twice the memory means the weights are loaded as fp32. Does that mean you can't load a model saved as fp16 into bf16, or is it that the GPT-NeoX model just doesn't support the bf16 format?
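
For what it's worth, bf16 itself is the same width as fp16 (2 bytes per element), so the target type alone shouldn't explain the increase. A minimal PyTorch sketch, just to illustrate the element sizes (not how FasterTransformer actually loads weights):

import torch

x_fp16 = torch.zeros(1024, dtype=torch.float16)
print(x_fp16.element_size())                     # 2 bytes
print(x_fp16.to(torch.bfloat16).element_size())  # 2 bytes, same cost as fp16
print(x_fp16.to(torch.float32).element_size())   # 4 bytes, matches the doubled usage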

Reproduced Steps

In config.pbtxt

For fp16

parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}

For bf16

parameters {
  key: "data_type"
  value: {
    string_value: "bf16"
  }
}
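
To rule out the checkpoint itself, one can also check that the converted .bin files really are 2 bytes per parameter. A minimal sketch (the folder path is the one from the conversion step above, and I'm assuming all weights live in *.bin files):

import glob
import os

folder = "1-gpu"  # converted FasterTransformer output from the command above

total = sum(os.path.getsize(p) for p in glob.glob(os.path.join(folder, "*.bin")))
print(f"{total / 1024**3:.1f} GiB of .bin weight files")
# ~25 GiB here, i.e. 2 bytes per parameter, so the checkpoint is stored as fp16
# and the extra memory must be allocated at load time inside the backend.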
devin12422 commented 5 months ago

Did you ever find a fix for this?

skyser2003 commented 5 months ago

@devin12422 No. Since FasterTransformer is deprecated and has been succeeded by TensorRT-LLM, I just used tensorrtllm_backend instead, and it seemed to work fine.