triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

Can't deploy multiple versions of BERT. #61

Closed ogis-uno closed 1 year ago

ogis-uno commented 1 year ago

Description

I tried to deploy multiple versions of BERT. For that, I removed "default_model_filename" and "model_checkpoint_path" from config.pbtxt (the layout I am aiming for is sketched after the warnings below). But when I started up Triton, I got many warning messages like the following.

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.LayerNorm.weight.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.attention.self.query.weight.0.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.dense.bias.bin cannot be opened, loading model fails! 
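For context, the multi-version repository layout I am aiming for is roughly the following (the version 2 directory is hypothetical here; only version 1 exists in the reproduction below). Since config.pbtxt is shared by all versions, it cannot hard-code a single checkpoint path:

all_models/bert/fastertransformer/
├── config.pbtxt
├── 1/
│   └── 1-gpu/    (config.ini and model.encoder.layer.*.bin for version 1)
└── 2/
    └── 1-gpu/    (config.ini and model.encoder.layer.*.bin for version 2, hypothetical)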

Environment

fastertransformer_backend: 225b57898b830
FasterTransformer: f59e237c247
Docker version: 20.10.17, build 100c701
NVIDIA Driver Version: 515.65.01
GPU: Tesla T4 x 1

Reproduced Steps

1. Clone the fastertransformer_backend repo.

$ git clone https://github.com/triton-inference-server/fastertransformer_backend.git
$ cd fastertransformer_backend
$ export WORKSPACE=$(pwd)
$ export CONTAINER_VERSION=22.07
$ export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
2. Build the container image.
$ docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION}   \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
3. Run the container.
$ docker run -it --rm --gpus=all -v ${WORKSPACE}:${WORKSPACE} -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
4. Convert Hugging Face's BERT (inside the container).
# export WORKSPACE=$(pwd)

# sudo apt-get install git-lfs
# git lfs install
# git lfs clone https://huggingface.co/bert-base-uncased # Download model from huggingface
# git clone https://github.com/NVIDIA/FasterTransformer.git # To convert checkpoint
# export PYTHONPATH=${WORKSPACE}/FasterTransformer:${PYTHONPATH}
# python3 FasterTransformer/examples/pytorch/bert/utils/huggingface_bert_convert.py \
        -in_file bert-base-uncased/ \
        -saved_dir ${WORKSPACE}/all_models/bert/fastertransformer/1/ \
        -infer_tensor_para_size 1
5. Modify config.pbtxt (change "tensor/pipeline_para_size" to 1, and remove "default_model_filename" and "model_checkpoint_path").
# sed -i -e 's/string_value: "2"/string_value: "1"/' -e "30d" -e "88,93d" all_models/bert/fastertransformer/config.pbtxt

# git diff all_models/bert/fastertransformer/config.pbtxt 
diff --git a/all_models/bert/fastertransformer/config.pbtxt b/all_models/bert/fastertransformer/config.pbtxt
index e18d66f..3a8ed02 100644
--- a/all_models/bert/fastertransformer/config.pbtxt
+++ b/all_models/bert/fastertransformer/config.pbtxt
@@ -27,7 +27,6 @@

 name: "fastertransformer"
 backend: "fastertransformer"
-default_model_filename: "bert"
 max_batch_size: 1024
 input [
   {
@@ -58,13 +57,13 @@ instance_group [
 parameters {
   key: "tensor_para_size"
   value: {
-    string_value: "2"
+    string_value: "1"
   }
 }
 parameters {
   key: "pipeline_para_size"
   value: {
-    string_value: "2"
+    string_value: "1"
   }
 }
 parameters {
@@ -85,12 +84,6 @@ parameters {
     string_value: "bert"
   }
 }
-parameters {
-  key: "model_checkpoint_path"
-  value: {
-    string_value: "../all_models/bert/fastertransformer/1/2-gpu/"
-  }
-}
 parameters {
   key: "int8_mode"
   value: {
6. Start Triton and observe the warnings.
# ls all_models/bert/fastertransformer/
1  config.pbtxt
# ls all_models/bert/fastertransformer/1
1-gpu

# CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/bert/ 
...
[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.LayerNorm.weight.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.attention.self.query.weight.0.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.dense.bias.bin cannot be opened, loading model fails! 
...
byshiue commented 1 year ago

model_checkpoint_path is used to load the model. You cannot remove it.

ogis-uno commented 1 year ago

Thanks for the reply!

model_checkpoint_path is used to load the model. You cannot remove it.

Hmm, I checked libfastertransformer.cc.

If I set neither default_model_filename nor model_checkpoint_path, model_dir becomes ${repository_path}/${version}/${tensor_para_size}-gpu (in my case /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpu, with no trailing slash), and Triton tries to load from that model_dir.

https://github.com/triton-inference-server/fastertransformer_backend/blob/225b57898b830a13b5634ee10b812c96bad802b0/src/libfastertransformer.cc#L254-L266

I think the cause of the warnings is below; adding "/" to the head of "model.encoder.layer." seems to fix the problem.

https://github.com/NVIDIA/FasterTransformer/blob/f59e237c247e030f2d57c09b4820d2ee3be693da/src/fastertransformer/models/bert/BertWeight.h#L142-L143
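To make the difference concrete, here is a minimal standalone sketch (not the actual FT sources) of the two concatenation styles, using the paths from my setup:

#include <iostream>
#include <string>

int main()
{
    // model_dir as resolved by the backend default:
    // JoinPath({RepositoryPath(), std::to_string(Version()), model_filename})
    // -> no trailing slash
    std::string model_dir = "all_models/bert/fastertransformer/1/1-gpu";

    // BERT-style concatenation (no separator) reproduces the broken path in the warning
    std::cout << model_dir + "model.encoder.layer.0.output.LayerNorm.weight.bin" << "\n";
    // -> .../1/1-gpumodel.encoder.layer.0.output.LayerNorm.weight.bin

    // T5-style concatenation (file name starts with "/") yields the correct path
    std::cout << model_dir + "/model.encoder.layer.0.output.LayerNorm.weight.bin" << "\n";
    // -> .../1/1-gpu/model.encoder.layer.0.output.LayerNorm.weight.bin

    return 0;
}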

In T5, loading a model parameter looks like the code below, and I can load a T5 model without model_checkpoint_path.

https://github.com/NVIDIA/FasterTransformer/blob/f59e237c247e030f2d57c09b4820d2ee3be693da/src/fastertransformer/models/t5/T5EncoderWeight.cc#L241-L242

byshiue commented 1 year ago

model_dir is obtained from model_checkpoint_path:

 std::string model_dir = 
     param_get("model_checkpoint_path") == "" 
         ? JoinPath( 
               {RepositoryPath(), std::to_string(Version()), model_filename}) 
         : param_get("model_checkpoint_path"); 
ogis-uno commented 1 year ago

model_dir is obtained from model_checkpoint_path:

So you mean, if I want model_dir to come from JoinPath({RepositoryPath(), std::to_string(Version()), model_filename}), I should set model_checkpoint_path like below?

parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: ""
  }
}
byshiue commented 1 year ago

You should set it to the checkpoint path you put, like:

parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "../all_models/bert/fastertransformer/1/2-gpu/"
  }
}
ogis-uno commented 1 year ago

Thank you for the reply. So you mean I MUST set model_checkpoint_path, as a required parameter, to the directory that contains config.ini and model.encoder.layer*.bin?

If so, one more question (sorry for bothering you): what is the ternary operator on line 264 for? Will something go wrong if line 265 is executed?

https://github.com/triton-inference-server/fastertransformer_backend/blob/225b57898b830a13b5634ee10b812c96bad802b0/src/libfastertransformer.cc#L262-L266

byshiue commented 1 year ago

config.ini is necessary (it is used to set up the model hyper-parameters), but model.encoder.layer*.bin are not (they are weights; if the program does not find them, it will generate random weights automatically).

In line 264, if you don't set model_checkpoint_path, it will try to load the model from the default path JoinPath({RepositoryPath(), std::to_string(Version()), model_filename}).

ogis-uno commented 1 year ago

Hi, thank you for the answer, and sorry for the somewhat long reply.

config.ini is necessary (it is used to set up the model hyper-parameters), but model.encoder.layer*.bin are not (they are weights; if the program does not find them, it will generate random weights automatically).

I think that's what happened in my case, and randomly generated weights are useless for inference.

In line 264, if you don't set model_checkpoint_path, it will try to load the model from the default path JoinPath({RepositoryPath(), std::to_string(Version()), model_filename}).

Yes, that's what I want to do. My intention is as shown below; in this configuration, I can't write a model_checkpoint_path in config.pbtxt (a single hard-coded path would not work across versions).

My current directory structure is below. I think config.ini and its weights exist in the correct place. I ran 3 tests again, with and without model_checkpoint_path.

# model-repository would be ...
root@7d6490ff95ca:/home/uno/fastertransformer_backend# echo ${WORKSPACE}/all_models/bert/ 
/home/uno/fastertransformer_backend/all_models/bert/

# config.pbtxt exists in fastertransformer under the model repository.
root@7d6490ff95ca:/home/uno/fastertransformer_backend# ls all_models/bert/fastertransformer/
1  config.pbtxt

# Version 1 of fastertransformer has its content
root@7d6490ff95ca:/home/uno/fastertransformer_backend# ls all_models/bert/fastertransformer/1
1-gpu

# Contents of version 1 of fastertransformer.
# It has config.ini and model parameters (model.encoder.layer.*.bin)
root@7d6490ff95ca:/home/uno/fastertransformer_backend# ls all_models/bert/fastertransformer/1/1-gpu/ | head -4
config.ini
model.encoder.layer.0.attention.output.LayerNorm.bias.bin
model.encoder.layer.0.attention.output.LayerNorm.weight.bin
model.encoder.layer.0.attention.output.dense.bias.bin

Case 0. Start up Triton with model_checkpoint_path, as you suggest. Yes, Triton started up without warnings; it can find config.ini and the model weights.

root@7d6490ff95ca:/home/uno/fastertransformer_backend# tail -6 all_models/bert/fastertransformer/config.pbtxt 
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "/home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpu/"
  }
}

root@7d6490ff95ca:/home/uno/fastertransformer_backend#  CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/bert/
I1027 00:21:22.152144 169 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f733a000000' with size 268435456
I1027 00:21:22.154672 169 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1027 00:21:22.161613 169 model_repository_manager.cc:1206] loading: fastertransformer:1
I1027 00:21:22.234861 169 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I1027 00:21:22.234888 169 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.10
I1027 00:21:22.234900 169 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.10
I1027 00:21:22.234940 169 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1027 00:21:22.237616 169 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I1027 00:21:22.237650 169 libfastertransformer.cc:248] Sequence Batching: disabled
I1027 00:21:22.237748 169 libfastertransformer.cc:420] Before Loading Weights:
after allocation    : free: 14.43 GB, total: 14.62 GB, used:  0.20 GB
I1027 00:21:24.061989 169 libfastertransformer.cc:430] After Loading Weights:
after allocation    : free: 14.19 GB, total: 14.62 GB, used:  0.43 GB
...

Case 1. Start up Triton with model_checkpoint_path but without a trailing slash ("1-gpu", not "1-gpu/"). I got warnings. Triton can find config.ini but can't find the model weights.

root@7d6490ff95ca:/home/uno/fastertransformer_backend# tail -6 all_models/bert/fastertransformer/config.pbtxt 
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "/home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpu"
  }
}

root@7d6490ff95ca:/home/uno/fastertransformer_backend# CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/bert/
I1027 00:24:54.048687 213 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fb072000000' with size 268435456
I1027 00:24:54.051236 213 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1027 00:24:54.058875 213 model_repository_manager.cc:1206] loading: fastertransformer:1
I1027 00:24:54.130787 213 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I1027 00:24:54.130815 213 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.10
I1027 00:24:54.130829 213 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.10
I1027 00:24:54.130865 213 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1027 00:24:54.132790 213 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I1027 00:24:54.132813 213 libfastertransformer.cc:248] Sequence Batching: disabled
I1027 00:24:54.132906 213 libfastertransformer.cc:420] Before Loading Weights:
after allocation    : free: 14.43 GB, total: 14.62 GB, used:  0.20 GB
[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.LayerNorm.weight.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.attention.self.query.weight.0.bin cannot be opened, loading model fails! 
...

Case 2. Start up Triton without model_checkpoint_path (what I want to do). Same result as case 1: I got warnings. Triton can find config.ini but can't find the model weights.

root@7d6490ff95ca:/home/uno/fastertransformer_backend# grep -e "model_checkpoint_path" all_models/bert/fastertransformer/config.pbtxt
root@7d6490ff95ca:/home/uno/fastertransformer_backend#

root@7d6490ff95ca:/home/uno/fastertransformer_backend# CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/bert/
I1027 00:13:01.509585 118 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f40aa000000' with size 268435456
I1027 00:13:01.513728 118 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1027 00:13:01.542854 118 model_repository_manager.cc:1206] loading: fastertransformer:1
I1027 00:13:02.022773 118 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I1027 00:13:02.022823 118 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.10
I1027 00:13:02.022833 118 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.10
I1027 00:13:02.022949 118 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1027 00:13:02.028285 118 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I1027 00:13:02.028325 118 libfastertransformer.cc:248] Sequence Batching: disabled
I1027 00:13:02.031406 118 libfastertransformer.cc:420] Before Loading Weights:
after allocation    : free: 14.43 GB, total: 14.62 GB, used:  0.20 GB
[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.LayerNorm.weight.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.attention.self.query.weight.0.bin cannot be opened, loading model fails! 
...

I think the cause of the warnings is here:

https://github.com/NVIDIA/FasterTransformer/blob/f59e237c247e030f2d57c09b4820d2ee3be693da/src/fastertransformer/models/bert/BertWeight.h#L139-L143
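One way to make this robust, regardless of whether the caller writes a trailing slash, would be to normalize the directory before the file name is appended. A hypothetical helper (a sketch only, not necessarily how FT will fix it):

#include <string>

// Hypothetical helper: guarantee a trailing '/' on the directory path
// before the per-tensor file name is appended.
static std::string with_trailing_slash(const std::string& dir)
{
    return (!dir.empty() && dir.back() == '/') ? dir : dir + "/";
}

// Usage sketch: pass with_trailing_slash(dir_path) + "model.encoder.layer.<...>.bin"
// to the weight loader instead of dir_path + "model.encoder.layer.<...>.bin".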

byshiue commented 1 year ago

Got it. I have updated the FT code. Can you rebuild the Docker image and try again?

ogis-uno commented 1 year ago

Got it. I have updated the FT code. Can you rebuild the Docker image and try again?

I have rebuilt the Docker image and re-run the tests. All of cases 0, 1, and 2 work fine, without any "loading model fails!" warnings.

Thank you for your help!